Wednesday, December 4, 2013

Too Much Data!

“Die Guten ins Töpfchen, die Schlechten ins Kröpfchen” (the good into the pot, the bad into the crop, in Aschenputtel (Cinderella) by Grimm)

1 Situation and Problem

In the past (say, the 1990s) we had small data files and small disks. I had a Macintosh with 128k! Today we have a great number of large files (photos are typically 4 MB, movies even 4 GB) and much larger hard disks. Gone are magnetic tapes and similar external media (who remembers magneto-optical disks?), replaced by external hard disks with USB connectors.
Regular backup is still necessary, even if you use cloud services! Over the past years I bought about one hard disk per year, with capacity increasing from 300 GB to 3 TB, to hold (differential) backups of all my computers. The utility rdiff-backup served me well; files were easy to recover and it never failed me.
The questions for which I did not have an answer: How long should backups be kept? How should they be used? How can I decide that a backup is no longer relevant and can be discarded? How can I know whether a backup contains a valuable file not kept anywhere else? Even if all files are stored, finding a “lost” file is practically impossible.
This year, I decided not to simply buy “yet another hard disk” but to clean up. A very cursory analysis showed that I had approximately 10 terabytes of data on disks; it was mostly copies of the same files. The copies were produced because all 4 of my synchronized computers contain more or less the same files and I started a fresh backup every year. The number of duplicates of a file is therefore approximately the number of years times the number of computers.
The task is simply (1) to extract one copy of each different valuable file, keep it in a safe place, and then discard the remainder, and (2) to invent a policy to avoid creating the same mess again. Step one is discussed here; step two will be discussed in a following blog post. To start, two difficult definitions are required:
  1. When are two files equal?
  2. What is a valuable file?
The solution will produce a set of valuable files to keep in a “keepFolder”, which replaces all the backups. The keepFolder should contain only one copy of each valuable file.

2 Doves: Design for a Cleanup Tool

Tools like Unison and rdiff-backup characterize files first by path and filename (“filepath” for short). Two files are only identified as “the same” if they have the same content and the same filepath. The effect is that any reorganisation of the file system (e.g., moving or renaming directories) is recorded as a deletion and an addition, which bloats the backup and stores each file twice (once as a deleted file, once as a new addition).
A backup must reflect the current file structure, but for the store of preserved copies of files deleted long ago, one copy of each file is sufficient. Equality of files is therefore defined here by content only, independent of filepath.
The cleanup has three steps:
  1. Create an inventory of files.
  2. Identify a set of unique files.
  3. Move the unique files to a safe place (the keepFolder).
Two files are equal if they have the same content, independent of name or location in a directory. In practice, a file is characterized by its MD5 digest, a 128-bit checksum computed from the file contents. The more collision-resistant SHA-1, which gives a 160-bit digest, could be used instead, but the goal here is only the detection of files with the same content.
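To make the content-only definition of equality concrete, here is a minimal sketch in Python (not the Haskell of the doves program itself). The function name md5_of_file and the chunk size are assumptions for illustration; only the standard hashlib module is used.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file's content (hypothetical helper)."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so that a 4 GB movie never sits in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with the same bytes give the same digest regardless of their names or directories, which is exactly the notion of equality used here.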
Valuable files are defined negatively by a list of file extensions and a list of directories: files with an extension in the list and all files in a listed directory are non-valuable (e.g. files in directories called “Trash” or “tmp” and files with extensions “bak”, “tmp”, etc.). Unfortunately, programmers are inventive with new names for ephemeral files; the XDG proposal to select specific places for ephemeral files is an important step in the right direction. Files which are “special” and cannot be read to compute an MD5 value, e.g. pipes, links, broken pipes and corrupted non-readable files, are also considered non-valuable.
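The negative definition can be sketched as a small filter. This is an illustration in Python, not the actual doves code; the two lists contain only the examples named above, and the names NON_VALUABLE_DIRS, NON_VALUABLE_EXTENSIONS and is_valuable are hypothetical. The case-insensitive extension match is also an assumption.

```python
from pathlib import PurePath

# Hypothetical exclusion lists; a real list would be much longer.
NON_VALUABLE_DIRS = {"Trash", "tmp"}
NON_VALUABLE_EXTENSIONS = {"bak", "tmp"}

def is_valuable(filepath):
    """Return False if the file lies in an excluded directory or has an excluded extension."""
    path = PurePath(filepath)
    # Check every directory component of the path, but not the file name itself.
    if any(part in NON_VALUABLE_DIRS for part in path.parts[:-1]):
        return False
    # Compare extensions without the leading dot, ignoring case (an assumption).
    extension = path.suffix.lstrip(".").lower()
    return extension not in NON_VALUABLE_EXTENSIONS
```

Special files (pipes, links and so on) are not covered by this filter; they have to be detected from the file system itself.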

3 Operations of the Doves Program

The program to collect the unique files is called “doves”. The operations are:
collect (-c): collect the descriptions of all files and directories in a given directory and store them in a doves file (extension .doves).
base (-b): extract the MD5 digests from a doves file (i.e. all the file descriptors for a directory) into an md5 file.
process (-p): compare a given doves file (i.e. all files in a directory) with the files in base, i.e. an md5 file; the files not in base are listed in a keep file.
keep (-k): copy all the files listed in the keep file into a keep directory and update the base md5 file with their MD5 digests (so that these files are skipped in the next directory processed).
One starts by collecting the information for all files currently in the keep folder and producing a base md5 file. The sequence of steps (collect, process against base, keep) then extracts from a directory all files which are not already in the keep folder and puts them there. It is repeated for each folder of interest.
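The core of the process and keep steps is a set difference on digests. The following Python sketch works on in-memory data instead of doves and md5 files; the function names and data shapes are assumptions for illustration, as is the choice to skip duplicates within the same inventory.

```python
def process(inventory, base):
    """Return the (filepath, digest) pairs whose digest is not yet in base.

    inventory: iterable of (filepath, md5_digest) pairs, as a collect step might record them.
    base: set of digests of the files already in the keep folder.
    """
    seen = set(base)  # work on a copy so the caller's base is left untouched
    keep = []
    for filepath, digest in inventory:
        if digest not in seen:
            seen.add(digest)  # later copies in the same inventory are skipped
            keep.append((filepath, digest))
    return keep

def update_base(base, keep):
    """Return base extended with the digests of the files just kept."""
    return set(base) | {digest for _, digest in keep}
```

Running process on one directory after another, each time against the updated base, leaves one copy per distinct content in the keep folder.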

4 Programming

Most of the necessary tools, in particular functions to compute MD5 digests or SHA-1 checksums, are available in Haskell.
The difficulty with programming (in Haskell) was my lack of experience with programming for “large” datasets. The “naive” approaches, which rely on the “natural” laziness of Haskell, work only for tasks of reasonable size; collecting MD5 digests for the million-plus files in 500 GB of backup data tends to exhaust resources. A lazy approach opens all files as quickly as possible (breadth first) and crashes when the maximum number of open files is reached (on Linux this is typically limited to about 1000 open files per process). The Pipes package gives a consistent, generic way to sequence actions so that resources are not exhausted, and leads to a program constructed as a sequence of smaller actions composed into a pipe. A sequential (sweeping) approach means dealing with each file individually and excludes approaches which want to process all data at once.
I was also not prepared to deal with the many different ways a file can be “special”, i.e. not readable to produce an MD5 value. It was necessary to test for such exceptions at the first point possible and to exclude “special files” from further processing.
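The sweeping idea can be illustrated with a Python generator, which plays a role loosely similar to a pipe stage: each file is opened, hashed and closed before the next one is touched. This is a sketch under assumptions, not the Pipes-based Haskell code; the name sweep and the rule for what counts as special (anything that is not a regular file, or that raises an OSError when read) are mine.

```python
import hashlib
import os
import stat

def sweep(root):
    """Yield (filepath, md5_hex) for each readable regular file under root, one at a time."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            filepath = os.path.join(dirpath, name)
            try:
                # lstat does not follow symbolic links, so a link is seen as a link.
                if not stat.S_ISREG(os.lstat(filepath).st_mode):
                    continue  # pipes, links, sockets, devices: excluded before opening
                digest = hashlib.md5()
                with open(filepath, "rb") as handle:  # at most one file open at any time
                    for chunk in iter(lambda: handle.read(1 << 20), b""):
                        digest.update(chunk)
            except OSError:
                continue  # unreadable or vanished file: treated as non-valuable
            yield filepath, digest.hexdigest()
```

Because the file is closed before its result is yielded, the number of open file handles stays constant no matter how many files the directory tree contains.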

5 Experience

Handling directories with hundreds of gigabytes is time-consuming, especially if they are stored on external hard disks connected by USB 2.0. Just copying data between two disks takes about 4 minutes per GB; reading and computing the MD5 digest is a bit faster. Deleting 100 GB from a disk can take 10 minutes (and over an hour for 1 TB).
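To see what these rates mean in practice, a back-of-the-envelope estimate helps. The sketch below assumes the copy rate of 4 minutes per GB stays constant over the whole job, which is a simplification; the function name copy_hours is hypothetical.

```python
def copy_hours(gigabytes, minutes_per_gb=4.0):
    """Estimate the copy time in hours at a constant rate (a simplifying assumption)."""
    return gigabytes * minutes_per_gb / 60.0

# 500 GB at 4 minutes per GB is 2000 minutes.
hours_for_500_gb = copy_hours(500)
```

At this rate a single 500 GB backup takes roughly 33 hours to copy, so every pass over the data has to be planned in days rather than minutes.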
The reduction is substantial. Overall, of the total 10 TB of backup data only 1 TB is left. This is still more than what I expect to keep long term, which is perhaps 200 GB; my music collection is about 50 GB and photos fill less than 100 GB, the rest is much less.

6 Open Questions

Pictures and music make up a very large part of the files we collect today; keeping only one copy of each takes, for me, about 100 GB of disk space. Unfortunately, it is hard to get rid of all copies, as different music and picture managing programs stick some additional bits of information into the file: tags for pictures, information about genre for music, and probably other things. The result is that the file is slightly different and has a different fingerprint; it is not recognized as a copy, and thus the different near-copies are all stored separately. This is left for another reduction effort with a different approach; in this pass I did not consider different file types, except to exclude “non-valuable” files.