BigSync is a tool that backs up large files. But how does it work? Let me give you an overview.
What you get when you install BigSync
BigSync consists of a simple Perl script, bigsync, which is installed as bigsplit and bigjoin, and a couple of Perl modules, BigSync::Split and BigSync::Join. This means you can use BigSync from your own Perl scripts, and you can even extend BigSync with your own code and features by inheriting from these modules. Most people, however, will just call the Perl scripts bigsplit (to back up a file) or bigjoin (to restore a file) from the command line.
Be careful: bigsplit does not make sure that the file remains internally consistent while the backup is running (and the backup may run for a while). If you want to back up a large database file or a virtual machine image while the programs that use it are running and accessing the file, you will probably have to create an LVM snapshot of the volume the files are stored on; otherwise your backup copy will almost surely be corrupt.
Features of BigSync
BigSync comes with a number of interesting features:
- Is implemented completely in Perl; the full functionality is available as Perl modules.
- Can use ssh to work over the network.
- Compresses the backup files. For virtual machine images, this means the backup needs 2-3 times less disk space than the original file.
- Can create backups incrementally using hardlinks. If you already have a former backup of a file and you wish to make another backup (because the file has changed since the last backup run), you can specify your old backup, and BigSync will try to reuse those parts of the old backup that have not changed since the last run.
- Handles sparse files efficiently.
And how does it work?
When BigSync is started, it cuts the file into pieces of 10 megabytes (chunks). For each chunk, an MD5 sum is computed, and the serial number of the chunk (1, 2, 3, ..., 9999999, ...) is determined.
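The chunking step can be sketched as follows. This is a minimal Python illustration, not BigSync's own Perl code: the name iter_chunks is hypothetical, and reading "10 megabytes" as 10 MiB is an assumption.

```python
import hashlib

# Assumption: "10 megabytes" is read here as 10 MiB.
CHUNK_SIZE = 10 * 1024 * 1024


def iter_chunks(path, chunk_size=CHUNK_SIZE):
    """Yield (serial, md5_hex, data) for each chunk of the file at path.

    iter_chunks is a hypothetical helper, not part of BigSync.
    """
    with open(path, "rb") as handle:
        serial = 1  # the text numbers chunks 1, 2, 3, ...
        while True:
            data = handle.read(chunk_size)
            if not data:
                break  # end of file reached
            yield serial, hashlib.md5(data).hexdigest(), data
            serial += 1
```

The last chunk is usually shorter than the others, because the file size is rarely an exact multiple of the chunk size.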
If the MD5 sum shows that this chunk consists of zeros only (which is the case for large parts of sparse files), an empty file is saved for this chunk. This leads to very low disk space consumption for sparse files.
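One way to recognize an all-zero chunk from its MD5 sum is to compare it with the MD5 sum of a zero-filled buffer of the same length. The Python sketch below illustrates that idea only; zero_md5 and is_zero_chunk are hypothetical names, and how BigSync performs the check internally is not stated here.

```python
import hashlib
from functools import lru_cache


@lru_cache(maxsize=None)
def zero_md5(length):
    """MD5 hex digest of `length` zero bytes, cached per length."""
    return hashlib.md5(bytes(length)).hexdigest()


def is_zero_chunk(data, md5_hex):
    """Return True if md5_hex equals the digest of an all-zero chunk
    of the same length as data (hypothetical helper)."""
    return md5_hex == zero_md5(len(data))
```

Caching by length matters because the final chunk of a file is usually shorter than the rest, so its all-zero digest differs from that of a full-size chunk.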
If, however, the chunk does not consist of zeros only, its file name ("serial number"."MD5 sum") is created, and BigSync determines whether this chunk "address" already exists in the specified former backup. A match corresponds to a part of the file that has not changed since the last backup, which is a very common case with disk images, for example. If such a chunk exists, it is hardlinked into the new backup; if no such chunk exists, a new file is created. In both cases the chunk is placed in a directory hash structure that ensures that no directory of the backup holds more than 256 entries (Unix filesystems become very slow once the number of entries in a directory grows large).
Right now, the configuration of the directory hash allows the backup of files and images of up to 4 terabytes in size.
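The reuse of unchanged chunks can be sketched like this in Python. Everything specific in it is an assumption made for illustration: the two-level directory layout, the "." separator in the file name, the use of zlib for compression, and the names chunk_relpath and store_chunk. BigSync's actual layout and compression format may differ, and all-zero chunks are left out of the sketch.

```python
import os
import zlib

FANOUT = 256  # from the text: at most 256 entries per directory


def chunk_relpath(serial, md5_hex):
    """Hypothetical layout: two directory levels derived from the serial."""
    index = serial - 1
    if not 0 <= index < FANOUT ** 3:
        raise ValueError("serial number outside this sketch's layout")
    top = index // (FANOUT * FANOUT)
    mid = (index // FANOUT) % FANOUT
    return os.path.join(f"{top:02x}", f"{mid:02x}", f"{serial}.{md5_hex}")


def store_chunk(serial, md5_hex, data, new_backup, old_backup=None):
    """Hardlink the chunk from old_backup if it exists there, else write it."""
    rel = chunk_relpath(serial, md5_hex)
    target = os.path.join(new_backup, rel)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    if old_backup is not None:
        previous = os.path.join(old_backup, rel)
        if os.path.exists(previous):
            os.link(previous, target)  # unchanged chunk: no new data stored
            return "linked"
    with open(target, "wb") as out:
        out.write(zlib.compress(data))  # assumption: zlib-compressed chunks
    return "written"
```

Because the serial number and the MD5 sum are both part of the name, a chunk is reused only if the same content sits at the same position in the file. Hardlinks also require the old and the new backup to be on the same filesystem.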
Restoring a file simply means reading the files in the directory hash one after the other, decompressing them (except for the empty files that stand for all-zero chunks), and writing the result to the file image.
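A restore along these lines might look like the Python sketch below. It rests on several assumptions that are not taken from BigSync itself: every chunk file name starts with the serial number followed by a ".", chunks are zlib-compressed, and an empty file stands for a full-size chunk of zeros. The function name restore is hypothetical.

```python
import os
import zlib

# Assumption: "10 megabytes" is read here as 10 MiB.
CHUNK_SIZE = 10 * 1024 * 1024


def restore(backup_dir, output_path, chunk_size=CHUNK_SIZE):
    """Rebuild output_path from the chunk files found under backup_dir."""
    chunks = []
    for root, _dirs, files in os.walk(backup_dir):
        for name in files:
            # Assumption: the name begins with the serial number and a dot.
            serial = int(name.split(".", 1)[0])
            chunks.append((serial, os.path.join(root, name)))
    chunks.sort()  # numeric order, so chunk 10 follows chunk 9
    with open(output_path, "wb") as out:
        for _serial, path in chunks:
            with open(path, "rb") as handle:
                stored = handle.read()
            if stored:
                out.write(zlib.decompress(stored))
            else:
                out.write(bytes(chunk_size))  # empty file: all-zero chunk
```

Writing the zeros explicitly keeps the sketch short. A tool that wants the restored file to stay sparse could seek past the zero range instead of writing it.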
What can you use BigSync for?
You can back up virtually any large file with BigSync. This of course also means that you can back up an entire hard disk by specifying the disk device file. Note, however, that you cannot back up more than one file with bigsync at a time. If you wish to back up an entire directory hierarchy, you can use the find command to find all the large files in this hierarchy and then use BigSync for each file found. The smaller files can be backed up using rsync.
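As an alternative to the find command, the split into large and small files can be sketched in Python. The function name split_by_size and the 100 MiB threshold are assumptions chosen for illustration; the sketch only builds the two lists and does not invoke bigsplit or rsync.

```python
import os


def split_by_size(top, threshold=100 * 1024 * 1024):
    """Return (large, small) lists of regular files found below top.

    The threshold is a hypothetical cut-off, not a BigSync setting.
    """
    large, small = [], []
    for root, _dirs, files in os.walk(top):
        for name in files:
            path = os.path.join(root, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks and special files
            if os.path.getsize(path) >= threshold:
                large.append(path)
            else:
                small.append(path)
    return sorted(large), sorted(small)
```

Each path in the first list would then be backed up separately with BigSync, and the second list could be handed to rsync.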
Not the bigsync you were looking for? See "bigsync, the utility for incremental backup of large files to a slow media" by Egor Egorov at http://code.google.com/p/bigsync/