jLuger.de - Backup System

The starting situation
In the past I tried a lot of operating systems. Normally I left the old one as it was and installed the new one into a free partition (which I had after upgrading the HD). The most important data was copied to the new OS while the rest stayed where it was. I played this game on several computers until the day I decided that one OS is enough. I installed it on a new computer and copied over all the data that seemed important and that I could grab easily. Since then I have had several old computers and HDs holding old data that was never backed up. As HDs don't get better over time and they may contain data that I will need some day, I wanted a backup solution.

The requirements
First ideas

The last requirement (Data accessible everywhere) caused the most problems, as FAT, with its 4 GiB file size and 2 TiB partition limits, is the only file system that can be used well on all OSes. Some movies and most virtual machine images are larger than 4 GiB, and modern hard drives are larger than 2 TiB. Using multiple partitions would only buy a few years and is nothing like a future-proof solution. I did quite some research but didn't find a solution. In order to move on I decided to use this temporary fix. The 4 GiB limit, in contrast, wasn't much of a problem, as the software could create containers below that size and save its data within them.
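The container idea can be sketched in a few lines of Python. This is only an illustration of splitting data into parts that stay below the FAT file size limit, not the software described here; the function names and the `.partNNNN` naming scheme are assumptions made up for the example.

```python
import os

# FAT32 cannot store a single file larger than 4 GiB minus one byte.
FAT_MAX_FILE_SIZE = 4 * 1024**3 - 1


def split_into_containers(source_path, target_dir,
                          max_size=FAT_MAX_FILE_SIZE, buffer_size=1024 * 1024):
    """Copy source_path into numbered part files of at most max_size bytes."""
    part_paths = []
    base_name = os.path.basename(source_path)
    with open(source_path, "rb") as source:
        index = 0
        while True:
            part_path = os.path.join(target_dir, f"{base_name}.part{index:04d}")
            written = 0
            with open(part_path, "wb") as part:
                # Fill the current part until it is full or the source ends.
                while written < max_size:
                    chunk = source.read(min(buffer_size, max_size - written))
                    if not chunk:
                        break
                    part.write(chunk)
                    written += len(chunk)
            if written == 0:
                # Nothing was left to copy, so drop the empty part file.
                os.remove(part_path)
                break
            part_paths.append(part_path)
            index += 1
            if written < max_size:
                # The source ended inside this part.
                break
    return part_paths


def join_containers(part_paths, target_path, buffer_size=1024 * 1024):
    """Concatenate the part files, in the given order, into target_path."""
    with open(target_path, "wb") as target:
        for part_path in part_paths:
            with open(part_path, "rb") as part:
                while True:
                    chunk = part.read(buffer_size)
                    if not chunk:
                        break
                    target.write(chunk)
```

Calling `split_into_containers("vm.img", "/mnt/fat")` (both paths hypothetical) would leave only files that FAT can hold, and `join_containers` restores the original on a file system without the limit.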

The next question was how to get access to the unencrypted data. TrueCrypt provides drivers for its encrypted containers. Writing and maintaining drivers for multiple OSes isn't something a single person can do in their free time (at least not with other projects and hobbies too). So I needed a solution that works out of the box on all OSes. The natural solution for this is WebDAV. There are clients for it on all the OSes I use.
While I was thinking about this idea, someone else released software that used it to encrypt the data on USB sticks. The software is called SecurStick. I found out about it in a magazine. One issue later, readers pointed out that there is a problem with the WebDAV client of Windows. See this Knowledge Base entry. I decided to drop Windows support for the time being, as I hadn't used Windows for quite a while.

I found a nice WebDAV library that provides a servlet which transforms all WebDAV requests into method calls on an interface. I "only" had to implement the interface and configure the servlet so that my implementation was used. Based on this and a minimal Jetty I created a server that stores files encrypted on the hard disk and the file system in a database (encrypted too).

This seemed to work very well. I could upload a file and, after downloading it again, saw no difference between the two versions. One day I wanted to present this to a friend and used a music file, which I dragged from the mounted share directly into my (software) MP3 player to demonstrate that you don't need to copy the data unprotected to your hard drive. The player refused to play the file. I ran a diff, but the file was returned exactly as it had been uploaded. To make things worse, it worked when Apache served the file via WebDAV. Some network sniffing showed that the player started loading the file but then gave up.

That was when I abandoned this solution and asked myself whether there is a Linux-only solution that works out of the box. While searching for one I found lessfs, a FUSE-based file system that deduplicates your data. To do the deduplication, lessfs splits your files into blocks and searches for duplicate blocks so that each block is stored only once. The bad thing is that it uses hashes to recognize duplicate blocks, and only hashes. There is no second verification. If a block of your very important document has the same hash as a movie block you saved before, it won't be stored, and on retrieval the document will contain a part of the movie, which means that your document is corrupted. I have to admit that the chance of this happening is very, very low, but it exists.
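The difference between hash-only deduplication and deduplication with a second verification can be shown with a toy block store. This is a sketch of the general technique only, not how lessfs is implemented; the class `BlockStore` and its parameters are invented for the example, and the collision is forced with a deliberately useless hash function.

```python
import hashlib


class BlockStore:
    """Toy deduplicating block store (illustration only)."""

    def __init__(self, verify=False, hash_func=None):
        self.verify = verify
        # Default to SHA-256; the demo below swaps in a weak hash on purpose.
        self.hash_func = hash_func or (lambda data: hashlib.sha256(data).digest())
        self.blocks = {}  # hash -> list of distinct blocks sharing that hash

    def put(self, block):
        """Store a block and return a reference that get() accepts."""
        key = self.hash_func(block)
        bucket = self.blocks.setdefault(key, [])
        if not self.verify:
            # Hash-only: any block with a known hash is assumed to be a duplicate.
            if not bucket:
                bucket.append(block)
            return (key, 0)
        # Verify: compare the bytes before treating the block as a duplicate.
        for position, existing in enumerate(bucket):
            if existing == block:
                return (key, position)
        bucket.append(block)
        return (key, len(bucket) - 1)

    def get(self, reference):
        key, position = reference
        return self.blocks[key][position]


def weak_hash(data):
    """Deliberately bad hash: only the first byte counts, so collisions are easy."""
    return data[:1]


hash_only = BlockStore(verify=False, hash_func=weak_hash)
hash_only.put(b"movie block")
document_ref = hash_only.put(b"memo block")
print(hash_only.get(document_ref))  # returns the movie block: silent corruption

verified = BlockStore(verify=True, hash_func=weak_hash)
verified.put(b"movie block")
document_ref = verified.put(b"memo block")
print(verified.get(document_ref))  # returns the memo block as stored
```

With a real cryptographic hash the collision would be extremely unlikely, which is exactly the "very, very low" chance mentioned above; the verify step removes it at the cost of an extra comparison.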

While doing more research on lessfs I found a suggestion to use ZFS. ZFS supports data deduplication with a verify option, as well as encryption. Sounds great? Well, ZFS has many versions, and you need a pretty recent one to get encryption and deduplication. Solaris was no longer freely available, and the free version had a pretty uncertain future.

The solution

I decided to stay with a deduplicating file system, as it would easily meet some of my requirements. After some research I figured out that if I dropped ZFS's encryption, I could use ZFS-fuse on Linux or ZFS in FreeBSD 9 (not stable at that time). ZFS-fuse was said to be very slow, and I can confirm this. But there is still the option to move to FreeBSD 9 as soon as it is available. So for a start ZFS-fuse was OK.
Of course a file system needs an OS to run. Using a virtual machine for this has several advantages.
The following picture shows the architecture of the solution:
See description of the architecture below.
The hard drive (HD) of the virtual PC is split into an unencrypted boot partition and a partition managed by LVM. The unencrypted partition is needed for booting, but unfortunately it is also a great entry point for intruders. Whenever the virtual PC has been out of your control, check this partition for modifications. Someone may have installed a rootkit, etc.
LVM is used because on Ubuntu it is easier to set up LVM and encryption together than to set up encryption alone. But it turned out to have another great advantage: you can easily increase the size of the virtual HD when you get more physical HD space. Increasing the size of the primary virtual HD isn't easy; at least I haven't found an easy way. Copying the content from one virtual HD to another would currently take significantly longer than copying the virtual HD itself.
The encrypted partition is split into a system part where the guest OS is installed and a data part that is managed by ZFS-fuse.

Disadvantages
Before I write about the configuration, I first want to mention the disadvantages of this solution.

Configuration of the server
Now the server is configured but quite useless, as there is no way to exchange data. To exchange data I use one of the following three services:

Setting up an IMAP server normally isn't an easy task, but in this case it is dead simple. All you need to do is install the dovecot package and edit one line in /etc/dovecot/dovecot.conf: the mail_location entry. In the default config it is commented out. Remove the comment marker and provide a location where the mails should be stored. The location should be managed by ZFS, and every user with IMAP access needs read and write permission on it. Sounds scary? Well, there is the %u variable, which expands to the username, so you can give each user his or her own directory. Restart the service and each Unix user on your server has an IMAP account. You can also use SSL connections if you are willing to accept the self-signed certificate.
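To make the effect of %u concrete, here is a small Python sketch that mimics the substitution for a mail_location value. It is not Dovecot code and handles only %u, while Dovecot itself knows more variables; the path /tank/mail and the user names are assumptions chosen for the example.

```python
# Hypothetical value for the mail_location entry in /etc/dovecot/dovecot.conf.
# The "maildir:" prefix selects the Maildir format; /tank/mail is an assumed
# directory on the ZFS-managed data part.
MAIL_LOCATION = "maildir:/tank/mail/%u"


def expand_mail_location(template, username):
    """Mimic the %u substitution: every user gets a separate directory."""
    return template.replace("%u", username)


for user in ("alice", "bob"):  # made-up Unix user names
    print(user, "->", expand_mail_location(MAIL_LOCATION, user))
```

Each directory printed this way would have to exist on the ZFS part and be readable and writable by the matching user.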

To import data I installed the rsync daemon. The advantage of rsync is that you can resume the import of a partition: just start rsync again and it will continue where it stopped, or wherever it detects changed files. You don't have to transfer one gigabyte of a working directory for five MB of changes. To configure the daemon see this tutorial: https://help.ubuntu.com/community/rsync. It is quite good except for one issue: in the file /etc/rsyncd.conf use root for uid, or else the rsync daemon can't set the file permissions on the imports correctly.
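An import run against the daemon could be wrapped like this. It is a minimal sketch, not the script used for the imports described here; the host name, module name and paths are assumptions, and the module would have to be defined in /etc/rsyncd.conf on the server.

```python
import subprocess


def build_rsync_command(source_dir, host, module, subdir=""):
    """Build the rsync call that pushes source_dir to an rsync daemon module."""
    # A trailing slash makes rsync copy the directory's contents, not the
    # directory itself.
    source = source_dir.rstrip("/") + "/"
    destination = f"rsync://{host}/{module}/{subdir}"
    return [
        "rsync",
        "--archive",   # keep permissions, owners, timestamps and symlinks
        "--partial",   # keep partly transferred files so a rerun can reuse them
        "--verbose",
        source,
        destination,
    ]


def import_partition(source_dir, host, module, subdir=""):
    """Run the import; calling it again after an interruption resumes it."""
    command = build_rsync_command(source_dir, host, module, subdir)
    return subprocess.run(command, check=False).returncode


# Hypothetical names: an old partition mounted at /mnt/old, a backup server
# reachable as "backup", and a daemon module called "imports".
print(" ".join(build_rsync_command("/mnt/old", "backup", "imports", "laptop")))
```

Keeping owners and permissions on the server side is where the uid setting matters: the daemon can only apply them when it runs as root.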

What's next? Currently nothing, as I'm busy importing partitions. But the backup server could easily be extended to host a source code version control system, an SMB server, an AppleTalk server, or anything else.