In the past I tried a lot of operating systems. Normally I left the old one as it was and installed the new one into a free partition (which I had gained by upgrading the HD). The most important data was copied to the new OS while the rest stayed where it was. I played this game on several computers until the day I decided that one OS is enough. I installed it on a new computer and copied over all data that seemed important and that I could grab easily. Since then I have had several old computers and HDs holding old data that was never backed up. As HDs don't get better over time and they may contain data that I will need some day, I wanted a backup solution.
The requirements
- Backing up the old installations as they are: While most of the data is in the home directory, some software is in /usr/local and there are also some configuration tricks that may come in handy. Additionally, for some files I only remember what they are or where I got them from when I know the location and OS where they are stored.
- A folder system: I need a folder system to easily find my data.
- Deduplication: As written above, I have multiple copies of my important files. While multiple copies of office files don't take much storage space, my pictures (1 GB to 10 GB per journey) and my music do. Deduplication also solves the conflict between keeping the installations as they are and having a folder system. I can have a huge number of files twice (or more often) but still need to store them only once.
- Encryption: I've got a lot of private data on my HDs. As long as they are in my own flat the risk is acceptable. But a good backup system means that the data are also stored outside your flat (think of fire in your flat). While it increases data availability it decreases control over your data. Encryption is one way to get it back.
- Data accessible on every computer: I've still got several computers with different OSes. Of course I want my data available on all of them.
The last requirement (data accessible everywhere) caused the most problems, as FAT with its 4 GiB file size and 2 TiB partition limits is the only file system that works well on all OSes. Some movies and most virtual machine images are larger than 4 GiB, and modern hard drives are larger than 2 TiB. Using multiple partitions would only buy a few years and is nothing like a future-proof solution. I did quite some research on this but haven't found a solution. In order to move on I decided to use the temporary fix. The 4 GiB limit, in contrast, wasn't much of a problem, as the software could create containers below that size and store its data in them.
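The container idea can be illustrated with standard tools: a file larger than the FAT limit is cut into pieces below 4 GiB and put back together later. This is only a sketch using split and cat, not what the backup software described here does. The function names split_for_fat and join_from_fat and the default chunk size of 4000 MiB are assumptions chosen for the example.

```bash
#!/bin/bash
# Sketch: store a large file as chunks that each stay below the FAT file size limit.
# split_for_fat and join_from_fat are hypothetical helper names.

# FAT files must be smaller than 4 GiB (4096 MiB); 4000 MiB stays safely below that.
CHUNK_SIZE="${CHUNK_SIZE:-4000M}"

# split_for_fat FILE OUTDIR: write FILE as OUTDIR/<name>.part.aa, .ab, ...
split_for_fat() {
    local file="$1" outdir="$2"
    mkdir -p "$outdir"
    split -b "$CHUNK_SIZE" -- "$file" "$outdir/$(basename "$file").part."
}

# join_from_fat OUTDIR NAME TARGET: concatenate the chunks in order into TARGET
join_from_fat() {
    local outdir="$1" name="$2" target="$3"
    # The shell expands the glob in sorted order, which matches the split suffixes.
    cat "$outdir/$name.part."* > "$target"
}
```

A call like split_for_fat movie.iso /media/fatdisk/movie followed later by join_from_fat /media/fatdisk/movie movie.iso restored.iso would round-trip the file; the paths are placeholders.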
The next question was how to get access to the unencrypted data. TrueCrypt provides drivers for its encrypted containers. Writing and maintaining drivers for multiple OSes isn't something a single person can do in their free time (at least not with some other projects and hobbies too). So I needed a solution that works out of the box on all OSes. The natural solution for this is WebDAV. There are clients for all of the OSes I use.
While I was still thinking about the idea, someone else released software that uses it to encrypt the data on USB sticks. The software is called SecurStick. I found out about it in a magazine. One issue later, readers pointed out that there is a problem with the WebDAV client of Windows. See this Knowledge Base entry. I decided to drop Windows support for now, as I hadn't used Windows for quite some time.
I found a nice WebDAV library that provides a servlet which translates all WebDAV requests into method calls on an interface. I "only" had to implement the interface and configure the servlet to use my implementation. Based on this and a minimal Jetty, I created a server that stores the files encrypted on the hard disk and the file system structure in a database (encrypted too).
This seemed to work very well. I could upload a file, and after downloading it I saw no difference between the two versions. One day I wanted to present this to a friend and used a music file that was to be dragged from the mounted share directly into my (software) MP3 player, to demonstrate that you don't need to copy the data unprotected to your hard drive. The player refused to play the file. I ran a diff, but the file came back identical to the one I had uploaded. To make things worse, it worked when Apache served the file via WebDAV. Some network sniffing showed that the player starts loading the file but then quits.
That was the point where I stopped working on this solution and asked myself whether there is a Linux-only solution that works out of the box. While searching I found lessfs, a FUSE-based file system that deduplicates your data. To do so, lessfs splits your files into blocks and searches for duplicate blocks so that each block is stored only once. The bad thing is that it uses hashes to recognize duplicate blocks, and only hashes. There is no second verification. If a block of your very important document has the same hash as a movie block you saved before, it won't be stored, and on retrieval the document will contain a part of the movie, which means your document is corrupted. I have to admit that the chance of this happening is very, very low, but it exists.
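To make the difference between hash-only and verified deduplication concrete, here is a small sketch of the storing side. It is not how lessfs or ZFS are implemented. It cuts a file into fixed-size blocks, keeps each distinct block once under its SHA-256 hash, and compares the bytes whenever a hash is already present. The function name store_blocks, the block size and the use of sha256sum (GNU coreutils) are assumptions for illustration, and the sketch does not record the block order needed to rebuild a file.

```bash
#!/bin/bash
# Sketch of block-level deduplication with a verify step.
# store_blocks and its arguments are hypothetical names for illustration.

# store_blocks FILE STORE [BLOCK_SIZE]: add the blocks of FILE to the directory STORE
store_blocks() {
    local file="$1" store="$2" block_size="${3:-4096}"
    local work block hash
    work=$(mktemp -d)
    mkdir -p "$store"
    # -a 8 gives long suffixes so that large files do not exhaust the chunk names
    split -a 8 -b "$block_size" -- "$file" "$work/block."
    for block in "$work"/block.*; do
        [ -e "$block" ] || continue             # empty input produces no blocks
        hash=$(sha256sum "$block" | cut -d ' ' -f 1)
        if [ ! -e "$store/$hash" ]; then
            cp "$block" "$store/$hash"          # first time this content is seen
        elif ! cmp -s "$block" "$store/$hash"; then
            # Same hash but different bytes: the case a hash-only scheme misses.
            echo "hash collision for $hash" >&2
            rm -rf "$work"
            return 1
        fi
    done
    rm -rf "$work"
}
```

Without the cmp branch the function would behave like a hash-only scheme: a colliding block would silently be treated as already stored.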
While doing some more research on lessfs I found a suggestion to use ZFS. ZFS supports data deduplication with a verify option, and encryption. Sounds great? Well, ZFS has many versions and you need a pretty recent one to get encryption and deduplication. Solaris was no longer freely available and the free version had a pretty uncertain future.
The solution
I decided to stay with a deduplicating file system as it would easily solve some of my requirements. After some research I figured out that if I drop encryption by ZFS, I could use ZFS-fuse on Linux or ZFS in FreeBSD 9 (not stable at that time). ZFS-fuse was said to be very, very slow and I can confirm this. But there is still the option to upgrade to FreeBSD 9 as soon as it is available. So for a start ZFS-fuse was OK.
Of course a file system needs an OS to run. Using a virtual machine for this has several advantages:
- No trouble when updating your host software. As long as the server needs no contact to the Internet, or even to any computer other than your host, the security impact of old software is very low. Of course, skipping updates should only be done when the updates would break your data.
- Just copy some files and you've got a test system. Ideal for testing updates.
- Every computer that can execute the virtual machine can access the data.
- Native file systems for Windows (install Samba), Mac OS X (install Netatalk), or Linux (NFS or sshfs). If web clients should access the data, install Apache with WebDAV.
The hard drive (HD) of the virtual PC is split into an unencrypted boot partition and a partition managed by LVM. The unencrypted partition is needed for booting, but unfortunately it is also a great target for intruders. Whenever the virtual PC has been out of your control, check this partition for modifications. Someone may have installed a rootkit, etc.
LVM is used because on Ubuntu it is easier to set up LVM and encryption together than encryption alone. But it turned out to have another great advantage. You can easily grow the storage when you get more physical HD space, since LVM can span an additional virtual HD. Increasing the size of the primary virtual HD isn't easy, at least I haven't found an easy way. Copying the content from one virtual HD to another would currently take significantly longer than copying the virtual HD itself.
The encrypted partition is split into a system part where the guest OS is installed and a data part that is managed by ZFS-fuse.
Disadvantages
Before I write about the configuration, I first want to mention the disadvantages of this solution.
- The I/O is pretty slow. Writing is about 5 GB/h (yes, per hour). I hope that ZFS in FreeBSD will be faster.
- Currently there is no way to start it from the command line. Of course VirtualBox offers ways to start a VM from the command line, but then you have no way to enter the password for decrypting the HD. So you need an X window system where you can focus the window of the virtual PC in order to enter the password.
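To get a feeling for what a write rate of about 5 GB/h means when planning an import, a rough estimate helps. The helper below is a hypothetical sketch. It only divides the size by the rate and rounds up to whole hours, and it assumes the rate stays constant.

```bash
#!/bin/bash
# Rough import time at a given write rate. estimate_hours is a hypothetical helper.

# estimate_hours SIZE_GB [RATE_GB_PER_HOUR]: print whole hours, rounded up
estimate_hours() {
    local size_gb="$1" rate="${2:-5}"
    echo $(( (size_gb + rate - 1) / rate ))
}

estimate_hours 500    # a 500 GB partition at 5 GB/h, prints 100
```

So a single 500 GB partition keeps the server busy for roughly four days.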
Configuration
- Download Ubuntu server.
- Create a new virtual machine in VirtualBox. Choose any name you like, OS is Linux, Version is Ubuntu.
- Give it at least 1 GB of RAM. You don't want a virtual machine to start swapping.
- Create a new hard drive. By default a dynamically growing HD is selected. Don't be tempted to give it a larger size than you have physical space available. It will come back to bite you when your storage takes up all available space on your real HD.
- If the server should be accessible from machines other than the host, configure the network using the documentation http://www.virtualbox.org/manual/ch06.html#idp12173808. I decided that all my virtual machines should be in a private class A network. To realize this I created a tun/tap interface with tunctl. The virtual servers that should be accessible from the outside got addresses from 10.1.1.2 upwards. The following script sets up the translation between the private class A and private class C network addresses:
#!/bin/bash
# Create the tap interface for the virtual machines (owned by user joerg)
tunctl -t vbox0 -u joerg
ifconfig vbox0 10.1.1.1 up
# Add one alias address per virtual machine on eth0: 192.168.0.201 to 192.168.0.210
for i in {1..10}
do
    ifconfig eth0:$i 192.168.0.$((200 + i)) up
done
# Enable routing and reset all filter and NAT rules
echo "1" > /proc/sys/net/ipv4/ip_forward
iptables -P INPUT ACCEPT
iptables -F INPUT
iptables -P OUTPUT ACCEPT
iptables -F OUTPUT
iptables -P FORWARD ACCEPT
iptables -F FORWARD
iptables -t nat -F
# Map each outside address 192.168.0.(200+i) to the private address 10.1.1.(i+1)
for i in {1..10}
do
    iptables -t nat -A PREROUTING -d 192.168.0.$((200 + i)) -j DNAT --to-destination 10.1.1.$((i + 1))
    iptables -t nat -A POSTROUTING -s 10.1.1.$((i + 1)) -j SNAT --to-source 192.168.0.$((200 + i))
done
# Allow forwarding between the LAN and the virtual machines
iptables -A FORWARD -s 0/0 -i eth0 -o vbox0 -d 10.1.1.0/24 -j ACCEPT
iptables -A FORWARD -i vbox0 -o eth0 -j ACCEPT
The for loop relies on bash features. So don't start the script with sh; either run it with bash or make it executable.
- Under storage settings, connect the CD drive to the downloaded Ubuntu server image.
- For the installation of Ubuntu I used this documentation. It is in German and there is no English translation, but you may try this one.
- The important point is that I have not set a file system/mount point for the data partition. This will be done via the ZFS commands.
- For the rest of the instructions I assume that the unpartitioned disk space can be accessed via /dev/mapper/fs_one_volumne-data. You may wonder why it isn't a /dev/sdaX. That's because it's an LVM-managed partition in an encrypted container and each of these layers is accessible via a different device file.
- At this point you should export the encryption header of the HD via: cryptsetup luksHeaderBackup --header-backup-file header_backup /dev/sda5
This backup is needed to repair the drive if the header gets damaged. Encrypted drives have the drawback that recovery isn't possible when the encryption header is corrupted. The command above creates a backup. Don't forget to move the backup to a secure place (the encrypted hard drive isn't one).
- To start with ZFS administration read the tutorial at http://flux.org.uk/howto/solaris/zfs_tutorial_01 (don't miss the second page).
- Create a pool with: zpool create data /dev/mapper/fs_one_volumne-data
- Now activate deduplication for the pool: zfs set dedup=verify data
- The newly created pool is accessible under /data. You will also see it when you execute the df command.
- In theory the disk maintained by ZFS is now usable, but it is better to create some file systems. To create the file system data on the pool data execute: zfs create data/data
- The result looks like a normal folder. After all, it is accessible via /data/data, and all file systems in a pool share the same disk space. But you can do more, e.g. set another mount point: zfs set mountpoint=/home/joerg/data data/data
- This may only be a nice extra for you and not a feature you need for operation. But you can also create snapshots: zfs snapshot data/data@snapshot_20110702
Snapshots save the disk content at the time they were made. When you want to get back a file that was deleted or changed afterwards, you have to restore the whole file system. So it is better to create one archive file system and several working file systems. The working file systems contain the projects that are most likely to need a restore, so keep them small.
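The ZFS steps above can be collected in one small script. The following is a sketch under a few assumptions: the helper names run, snapshot_name and setup_pool are made up for this example, the snapshot name simply follows the snapshot_YYYYMMDD pattern used above, and by default the script only prints the commands instead of executing them.

```bash
#!/bin/bash
# Sketch: the ZFS steps as one script, with a date-stamped snapshot name.
# run, snapshot_name and setup_pool are hypothetical helpers.
# Set DRY_RUN=0 to really execute the commands (needs root and ZFS).

DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# snapshot_name FILESYSTEM [DATE]: e.g. data/data@snapshot_20110702
snapshot_name() {
    local fs="$1" day="${2:-$(date +%Y%m%d)}"
    echo "${fs}@snapshot_${day}"
}

setup_pool() {
    run zpool create data /dev/mapper/fs_one_volumne-data
    run zfs set dedup=verify data
    run zfs create data/data
    run zfs snapshot "$(snapshot_name data/data)"
}

setup_pool
```

Keeping the date in the snapshot name makes it easy to see from the name alone which state of the file system a snapshot holds.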
- ssh on the server and sshfs on the client to access the files on the Linux host.
- Dovecot as IMAP server
- rsync daemon to import the data
To import data I installed the rsync daemon. The advantage of rsync is that you can resume the import of a partition. Just start rsync again and it will continue where it stopped, or where it detects changed files. You don't have to transfer one gigabyte of a working directory for five MB of changes. To configure the daemon see this tutorial: https://help.ubuntu.com/community/rsync. It is quite good except for one issue: in the file /etc/rsyncd.conf use root for uid, or else the rsync daemon can't set the file permissions on the imports correctly.
What's next? Currently nothing, as I'm busy importing partitions. But the backup server could easily be extended to host a source code version control system, an SMB server, an AppleTalk server, or anything else.
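As an illustration of the uid setting, the following sketch prints a minimal rsyncd.conf. It is not the configuration from the tutorial. The module name import, the path /data/data/import and the host name backupserver in the comment are assumptions, and rsyncd_conf is a hypothetical helper.

```bash
#!/bin/bash
# Sketch: print a minimal rsyncd.conf module for importing old partitions.
# The module name and path passed below are placeholders.

# rsyncd_conf MODULE PATH: write the configuration to stdout
rsyncd_conf() {
    local module="$1" path="$2"
    cat <<EOF
# uid = root lets the daemon set owners and permissions of imported files
uid = root
gid = root

[$module]
    path = $path
    read only = false
    comment = import target for old partitions
EOF
}

rsyncd_conf import /data/data/import
# On the old machine a resumable import could then look like this:
#   rsync -a --partial /mnt/oldpartition/ rsync://backupserver/import/hda1/
```

Redirecting the output to /etc/rsyncd.conf on the server would install it; running the rsync command again after an interruption continues the import.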