jLuger.de - Detecting duplicates when importing mails from MH mailbox to IMAP server with LibEtPan

I have just finished writing a program that imports mails stored in an MH mailbox to an IMAP server. During development I had to get past some obstacles, which I want to share here. But first things first: the user story for the program.

As I mentioned in the article Backup System, my data is scattered across several partitions/HDDs. The data includes my mails, and I wanted a consolidated archive on my backup server. So the idea was to implement a program that searches for and deletes duplicates on the IMAP server and imports mails from the MH mailboxes that Sylpheed created. As stated in C programming by a Java developer, I found a library that does the mail access: LibEtPan. To get a good runtime system for the logic I switched from C to Go (the language from Google(TM); I will write a post about it later).

To detect duplicate mails I didn't compare the whole mails but just some header information. That's because I found no way to fetch messages in bulk, but there is one call that gets the header information of all mails in a folder. As far as I can tell this really sped up the program: getting the data of a folder with several thousand mails was much faster than uploading a few hundred mails. From the header I chose the subject, the message size, the received date, and the address fields (from, to, cc) to determine the identity of a mail. Creating a program that uses this information to detect duplicates on an IMAP server was pretty straightforward.
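The header-based identity described above can be sketched roughly as follows. This is a minimal illustration, not the actual program: the `Header` struct, its field names, and the `FindDuplicates` helper are hypothetical and do not correspond to LibEtPan types or calls. It assumes the header fields have already been fetched into plain Go values.

```go
package main

import (
	"fmt"
	"strings"
)

// Header holds the fields used to identify a mail.
// Hypothetical type for illustration; not a LibEtPan structure.
type Header struct {
	Subject  string
	Size     int64
	Received string // received date as reported by the mail store
	From     []string
	To       []string
	Cc       []string
}

// Key joins all identity fields into one string usable as a map key.
// A NUL byte separates the fields so that neighbouring values cannot
// run into each other.
func (h Header) Key() string {
	parts := []string{
		h.Subject,
		fmt.Sprint(h.Size),
		h.Received,
		strings.Join(h.From, ","),
		strings.Join(h.To, ","),
		strings.Join(h.Cc, ","),
	}
	return strings.Join(parts, "\x00")
}

// FindDuplicates returns the indexes of all headers whose key has
// already been seen earlier in the slice.
func FindDuplicates(headers []Header) []int {
	seen := make(map[string]bool)
	var dups []int
	for i, h := range headers {
		k := h.Key()
		if seen[k] {
			dups = append(dups, i)
			continue
		}
		seen[k] = true
	}
	return dups
}

func main() {
	folder := []Header{
		{Subject: "Hello", Size: 1200, Received: "2010-01-02", From: []string{"a@example.org"}},
		{Subject: "Other", Size: 800, Received: "2010-01-03", From: []string{"b@example.org"}},
		{Subject: "Hello", Size: 1200, Received: "2010-01-02", From: []string{"a@example.org"}},
	}
	fmt.Println("duplicate indexes:", FindDuplicates(folder))
}
```

Because the key is built from header data only, one header listing per folder is enough to find duplicates without downloading any message body.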

Before adding the automatic deletion of mails (which would need very thorough testing) I wanted to make sure that LibEtPan supports importing the mails from my MH mailboxes. When I had finished the upload code I ran a test import into an IMAP test account that had no data in it. The program printed no errors and the number of imported mails looked OK. Then I decided to test the duplicate recognition by running the program a second time. Now it shouldn't add any mails. Well, it shouldn't, but in reality it uploaded all mails again.

After adding some debug print statements I found out that the sizes of the mails reported by the IMAP server differed from the ones reported by the MH mailbox. The MH sizes were smaller. The difference wasn't fixed but increased with the mail size. My first guess was that the IMAP server did some manipulation when storing the mails. To verify that I checked the stored mails on the server. They had the same size as the ones in the MH mailbox. It seemed they got expanded on delivery. The question was why? While thinking about it I looked at the file names of the mails on the IMAP server. They had S=<number>,W=<number> in their file name. The number after S was the message size delivered by the MH mailbox, while the W number was the size from the IMAP server. Searching for the S and W numbers got me to this site: http://wiki.dovecot.org/MailboxFormat/Maildir . In the paragraph "Maildir filename extensions" it explains that S is the size of the mail on disk while W is the size of the mail when it is delivered according to RFC822. Linefeeds are then represented with CR+LF characters. This means that I must check mails for single LF characters and add one to the file size for each one found. In order to do that the whole message must be fetched. The performance decrease was noticeable. And it didn't fix the issue completely.
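The size correction described above amounts to counting every LF that is not already preceded by a CR. A minimal sketch, assuming the raw message bytes from the MH file are available in memory (the function name `WireSize` is made up for this example):

```go
package main

import "fmt"

// WireSize returns the size a message would have if every bare LF
// were delivered as CR+LF. An LF that already follows a CR is not
// counted again.
func WireSize(raw []byte) int {
	size := len(raw)
	for i, b := range raw {
		if b == '\n' && (i == 0 || raw[i-1] != '\r') {
			size++ // one extra byte for the CR that gets added
		}
	}
	return size
}

func main() {
	msg := []byte("Subject: hi\n\nbody\r\n")
	fmt.Println("on disk:", len(msg), "delivered:", WireSize(msg))
}
```

This also shows why the fix is expensive: the function needs every byte of the message, so the header-only listing is no longer enough.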

After repeating the test imports with the size fix, the number of duplicate mails decreased a lot but was still far away from the zero that I needed. The analysis showed that this was caused by several types of differences. When a sender/receiver didn't contain a domain part, MH contained just the text but the IMAP server added @MISSING_DOMAIN. Mails from MH contained line breaks in the subject which weren't in the ones fetched via IMAP. And finally, white spaces were missing in the mails retrieved from IMAP via LibEtPan. They existed when the mails were retrieved from the MH mailbox or fetched via Thunderbird from the IMAP server, but not via LibEtPan.

Those were too many and too difficult problems to fix. The current design was dead. The differences between the messages returned by MH mailboxes and the ones returned by the IMAP server were too large. But the current code had proved that duplicate detection on mails from the IMAP server works. So importing the mails from one IMAP account to my main account should support duplicate detection. This led to a design where one part of the application imports the mails into an intermediate IMAP account. As it turned out that I have some pretty large trash folders, the process stops there in order to allow manual corrections (aka deleting the trash folder). Another part of the application takes the corrected mails from the intermediate account and stores them in the main account while not importing duplicates. As I already had working code for recognizing duplicates within one account, I added this to the application as well.
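The second stage of this design, copying from the intermediate account into the main account while skipping known mails, might look roughly like the sketch below. All names here (`Mail`, `SelectNew`) are hypothetical, and the identity key is assumed to have been computed from headers that both accounts delivered via IMAP, so the same mail yields the same key on both sides.

```go
package main

import "fmt"

// Mail is a hypothetical stand-in for a message fetched over IMAP.
type Mail struct {
	Key string // identity key built from header fields
	Raw []byte // full message, needed only for the upload
}

// SelectNew returns the mails from the intermediate account whose key
// is not yet known in the main account. It records every selected key
// in known, so duplicates inside the source are skipped as well.
func SelectNew(intermediate []Mail, known map[string]bool) []Mail {
	var toUpload []Mail
	for _, m := range intermediate {
		if known[m.Key] {
			continue // already in the main account
		}
		known[m.Key] = true
		toUpload = append(toUpload, m)
	}
	return toUpload
}

func main() {
	mainAccount := map[string]bool{"k1": true}
	intermediate := []Mail{{Key: "k1"}, {Key: "k2"}, {Key: "k2"}, {Key: "k3"}}

	first := SelectNew(intermediate, mainAccount)
	second := SelectNew(intermediate, mainAccount)
	fmt.Println("first run uploads:", len(first))
	fmt.Println("second run uploads:", len(second))
}
```

Running the selection a second time against the same source picks nothing, which is the property the original MH-to-IMAP comparison could not deliver.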

The tests of the new design were successful. The program didn't import any mails after the initial import. The design turned out to be very flexible for other sources, like mails stored locally in Thunderbird. Another goodie was that I didn't have to implement the delete feature.