I have just finished writing a program that imports mails stored in
an MH mailbox into an IMAP server. During development I had to
overcome some obstacles that I want to share here. But first things
first: the user story for the program.

As I mentioned in the article "Backup System", my data is scattered
across several partitions/HDDs. The data includes my mails, and I
wanted a consolidated archive on my backup server. So the idea was to
implement a program that searches for and deletes duplicates on the
IMAP server and imports mails from the MH mailboxes that Sylpheed
created. As stated in "C programming by a Java developer", I found a
library that handles the mail access: LibEtPan. To get a good runtime
system for the logic I switched from C to Go (the language from
Google(TM); I will write a post about it later).

To detect duplicate mails I didn't compare whole mails but just some
header information. That's because I found no way to fetch messages
in bulk, but there is a single call that returns the header
information of all mails in a folder. As far as I can tell this
really sped up the program: getting the data of a folder with several
thousand mails was much faster than uploading a few hundred mails.
From the header I chose the subject, the message size, the received
date, and the address fields (from, to, cc) to determine the identity
of a mail. Creating a program that uses this information to detect
duplicates on an IMAP server was pretty straightforward.
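
The identity check described above can be sketched roughly as follows. This is only an illustration, not the code of the actual program: the `Header` type, its field names, and the functions `Key` and `FindDuplicates` are made up for this example and are not part of LibEtPan. Hashing the joined fields with SHA-256 is likewise an assumption; any stable way of combining the fields would do.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
	"time"
)

// Header holds the fields used to identify a mail.
// The type and its field names are hypothetical, not LibEtPan's.
type Header struct {
	Subject  string
	Size     int64
	Received time.Time
	From     []string
	To       []string
	Cc       []string
}

// Key derives a fixed-length identity key from the header fields.
// The fields are joined with a NUL separator so that neighbouring
// values cannot run into each other.
func (h Header) Key() string {
	parts := []string{
		h.Subject,
		fmt.Sprint(h.Size),
		h.Received.UTC().Format(time.RFC3339),
		strings.Join(h.From, ","),
		strings.Join(h.To, ","),
		strings.Join(h.Cc, ","),
	}
	sum := sha256.Sum256([]byte(strings.Join(parts, "\x00")))
	return hex.EncodeToString(sum[:])
}

// FindDuplicates returns the indexes of all headers whose key was
// already seen earlier in the slice.
func FindDuplicates(headers []Header) []int {
	seen := make(map[string]bool)
	var dups []int
	for i, h := range headers {
		k := h.Key()
		if seen[k] {
			dups = append(dups, i)
			continue
		}
		seen[k] = true
	}
	return dups
}

func main() {
	received := time.Date(2010, 1, 2, 3, 4, 5, 0, time.UTC)
	mail := Header{
		Subject:  "Hello",
		Size:     120,
		Received: received,
		From:     []string{"a@example.org"},
		To:       []string{"b@example.org"},
	}
	other := mail
	other.Subject = "Something else"

	// The third entry repeats the first one.
	fmt.Println(FindDuplicates([]Header{mail, other, mail}))
}
```

Because the key depends on every field exactly as given, even a small difference in how a source reports a field makes two copies of the same mail look different.
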

Before adding the automatic deletion of mails (which would need very
thorough testing) I wanted to make sure that LibEtPan supports
importing the mails from my MH mailboxes. When I had finished the
upload code I ran a test import into an IMAP test account that had no
data in it. The program printed no errors and the number of imported
mails looked OK. Then I decided to test the duplicate recognition by
running the program a second time. Now it shouldn't add any mails.
Well, it shouldn't, but in reality it uploaded all mails again.

After adding some debug print statements I found out that the sizes
of the mails reported by the IMAP server differed from the ones
reported by the MH mailbox. The MH sizes were smaller. The difference
wasn't fixed but increased with the mail size. My first guess was
that the IMAP server did some manipulation when storing the mails. To
verify that I checked the stored mails on the server. They had the
same size as the ones in the MH mailbox. It seems they get expanded
on delivery. The question was why? While thinking about it I looked
at the file names of the mails on the IMAP server. They had
S=<number>,W=<number> in their file names. The number after S was the
message size delivered by the MH mailbox while the W number was the
size from the IMAP server. Searching for the S and W numbers got me
this site: http://wiki.dovecot.org/MailboxFormat/Maildir. In the
paragraph "Maildir filename extensions" it explains that S is the
size of the mail on disk while W is the size of the mail when it is
delivered according to RFC822. There, line feeds are represented with
CR+LF characters. This means that I must check mails for single LF
characters and add one to the file size for each one found. In order
to do that the whole message must be fetched. The performance
decrease was noticeable. And it didn't fix the issue completely.
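
The size correction can be sketched like this. It is a minimal illustration, assuming the whole message is available as a byte slice; the function name `WireSize` is made up for this example.

```go
package main

import "fmt"

// WireSize returns the size raw would have once every bare LF is
// expanded to CR+LF. An LF that already follows a CR is left alone.
func WireSize(raw []byte) int {
	size := len(raw)
	for i, b := range raw {
		if b == '\n' && (i == 0 || raw[i-1] != '\r') {
			size++ // one extra byte for the missing CR
		}
	}
	return size
}

func main() {
	onDisk := []byte("Subject: test\n\nbody\n")
	// Prints the on-disk size (S) next to the corrected size (W).
	fmt.Println(len(onDisk), WireSize(onDisk))
}
```

Since the function has to look at every byte, the full message must be read first, which is where the performance cost comes from.
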

After repeating the test imports with the size fix, the number of
duplicate mails decreased a lot but was still far from the zero that
I needed. The analysis showed that this was caused by several types
of differences. For example, when a sender/receiver didn't contain a
domain part, MH contained just the text but the IMAP server added
@MISSING_DOMAIN. Mails from MH contained line breaks in the subject
which weren't in the ones fetched via IMAP. And finally, white space
was missing in the mails retrieved from IMAP via LibEtPan. It existed
when the mails were read from the MH mailbox or fetched via
Thunderbird from the IMAP server, but not when fetched via LibEtPan.

Those were too many and too difficult problems to fix. The current
design was dead. The differences between the messages returned by MH
mailboxes and the ones returned by the IMAP server were too large.
But the current code had proved that duplicate detection on mails
from the IMAP server works. So importing the mails from one IMAP
account into my main account should support duplicate detection. This
led to a design in which one part of the application imports the
mails into an intermediate IMAP account. As it turned out that I have
some pretty large trash folders, the process stops at this point to
allow manual corrections (i.e. deleting the trash folders). Another
part of the application takes the corrected mails from the
intermediate account and stores them in the main account while
skipping duplicates. Since I already had working code for recognizing
duplicates within one account, I added that to the application as
well.
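
The second stage of this design can be sketched as below. This is a simplified, in-memory illustration only: the `Mail` type and the function `SelectNew` are hypothetical, the IMAP access is left out entirely, and `Key` stands for whatever header-based identity the program computes.

```go
package main

import "fmt"

// Mail is a hypothetical stand-in for a message fetched via IMAP.
type Mail struct {
	Key  string // header-based identity of the mail
	Body string
}

// SelectNew returns the mails of the intermediate account that are
// not yet present in the main account. A mail that occurs more than
// once in the intermediate account is returned only once.
func SelectNew(mainAcct, intermediate []Mail) []Mail {
	seen := make(map[string]bool, len(mainAcct))
	for _, m := range mainAcct {
		seen[m.Key] = true
	}
	var fresh []Mail
	for _, m := range intermediate {
		if seen[m.Key] {
			continue // already in the main account or seen before
		}
		seen[m.Key] = true
		fresh = append(fresh, m)
	}
	return fresh
}

func main() {
	mainAcct := []Mail{{Key: "k1", Body: "old mail"}}
	intermediate := []Mail{
		{Key: "k1", Body: "old mail"},
		{Key: "k2", Body: "new mail"},
	}

	// First run: only the mail missing from the main account is picked.
	fresh := SelectNew(mainAcct, intermediate)
	fmt.Println(len(fresh))

	// Second run after the import: nothing is left to import.
	mainAcct = append(mainAcct, fresh...)
	fmt.Println(len(SelectNew(mainAcct, intermediate)))
}
```

Both sides of the comparison come from an IMAP server here, so the keys are computed from data in the same representation.
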

The tests of the new design were successful. The program didn't
import any mails after the initial import. The design turned out to
be very flexible for other sources like mails stored locally in
Thunderbird. Another bonus was that I didn't have to implement the
delete feature.