Mar 03

Simple mbox eMail reader in Perl

Category: Linux,Perl   — Published by tengo on March 3, 2009 at 7:56 am

Although the email message format (RFC 2822) is one of the oldest of all the Internet standards, getting Perl to handle it is actually not that simple at first. Over at CPAN, it took me some minutes and a prototype script to get it to do what I thought would be a one-liner.

My scenario was that I needed to parse the contents of an mbox file and display them in human-readable form. Never heard of mbox? Well, there are a number of formats for storing emails on disk. Mail transfer agents (MTAs) need them to store mails temporarily for delivery, or to store them locally for a user. Go and look into your /var/mail/ directory: if you are running postfix or some other MTA, you may find some of these files there.

In a nutshell, there are three major formats: Mbox, Maildir and MH. The docs for mutt (a simple email reader) sum up the pros and cons quite well. So which format is used in these files? In essence, if you have one file, it is Mbox; if you have directories with lots of files in them, it's either Maildir (more likely) or MH. As I had a single file in my scenario, consisting of concatenated emails in plain text, I had to deal with the Mbox format.
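The rule of thumb above can be sketched in a few lines of core Perl. This is only a rough heuristic under my own assumptions: guess_mail_format() is a made-up helper name, the Maildir check relies on the usual cur/, new/ and tmp/ subdirectories, and the MH check relies on MH naming its message files with plain numbers.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Guess the storage format of a mail store from its layout on disk.
# guess_mail_format() is a hypothetical helper, not part of any CPAN module.
sub guess_mail_format {
    my ($path) = @_;

    return 'mbox'    if -f $path;       # one plain file: concatenated emails
    return 'unknown' unless -d $path;

    # Maildir keeps its messages in the subdirectories cur/, new/ and tmp/.
    my $maildir_subdirs = grep { -d "$path/$_" } qw(cur new tmp);
    return 'Maildir' if $maildir_subdirs == 3;

    # MH stores one message per file, and the file names are plain numbers.
    opendir(my $dh, $path) or return 'unknown';
    my @numbered = grep { /^\d+$/ && -f "$path/$_" } readdir $dh;
    closedir $dh;

    return @numbered ? 'MH' : 'unknown';
}

# Usage: perl guess_format.pl /var/mail/some_mbox_file
my $path = shift @ARGV;
print "$path looks like ", guess_mail_format($path), "\n" if defined $path;
```

A single file is reported as mbox without looking inside it, so treat the answer as a hint rather than a verdict.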

Over at CPAN, there are quite a number of modules centering around email, mbox and the like. One that got my attention was Mail::Mbox::MessageParser. When I constructed a simple app to use it, expecting it to spit out the emails from the mbox file nicely displayed, I was surprised: all the module does (and does quite well) is separate the mbox file into one chunk per email, nothing more!
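To make "one chunk per email" concrete, here is a deliberately naive sketch of that splitting step in core Perl. It is not how Mail::Mbox::MessageParser works internally; split_mbox() is a made-up helper, and it simply assumes that every message starts with a "From " line that is either the first line of the file or follows a blank line. The sample addresses are invented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Naively split an mbox file handle into one string per email.
# split_mbox() is a hypothetical helper for illustration only.
sub split_mbox {
    my ($fh) = @_;
    my @emails;
    my $prev_blank = 1;    # the start of the file counts as "after a blank line"

    while (my $line = <$fh>) {
        # A "From " line after a blank line opens a new message.
        push @emails, '' if $prev_blank && $line =~ /^From /;

        # Anything before the first "From " line is ignored.
        $emails[-1] .= $line if @emails;

        $prev_blank = ($line =~ /^\r?\n?$/) ? 1 : 0;
    }
    return @emails;
}

my $sample = <<'MBOX';
From alice@example.org Tue Mar  3 07:56:00 2009
From: alice@example.org
Subject: first

Hello

From bob@example.org Tue Mar  3 08:00:00 2009
From: bob@example.org
Subject: second

Hi there
MBOX

# Read the sample through an in-memory file handle.
open(my $fh, '<', \$sample) or die "Cannot open in-memory mbox: $!";
my @emails = split_mbox($fh);
close $fh;

printf "%d emails found\n", scalar @emails;
```

Each chunk is still just raw text, separator line included, which is exactly why a second module is needed to get at headers and body.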

So back to CPAN, as it was obvious that I needed another module to actually parse a single email into its components, headers and body. What I found was Email::Simple, from a relatively new project to rework the older email tools on CPAN. I threw it into the script and it nicely chopped up the email into header fields and body, but another surprise: no support for MIME encoding or decoding of messages. More elaborately encoded emails came out as garbage. Back to CPAN again...
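For the curious, this is roughly what the "garbage" looks like for a header. Non-ASCII header text travels as MIME "encoded words", and something has to decode them. The sketch below uses the 'MIME-Header' encoding that ships with the core Encode module; the sample subject line is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode_utf8);

# A Subject header as it appears in the raw email: a MIME "encoded word"
# (charset ISO-8859-1, Q encoding). The text itself is an invented sample.
my $raw_subject = '=?ISO-8859-1?Q?Gr=FC=DFe_aus_M=FCnchen?=';

# 'MIME-Header' is provided by Encode::MIME::Header, which ships with Encode.
# The result is a Perl character string.
my $subject = decode('MIME-Header', $raw_subject);

print "raw:     $raw_subject\n";

# Encode the characters as UTF-8 bytes before printing them to a terminal.
print "decoded: ", encode_utf8($subject), "\n";
```

Without the decoding step, the raw form is what ends up on screen.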

What I found was Email::MIME, also written by Ricardo. It builds on Email::Simple and adds support for encodings. A bit more tweaking (use Encode) and the script below did what I thought would be a one-liner in Perl: reading, parsing and displaying the From, To and Subject headers of each email in an mbox file:

#!/usr/bin/perl

use strict;
use warnings;

use Mail::Mbox::MessageParser;
use Email::MIME;
use Encode;

# Set up cache. (Not necessary if enable_cache is false.)
Mail::Mbox::MessageParser::SETUP_CACHE({
    'file_name' => 'mbox.cache',
});

my $folder_reader = Mail::Mbox::MessageParser->new({
    'file_name'   => '/var/mail/some_mbox_file',
    'enable_grep' => 1,
});

# On failure the constructor returns an error string instead of an object.
die $folder_reader unless ref $folder_reader;

# This is the main loop. It's executed once for each email.
while (!$folder_reader->end_of_file()) {
    # read_next_email() returns a reference to the raw message text.
    my $rawemail = $folder_reader->read_next_email();
    my $email    = Email::MIME->new($rawemail);

    print "-----\n";
    for my $name (qw(From To Subject)) {
        # header() returns the decoded value, or undef if the header is missing.
        my $value = $email->header($name);
        $value = '' unless defined $value;
        print "$name: ", encode_utf8($value), "\n";
    }
}