A skeleton 'awk' script that parses mail in mbox format and classifies each line as either a header line or a body line.
Code:
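# awk skeleton to parse mails in mbox format
# A "From " line (note the trailing space) starts a message header;
# the first empty line separates the header from the body.
/^From /, /^$/ {
    printf "\nhead : %s", $0
    next
}
# Body lines: from an empty line up to the next "From " separator line
/^$/, /^From / {
    if (/^From /) next      # the separator line belongs to the header
    printf "\nbody : %s", $0
}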
A test run:
Code:
$ awk -f awk-parse-mails mail-j65
head : From MAILER-DAEMON Thu Feb 24 01:50:56 2011
head : Date: 24 Feb 2011 01:50:56 +0100
head : From: Mail System Internal Data <MAILER-DAEMON@hercules.utp.xnet>
head : Subject: DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA
head : Message-ID: <1298508656@hercules.utp.xnet>
head : X-IMAP: 1275177528 0000000491
head : Status: RO
head :
body :
body : This text is part of the internal format of your mail folder, and is not
body : a real message. It is created automatically by the mail system software.
body : If deleted, important folder data will be lost, and it will be re-created
body : with the data reset to initial values.
body :
body :
head : From j65nko@hercules.utp.xnet Thu Feb 24 03:03:11 2011
head : Received: from hercules.utp.xnet (localhost [127.0.0.1])
head : by hercules.utp.xnet (8.14.3/8.14.3) with ESMTP id p1O23Bmk005438
head : for <j65nko@hercules.utp.xnet>; Thu, 24 Feb 2011 03:03:11 +0100 (CET)
head : Received: (from j65nko@localhost)
head : by hercules.utp.xnet (8.14.3/8.14.3/Submit) id p1O23B1a025655
head : for j65nko; Thu, 24 Feb 2011 03:03:11 +0100 (CET)
head : Date: Thu, 24 Feb 2011 03:03:11 +0100 (CET)
head : From: j65nko@hercules.utp.xnet
head : Message-Id: <201102240203.p1O23B1a025655@hercules.utp.xnet>
head : To: j65nko@hercules.utp.xnet
head : Subject: apples
head :
body :
body : I like to eat apples
body :
head : From j65nko@hercules.utp.xnet Thu Feb 24 03:03:11 2011
head : Received: from hercules.utp.xnet (localhost [127.0.0.1])
head : by hercules.utp.xnet (8.14.3/8.14.3) with ESMTP id p1O23B5W023497
head : for <j65nko@hercules.utp.xnet>; Thu, 24 Feb 2011 03:03:11 +0100 (CET)
head : Received: (from j65nko@localhost)
head : by hercules.utp.xnet (8.14.3/8.14.3/Submit) id p1O23BHm007707
head : for j65nko; Thu, 24 Feb 2011 03:03:11 +0100 (CET)
head : Date: Thu, 24 Feb 2011 03:03:11 +0100 (CET)
head : From: j65nko@hercules.utp.xnet
head : Message-Id: <201102240203.p1O23BHm007707@hercules.utp.xnet>
head : To: j65nko@hercules.utp.xnet
head : Subject: oranges
head :
body :
body : I like to eat oranges
body :
head : From j65nko@hercules.utp.xnet Thu Feb 24 03:03:11 2011
head : Received: from hercules.utp.xnet (localhost [127.0.0.1])
head : by hercules.utp.xnet (8.14.3/8.14.3) with ESMTP id p1O23BXo026743
head : for <j65nko@hercules.utp.xnet>; Thu, 24 Feb 2011 03:03:11 +0100 (CET)
[snip]
The equivalent version in Perl:
Code:
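#!/usr/bin/perl
use strict ;
use warnings ;
while (<>) {
    chomp ;
    # header: from a "From " separator line to the first empty line
    if (/^From / .. /^$/) {
        print "\nhead : $_" ;
        next ;
    }
    # body: from an empty line up to the next "From " separator line
    if (/^$/ .. /^From /) {
        if (/^From /) { next } ;
        print "\nbody : $_" ;
    }
}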
The results are identical, including the spurious empty line at the beginning, which appears because both scripts print the newline before each line rather than after it:
Code:
$ perl-parse-mails mail-j65 >results.perl
$ awk -f awk-parse-mails mail-j65 >results.awk
$ diff results.awk results.perl
$ cat -n results.awk | head -5
1
2 head : From MAILER-DAEMON Thu Feb 24 01:50:56 2011
3 head : Date: 24 Feb 2011 01:50:56 +0100
4 head : From: Mail System Internal Data <MAILER-DAEMON@hercules.utp.xnet>
5 head : Subject: DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA
$
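The spurious empty line can be avoided by printing the newline after each line instead of before it. The sketch below is one possible variant, not a drop-in replacement for the scripts above: it tracks the parser state in an explicit flag instead of two range patterns. The names in_header, prev, sample and result are made up for this example, and so is the sample mailbox. It assumes that a message separator is a line starting with "From " (with the space) that is either the first line of the file or directly follows an empty line.

```shell
# Build a tiny two-message mbox for illustration (contents are made up).
sample=$(mktemp)
cat > "$sample" <<'EOF'
From alice@example.org Thu Feb 24 03:03:11 2011
From: alice@example.org
Subject: apples

I like to eat apples

From alice@example.org Thu Feb 24 03:03:12 2011
From: alice@example.org
Subject: oranges

I like to eat oranges
From now on, every day

EOF

result=$(awk '
# A "From " line starts a header only at the start of the file or after
# an empty line (prev is still empty when the first line is read).
/^From / && prev == "" { in_header = 1 }

# Label every line; the newline is printed after the text, so the
# output does not start with an empty line.
{ printf "%s : %s\n", (in_header ? "head" : "body"), $0 }

# The empty line that ends the header is still labelled "head";
# everything after it is body until the next separator.
in_header && /^$/ { in_header = 0 }

# Remember the current line for the separator test on the next one.
{ prev = $0 }
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

With this variant a body line such as "From now on, every day" stays in the body, because it does not follow an empty line. A body line that starts with "From " directly after an empty line would still be mistaken for a separator, unless the program that wrote the mailbox escaped it.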