24th February 2011
J65nko
Parsing emails with 'awk' and 'perl'

An 'awk' skeleton script that parses mails in mbox format and decides which lines are email header lines and which lines make up the body of the mail.

Code:
# awk skeleton to parse mails in mbox format
# empty line separates header from body

/^From/, /^$/ {
    printf "\nhead : %s", $0
    next
}

/^$/,/^From/ {
    if ($1 ~ /^From/) next
    printf "\nbody : %s", $0
}
A test run:
Code:
 $ awk -f awk-parse-mails mail-j65                                                     

head : From MAILER-DAEMON Thu Feb 24 01:50:56 2011
head : Date: 24 Feb 2011 01:50:56 +0100
head : From: Mail System Internal Data <MAILER-DAEMON@hercules.utp.xnet>
head : Subject: DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA
head : Message-ID: <1298508656@hercules.utp.xnet>
head : X-IMAP: 1275177528 0000000491
head : Status: RO
head : 
body : 
body : This text is part of the internal format of your mail folder, and is not
body : a real message.  It is created automatically by the mail system software.
body : If deleted, important folder data will be lost, and it will be re-created
body : with the data reset to initial values.
body : 
body : 
head : From j65nko@hercules.utp.xnet Thu Feb 24 03:03:11 2011
head : Received: from hercules.utp.xnet (localhost [127.0.0.1])
head :  by hercules.utp.xnet (8.14.3/8.14.3) with ESMTP id p1O23Bmk005438
head :  for <j65nko@hercules.utp.xnet>; Thu, 24 Feb 2011 03:03:11 +0100 (CET)
head : Received: (from j65nko@localhost)
head :  by hercules.utp.xnet (8.14.3/8.14.3/Submit) id p1O23B1a025655
head :  for j65nko; Thu, 24 Feb 2011 03:03:11 +0100 (CET)
head : Date: Thu, 24 Feb 2011 03:03:11 +0100 (CET)
head : From: j65nko@hercules.utp.xnet
head : Message-Id: <201102240203.p1O23B1a025655@hercules.utp.xnet>
head : To: j65nko@hercules.utp.xnet
head : Subject: apples
head : 
body : 
body : I like to eat apples
body : 
head : From j65nko@hercules.utp.xnet Thu Feb 24 03:03:11 2011
head : Received: from hercules.utp.xnet (localhost [127.0.0.1])
head :  by hercules.utp.xnet (8.14.3/8.14.3) with ESMTP id p1O23B5W023497
head :  for <j65nko@hercules.utp.xnet>; Thu, 24 Feb 2011 03:03:11 +0100 (CET)
head : Received: (from j65nko@localhost)
head :  by hercules.utp.xnet (8.14.3/8.14.3/Submit) id p1O23BHm007707
head :  for j65nko; Thu, 24 Feb 2011 03:03:11 +0100 (CET)
head : Date: Thu, 24 Feb 2011 03:03:11 +0100 (CET)
head : From: j65nko@hercules.utp.xnet
head : Message-Id: <201102240203.p1O23BHm007707@hercules.utp.xnet>
head : To: j65nko@hercules.utp.xnet
head : Subject: oranges
head : 
body : 
body : I like to eat oranges
body : 
head : From j65nko@hercules.utp.xnet Thu Feb 24 03:03:11 2011
head : Received: from hercules.utp.xnet (localhost [127.0.0.1])
head :  by hercules.utp.xnet (8.14.3/8.14.3) with ESMTP id p1O23BXo026743
head :  for <j65nko@hercules.utp.xnet>; Thu, 24 Feb 2011 03:03:11 +0100 (CET)
[snip]
The equivalent version in perl:
Code:
#!/usr/bin/perl

use strict ;
use warnings ;

while (<>) {
    chomp ;
    if (/^From/../^$/) { 
        print "\nhead : $_" ;
        next ;
        }

    if (/^$/.. /^From/) { 
        if (/^From/) { next } ;
        print "\nbody : $_" ;
        }
}
The results are identical, including the spurious empty line at the beginning, which comes from the "\n" at the start of each output format string:
Code:
$ perl-parse-mails mail-j65 >results.perl
$ awk -f awk-parse-mails mail-j65 >results.awk
$ diff results.awk results.perl
$ cat -n results.awk | head -5
     1  
     2  head : From MAILER-DAEMON Thu Feb 24 01:50:56 2011
     3  head : Date: 24 Feb 2011 01:50:56 +0100
     4  head : From: Mail System Internal Data <MAILER-DAEMON@hercules.utp.xnet>
     5  head : Subject: DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA
$
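
One thing to watch with the range patterns above: the header rule ends with 'next', so the empty separator line is consumed there and never reaches the second range. On a mailbox with exactly one empty line between header and body, the body range would only start at a later empty line. A minimal sketch of an alternative that keeps the state in a variable instead, wrapped in a small shell script so it can be pasted and run as-is. The names 'parse_mails' and 'state' and the two sample mails are made up for illustration. It matches "From " with a trailing space, as mbox uses for the start of a message, and prints the newline at the end of each line rather than at the start:

```shell
#!/bin/sh
# Sketch: classify mbox lines as header or body with an explicit state flag.
# parse_mails, state and the sample mails are illustrative assumptions,
# not part of the original scripts.

parse_mails() {
    awk '
    /^From / { state = "head" }          # mbox separator line starts a header
    state == "head" {
        printf "head : %s\n", $0
        if ($0 == "") state = "body"     # empty line ends the header
        next
    }
    state == "body" { printf "body : %s\n", $0 }
    ' "$@"
}

# Two invented mails with a single empty line between header and body.
sample='From a@example.org Thu Feb 24 03:03:11 2011
From: a@example.org
Subject: apples

I like to eat apples

From b@example.org Thu Feb 24 03:03:12 2011
Subject: oranges

Fromage goes well with oranges'

out=$(printf '%s\n' "$sample" | parse_mails)
printf '%s\n' "$out"
```

As in the original, the empty separator line is still reported as a header line. The body line starting with "Fromage" stays in the body, because only "From " followed by a space switches back to header mode.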
__________________
You don't need to be a genius to debug a pf.conf firewall ruleset, you just need the guts to run tcpdump