DaemonForums  

24th February 2011
J65nko
Parsing emails with 'awk' and 'perl'

An 'awk' skeleton script that parses mail in mbox format and decides which lines are email header lines and which lines make up the body of each mail.

Code:
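# awk skeleton to parse mails in mbox format
# a line starting with "From " (note the space) starts a new message
# empty line separates header from body

# header: from the "From " separator line up to and including the empty line
/^From /, /^$/ {
    printf "\nhead : %s", $0
    next
}

# body: from an empty line outside a header up to the next separator line
/^$/, /^From / {
    if ($0 ~ /^From /) next
    printf "\nbody : %s", $0
}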
A test run:
Code:
$ awk -f awk-parse-mails mail-j65

head : From MAILER-DAEMON Thu Feb 24 01:50:56 2011
head : Date: 24 Feb 2011 01:50:56 +0100
head : From: Mail System Internal Data <MAILER-DAEMON@hercules.utp.xnet>
head : Subject: DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA
head : Message-ID: <1298508656@hercules.utp.xnet>
head : X-IMAP: 1275177528 0000000491
head : Status: RO
head : 
body : 
body : This text is part of the internal format of your mail folder, and is not
body : a real message.  It is created automatically by the mail system software.
body : If deleted, important folder data will be lost, and it will be re-created
body : with the data reset to initial values.
body : 
body : 
head : From j65nko@hercules.utp.xnet Thu Feb 24 03:03:11 2011
head : Received: from hercules.utp.xnet (localhost [127.0.0.1])
head :  by hercules.utp.xnet (8.14.3/8.14.3) with ESMTP id p1O23Bmk005438
head :  for <j65nko@hercules.utp.xnet>; Thu, 24 Feb 2011 03:03:11 +0100 (CET)
head : Received: (from j65nko@localhost)
head :  by hercules.utp.xnet (8.14.3/8.14.3/Submit) id p1O23B1a025655
head :  for j65nko; Thu, 24 Feb 2011 03:03:11 +0100 (CET)
head : Date: Thu, 24 Feb 2011 03:03:11 +0100 (CET)
head : From: j65nko@hercules.utp.xnet
head : Message-Id: <201102240203.p1O23B1a025655@hercules.utp.xnet>
head : To: j65nko@hercules.utp.xnet
head : Subject: apples
head : 
body : 
body : I like to eat apples
body : 
head : From j65nko@hercules.utp.xnet Thu Feb 24 03:03:11 2011
head : Received: from hercules.utp.xnet (localhost [127.0.0.1])
head :  by hercules.utp.xnet (8.14.3/8.14.3) with ESMTP id p1O23B5W023497
head :  for <j65nko@hercules.utp.xnet>; Thu, 24 Feb 2011 03:03:11 +0100 (CET)
head : Received: (from j65nko@localhost)
head :  by hercules.utp.xnet (8.14.3/8.14.3/Submit) id p1O23BHm007707
head :  for j65nko; Thu, 24 Feb 2011 03:03:11 +0100 (CET)
head : Date: Thu, 24 Feb 2011 03:03:11 +0100 (CET)
head : From: j65nko@hercules.utp.xnet
head : Message-Id: <201102240203.p1O23BHm007707@hercules.utp.xnet>
head : To: j65nko@hercules.utp.xnet
head : Subject: oranges
head : 
body : 
body : I like to eat oranges
body : 
head : From j65nko@hercules.utp.xnet Thu Feb 24 03:03:11 2011
head : Received: from hercules.utp.xnet (localhost [127.0.0.1])
head :  by hercules.utp.xnet (8.14.3/8.14.3) with ESMTP id p1O23BXo026743
head :  for <j65nko@hercules.utp.xnet>; Thu, 24 Feb 2011 03:03:11 +0100 (CET)
[snip]
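As a rough sketch of how the action block of such a skeleton might be filled in, the following prints the Subject of every message by looking only at lines inside the header range. The file name `subjects.mbox`, the sample messages and the helper name `list_subjects` are made up for illustration, and the sketch assumes the header is spelled exactly `Subject:` and is not folded over several lines.

```shell
#!/bin/sh
# Hypothetical sample data standing in for a real mbox file.
cat > subjects.mbox <<'EOF'
From alice@example.org Thu Feb 24 03:03:11 2011
From: alice@example.org
Subject: apples

I like to eat apples
Subject: this line is body text, not a header

From alice@example.org Thu Feb 24 03:03:12 2011
From: alice@example.org
Subject: oranges

I like to eat oranges

EOF

# list_subjects is a hypothetical helper name.
list_subjects() {
    awk '
    /^From /, /^$/ {                       # header lines only
        if ($0 ~ /^Subject:/) {
            sub(/^Subject:[ \t]*/, "")     # drop the field name
            print
        }
    }
    ' "$1"
}

list_subjects subjects.mbox
```

The "Subject:" line inside the first body is ignored, because the range has already been closed by the empty line that ends the header.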
The equivalent version of the skeleton in perl:
Code:
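#!/usr/bin/perl

use strict;
use warnings;

while (<>) {
    chomp;
    # header: from the "From " separator line up to and including the empty line
    if (/^From / .. /^$/) {
        print "\nhead : $_";
        next;
    }

    # body: from an empty line outside a header up to the next separator line
    if (/^$/ .. /^From /) {
        next if /^From /;
        print "\nbody : $_";
    }
}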
The results are identical, including the spurious empty line at the beginning, which appears because both scripts print the newline before each line instead of after it:
Code:
$ perl-parse-mails mail-j65 >results.perl
$ awk -f awk-parse-mails mail-j65 >results.awk
$ diff results.awk results.perl
$ cat -n results.awk | head -5
     1  
     2  head : From MAILER-DAEMON Thu Feb 24 01:50:56 2011
     3  head : Date: 24 Feb 2011 01:50:56 +0100
     4  head : From: Mail System Internal Data <MAILER-DAEMON@hercules.utp.xnet>
     5  head : Subject: DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA
$
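Two details of the range-pattern approach are worth knowing. The header rule ends with 'next', so the body range never sees the empty line that closes a header and only opens at a later empty line. And the newline is printed first, which gives the spurious empty line. A minimal sketch of a variant, not the skeleton itself, keeps a single flag instead of two ranges and prints the newline last. The names `sample.mbox`, `parse_mbox` and `in_head` are made up for illustration, and the sketch assumes that body lines starting with "From " have been escaped as ">From ", as mbox writers commonly do.

```shell
#!/bin/sh
# Hypothetical sample data: a tiny two-message mbox, not the mail-j65 file.
cat > sample.mbox <<'EOF'
From alice@example.org Thu Feb 24 03:03:11 2011
From: alice@example.org
Subject: apples

I like to eat apples

From alice@example.org Thu Feb 24 03:03:12 2011
From: alice@example.org
Subject: oranges

I like to eat oranges

EOF

# parse_mbox is a hypothetical helper name; it labels every line as head or body.
parse_mbox() {
    awk '
    /^From / { in_head = 1 }          # separator line: a new header starts
    in_head {
        print "head : " $0
        if ($0 == "") in_head = 0     # empty line: the header ends here
        next
    }
    { print "body : " $0 }            # everything else belongs to a body
    ' "$1"
}

parse_mbox sample.mbox
```

With this variant the first line of output is the first "head : " line, and the body of a message starts directly after the empty line that ends its header.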