Reformatting ftp listings with awk and perl
1 Introduction

While the 'ls' utility provides a wealth of options to display a directory listing, you are left out in the cold if you want to customize the format of a directory listing retrieved from an ftp server. I needed a listing sorted by date. The first script is written in 'awk'; the 'perl' script accomplishes the same goal.

2 The problem

Before I install pre-compiled packages, I need to make sure that the ftp mirror has a complete copy of the package repository at ftp.openbsd.org. Such a listing can easily be retrieved from the command line:

Code:
$ echo 'ls /pub/OpenBSD/snapshots/packages/i386 packages_snap' | \
    ftp -4ai ftp.calyx.nl

Code:
ls [remote-directory [local-file]]
        Print a listing of the contents of a directory on the remote
        machine.  The listing includes any system-dependent information
        that the server chooses to include; for example, most UNIX
        systems will produce output from the command `ls -l'.  If
        remote-directory is left unspecified, the current working
        directory is used.  If interactive prompting is on, ftp will
        prompt the user to verify that the last argument is indeed the
        target local file for receiving ls output.  If no local file is
        specified, or if local-file is `-', the output is sent to the
        terminal.

The options:

Code:
-4      Forces ftp to use IPv4 addresses only.
-a      Causes ftp to bypass the normal login procedure and use an
        anonymous login instead.
-i      Turns off interactive prompting during multiple file transfers.

Code:
$ head -6 packages_snap
total 18002768
-rw-r--r-- 1 276 125 8246 Dec 11 07:35 915resolution-0.5.3.tgz
-rw-r--r-- 1 276 125 121402 Dec 11 07:35 9libs-1.0p4.tgz
-rw-r--r-- 1 276 125 9874 Dec 11 07:35 9menu-1.7.tgz
-rw-r--r-- 1 276 125 22054 Dec 11 07:35 9wm-1.2prep0.tgz
-rw-r--r-- 1 276 125 215250 Dec 11 07:35 AcePerl-1.91p0-opt.tgz

3 Using 'awk'

In an interview with the Australian Computer World magazine, Alfred V. Aho, one of the three architects of the 'awk' programming language, summarizes the language as follows:

"AWK is a language for processing files of text. A file is treated as a sequence of records, and by default each line is a record. Each line is broken up into a sequence of fields, so we can think of the first word in a line as the first field, the second word as the second field, and so on. An AWK program is a sequence of pattern-action statements. AWK reads the input a line at a time. A line is scanned for each pattern in the program, and for each pattern that matches, the associated action is executed."

When 'awk' reads a line, it automatically splits the line into words or fields and stores them in numbered variables: $1, $2, $3, etc. Take a line from a directory listing as an example:

Code:
-rw-r--r-- 1 276 125 8246 Dec 11 07:35 915resolution-0.5.3.tgz

$1 : -rw-r--r--
$2 : 1
$3 : 276
$4 : 125
$5 : 8246
$6 : Dec
$7 : 11
$8 : 07:35
$9 : 915resolution-0.5.3.tgz

Code:
$ awk '{ print $6, $7, $8, $9, $5 }' test-file
Dec 11 07:35 915resolution-0.5.3.tgz 8246
Dec 11 07:35 9libs-1.0p4.tgz 121402

3.1 Associative array or hash table

Modern shells support arrays. For example:

Code:
$ month[1]=Jan
$ month[2]=Feb
$ month[3]=Mar
$ for N in 1 2 3 ; do echo Month $N = ${month[$N]} ; done
Month 1 = Jan
Month 2 = Feb
Month 3 = Mar

Code:
month["Jan"]="01"
month["Feb"]="02"
month["Mar"]="03"

Code:
$ (echo Nov 10; echo Jan 21) | \
    awk 'BEGIN { month["Nov"]="11" ; month["Jan"]="01" } { print month[$1], $2 }'
11 10
01 21

The { print month[$1], $2 } block is executed for each line fed to 'awk'. For the first line, the field 'Nov' is assigned to $1 and the second field, '10', to $2. The statement print month[$1] therefore becomes print month["Nov"], which produces the corresponding value "11". The same procedure is repeated for the "Jan 21" line, resulting in the output "01 21".

The complete program so far:

Code:
#!/usr/bin/awk -f

BEGIN {
    # associative array mapping month names -> numbers
    month["Jan"]="01"
    month["Feb"]="02"
    month["Mar"]="03"
    month["Apr"]="04"
    month["May"]="05"
    month["Jun"]="06"
    month["Jul"]="07"
    month["Aug"]="08"
    month["Sep"]="09"
    month["Oct"]="10"
    month["Nov"]="11"
    month["Dec"]="12"
}

{
    print $6, $7, $8, $9, $5
}

Code:
$ ls -l | tail -3
-rw-r--r-- 1 j65nko j65nko 52 Dec 9 01:04 q
-rwxr-xr-x 1 j65nko j65nko 330 Dec 16 23:11 reformat.awk
-rw-r--r-- 1 j65nko j65nko 387 Jan 31 2007 tmp
$ ls -l | tail -3 | reformat.awk
Dec 9 01:04 q 52
Dec 16 23:11 reformat.awk 330
Jan 31 2007 tmp 387

3.2 Formatting with printf

Both 'awk' and 'sh' know a printf function similar to the one found in the standard C library. To shorten the examples, we will first use the shell's printf. See printf(1).

To format an unsigned decimal number we use the %u conversion specification. Between the '%' and the 'u' we insert the padding character '0' and the field width '2', which gives %02u.

Code:
$ printf "%02u\n" 1
01
$ printf "%02u\n" 12
12

Code:
$ echo Jan 1 | awk '{ printf "%s %02u \n", $1, $2 }'
Jan 01

The other problem was that the same field can hold either a time string of 5 positions or a year of only 4 positions. Choosing a field width of 5 and a blank for padding solves this last issue.

Code:
$ printf "% 5u\n" 2007
 2007
$ printf "% 5u\n" 12:11
printf: 12:11: not completely converted
   12

Code:
$ printf "% +5s\n" 12:11 2007
12:11
 2007

File name aligned in a field of 40 positions:

Code:
$ printf "%40s\n" "zope-coreblog-1.2p0.tgz"
                 zope-coreblog-1.2p0.tgz

Code:
$ printf "%-40s\n" "zope-coreblog-1.2p0.tgz"
zope-coreblog-1.2p0.tgz

Code:
$ printf "%10u\n" 169395
    169395

Code:
$ printf "%-40s %10u\n" "zope-coreblog-1.2p0.tgz" "169395"
zope-coreblog-1.2p0.tgz                      169395

3.3 The complete program in 'awk'

Merging all the conversion specifications into one statement, the final printf for 'awk' is:

Code:
printf "%s-%02u % 5s %-40s %10u\n", month[$6], $7, $8, $9, $5

%s    : prints the number produced by month[$6]
-     : prints a literal '-' to separate month from day
%02u  : day of the month as found in $7
      : a space
% 5s  : the time or year from $8, right-aligned in a field of 5
      : a space
%-40s : file name from $9, left-aligned in a field 40 positions wide
      : a space
%10u  : file size in $5, right-justified in a field of 10 positions
\n    : a newline or linefeed

Code:
#!/usr/bin/awk -f

# -rw-r--r-- 1  276 125 8251   Dec 13 11:56 915resolution-0.5.3.tgz
# -rw-r--r-- 1  276 125 121406 Dec 13 11:56 9libs-1.0p4.tgz
# $1         $2 $3  $4  $5     $6  $7 $8    $9

BEGIN {
    # associative array mapping month names -> numbers
    month["Jan"]="01"
    month["Feb"]="02"
    month["Mar"]="03"
    month["Apr"]="04"
    month["May"]="05"
    month["Jun"]="06"
    month["Jul"]="07"
    month["Aug"]="08"
    month["Sep"]="09"
    month["Oct"]="10"
    month["Nov"]="11"
    month["Dec"]="12"
}

{
    #print $6, $7, $8, $9, $5
    printf "%s-%02u % 5s %-40s %10u\n", month[$6], $7, $8, $9, $5
}

A test, however, reveals a slight problem.

Code:
$ head -4 pkg_snap-alberta | reformat.awk
-00                                                         0
12-13 11:56 915resolution-0.5.3.tgz                        8251
12-13 11:56 9libs-1.0p4.tgz                              121406
12-13 11:56 9menu-1.7.tgz                                  9877

Code:
$ head -4 pkg_snap-alberta
total 18016088
-rw-r--r-- 1 276 125 8251 Dec 13 11:56 915resolution-0.5.3.tgz
-rw-r--r-- 1 276 125 121406 Dec 13 11:56 9libs-1.0p4.tgz
-rw-r--r-- 1 276 125 9877 Dec 13 11:56 9menu-1.7.tgz

The 'total' line has only two fields, so every field the printf refers to is empty. A simple solution is to ignore lines that do not have 9 fields. 'awk' has a variable called NF, which contains the number of fields and is updated for each line.

Code:
{
    if ( NF != 9 ) { next }
    printf "%s-%02u % 5s %-40s %10u\n", month[$6], $7, $8, $9, $5
}

Code:
$ head -4 pkg_snap-alberta | reformat.awk
12-13 11:56 915resolution-0.5.3.tgz                        8251
12-13 11:56 9libs-1.0p4.tgz                              121406
12-13 11:56 9menu-1.7.tgz                                  9877

Code:
$ reformat.awk < /home/j65nko/packages_snap | sort | head -4
12-13 11:56 915resolution-0.5.3.tgz                        8251
12-13 11:56 9libs-1.0p4.tgz                              121406
12-13 11:56 9menu-1.7.tgz                                  9877
12-13 11:56 9wm-1.2prep0.tgz                              22061
$ reformat.awk < /home/j65nko/packages_snap | sort | tail -4
12-13 12:03 ztrack-1.0.tgz                                 8718
12-13 12:03 zziplib-0.13.49.tgz                           98528
12-13 12:03 zzuf-0.12.tgz                                124942
12-14 04:07 index.txt                                    117936

3.4 Back to the original problem

Extracting the date information is now simply a matter of printing fields 6, 7 and 8:

Code:
if ( NF != 9 ) { next }
printf "%s-%02u % 5s\n", month[$6], $7, $8

Code:
$ extract-dates.awk pkg_snap-alberta | sort -u
12-13 11:56
12-13 11:57
12-13 11:58
12-13 11:59
12-13 12:00
12-13 12:01
12-13 12:02
12-13 12:03
12-14 04:07

Using 'grep' it is easy to find the files dated '12-14 04:07':

Code:
$ reformat.awk pkg_snap-alberta | grep '12-14 04:07'
12-14 04:07 index.txt                                    117936

Repeating the same procedure for a listing from the ftp.stacken.kth.se mirror:

Code:
$ extract-dates.awk pkg_snap-stacken | sort -u | cat -n
     1  12-13 18:56
     2  12-13 18:57
     3  12-13 18:58
     4  12-13 18:59
     5  12-13 19:00
     6  12-13 19:01
     7  12-13 19:02
     8  12-13 19:03
     9  12-14 11:07

A listing of a mirror in the middle of an update:

Code:
11-08 17:06
11-08 17:07
11-08 17:08
11-08 17:09
11-08 17:10
11-15 19:47

An ftp site whose system administrator confirmed in an email that the 'rsync' 'delete' option had been forgotten:

Code:
01-07  2008
01-22  2008
01-31 15:17
01-31 15:18
01-31 15:19
01-31 15:20
01-31 15:21
01-31 15:22
01-31 15:23
02-16 22:24
02-16 22:25
02-16 22:26
02-16 22:27
02-16 22:28
02-16 22:29
[snip]
07-23 20:02
07-23 20:03
07-24 10:06
08-20  2007
11-18  2007
12-13  2007
12-29  2007

4 A 'perl' approach

The 'perl' solution also uses an associative array or hash and is very similar to the 'awk' script.

Code:
 1  #!/usr/bin/perl -w
 2
 3  use strict ;
 4
 5  my %month = qw ( Jan 01 Feb 02 Mar 03 Apr 04 May 05 Jun 06
 6                   Jul 07 Aug 08 Sep 09 Oct 10 Nov 11 Dec 12 ) ;
 7
 8  # Format of ftp listing
 9  # -rw-r--r--  1    0     122    109360 Dec 5    13:45 INSTALL.i386
10  # -rw-r--r--  1    0     122    109360 Jan 5    2007  INSTALL.i386
11  # $perm       $blk $user $group $size  $m  $day $time $name
12
13  sub reformat {
14      my ($perm, $blk, $user, $group, $size, $m, $day, $time, $name) ;
15      my $len = 40 ;
16
17      while ( <> ) {
18          next if /^\s*$/ ;     # skip empty or whitespace only lines
19          next if /^total/i ;   # skip 'total blocks' line
20          ($perm, $blk, $user, $group, $size, $m, $day, $time, $name) = () ;
21          ($perm, $blk, $user, $group, $size, $m, $day, $time, $name) = split ;
22          printf("%s-%0+2u %+5s %-${len}s %+10u\n", $month{$m}, $day, $time, $name, $size) ;
23      }
24  }
25
26  reformat() ;

The 'use strict' pragma requires, among other things, that every variable be declared before it is used.

Lines 5-6 declare and initialize the associative array or hash '%month'. To print the number of the month of July you would use: print $month{Jul} . This is not much different from print month["Jul"] in 'awk'.

Unlike 'awk', 'perl' does not automatically split lines into variables. One advantage is that we can give meaningful names to the fields. We declare them as variables local to the 'reformat' subroutine in line 14.

An alternative would be to store the fields in an array, e.g. @fields. Because 'perl' arrays are indexed from 0, the month field would then be retrieved with $fields[5].

Code:
@fields = split ;

Lines 17-23 contain a loop which reads from the files named on the command line or, if none are given, from standard input. Each line is assigned to the variable $_. This variable, unless overridden, is used as the default in many 'perl' constructs. For example, the skipping of empty or whitespace-only lines in line 18 is actually a shortcut for next if ( $_ =~ /^\s*$/ ) .

A breakdown of the regular expression:

Code:
/  : starting delimiter
^  : beginning of line
\s : whitespace characters like spaces, tabs etc
*  : quantifier indicating zero or more occurrences of the preceding atom '\s'
$  : end of line
/  : closing delimiter

The variables that receive the fields are emptied in line 20.

The line is split into fields with the function 'split'. By default, just like in 'awk', this splitting uses whitespace as separator. It is another instance where $_ is assumed to hold the text to be processed.

The 'perl' printf has a syntax similar to the one in 'libc' from the C programming language. A convenient difference is that the 'perl' format is an ordinary double-quoted string, so a variable such as $len in line 22 is interpolated into the conversion specification before printf processes it.

Appendices

A.1 The 'awk' directory listing reformatter

Code:
#!/usr/bin/awk -f

# -rw-r--r-- 1  276 125 8251   Dec 13 11:56 915resolution-0.5.3.tgz
# -rw-r--r-- 1  276 125 121406 Dec 13 11:56 9libs-1.0p4.tgz
# $1         $2 $3  $4  $5     $6  $7 $8    $9

BEGIN {
    # associative array mapping month names -> numbers
    month["Jan"]="01"
    month["Feb"]="02"
    month["Mar"]="03"
    month["Apr"]="04"
    month["May"]="05"
    month["Jun"]="06"
    month["Jul"]="07"
    month["Aug"]="08"
    month["Sep"]="09"
    month["Oct"]="10"
    month["Nov"]="11"
    month["Dec"]="12"
}

{
    if ( NF != 9 ) { next }
    printf "%s-%02u % 5s %-40s %10u\n", month[$6], $7, $8, $9, $5
}

Code:
i=0
for M in Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec ; do
    # month["Jan"]="01"
    printf "month[\"%s\"]=\"%02u\"\n" $M $((++i))
done

A.1.1 A pattern matching variation

Code:
#!/usr/bin/awk -f

# -rw-r--r-- 1  276 125 8251   Dec 13 11:56 915resolution-0.5.3.tgz
# -rw-r--r-- 1  276 125 121406 Dec 13 11:56 9libs-1.0p4.tgz
# $1         $2 $3  $4  $5     $6  $7 $8    $9

$6 ~ /Jan/ { $6 = "01" }
$6 ~ /Feb/ { $6 = "02" }
$6 ~ /Mar/ { $6 = "03" }
$6 ~ /Apr/ { $6 = "04" }
$6 ~ /May/ { $6 = "05" }
$6 ~ /Jun/ { $6 = "06" }
$6 ~ /Jul/ { $6 = "07" }
$6 ~ /Aug/ { $6 = "08" }
$6 ~ /Sep/ { $6 = "09" }
$6 ~ /Oct/ { $6 = "10" }
$6 ~ /Nov/ { $6 = "11" }
$6 ~ /Dec/ { $6 = "12" }

{
    if ( NF != 9 ) { next }
    printf "%s-%0+2u %+5s %-40s %+10u\n", $6, $7, $8, $9, $5
}

The twelve pattern matching statements were generated with a simple shell script.

Code:
i=0
for M in Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec ; do
    # $6 ~ /Jan/ { $6 = "01" }
    printf " \$6 ~ /%s/ { \$6 = \"%02u\" }\n" $M $((++i))
done

A.1.2 The 'awk' date extractor

Code:
#!/usr/bin/awk -f

# -rw-r--r-- 1  276 125 8251   Dec 13 11:56 915resolution-0.5.3.tgz
# -rw-r--r-- 1  276 125 121406 Dec 13 11:56 9libs-1.0p4.tgz
# $1         $2 $3  $4  $5     $6  $7 $8    $9

BEGIN {
    # associative array mapping month names -> numbers
    month["Jan"]="01"
    month["Feb"]="02"
    month["Mar"]="03"
    month["Apr"]="04"
    month["May"]="05"
    month["Jun"]="06"
    month["Jul"]="07"
    month["Aug"]="08"
    month["Sep"]="09"
    month["Oct"]="10"
    month["Nov"]="11"
    month["Dec"]="12"
}

{
    if ( NF != 9 ) { next }
    printf "%s-%02u % 5s\n", month[$6], $7, $8
}

B.1 The 'perl' reformatter

Code:
#!/usr/bin/perl -w

use strict ;

my %month = qw ( Jan 01 Feb 02 Mar 03 Apr 04 May 05 Jun 06
                 Jul 07 Aug 08 Sep 09 Oct 10 Nov 11 Dec 12 ) ;

# Format of ftp listing
# -rw-r--r--  1    0     122    109360 Dec 5    13:45 INSTALL.i386
# -rw-r--r--  1    0     122    109360 Jan 5    2007  INSTALL.i386
# $perm       $blk $user $group $size  $m  $day $time $name

sub reformat {
    my ($perm, $blk, $user, $group, $size, $m, $day, $time, $name) ;
    my $len = 40 ;

    while ( <> ) {
        next if /^\s*$/ ;     # skip empty or whitespace only lines
        next if /^total/i ;   # skip 'total blocks' line
        ($perm, $blk, $user, $group, $size, $m, $day, $time, $name) = () ;
        ($perm, $blk, $user, $group, $size, $m, $day, $time, $name) = split ;
        printf("%s-%0+2u %+5s %-${len}s %+10u\n", $month{$m}, $day, $time, $name, $size) ;
    }
}

reformat() ;

$Id: Reformatting.xml,v 1.4 2008/12/23 22:13:06 j65nko Exp $
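The date extractor of A.1.2 combined with 'sort -u' shows which time stamps occur in a listing, but not how many files carry each stamp. The following is a minimal sketch of one way to count them, not part of the original scripts. The function name count_dates is hypothetical, the month table is built with split() and sprintf() rather than twelve assignments, and the test is relaxed from NF != 9 to NF < 9 on the assumption that entries whose file names contain spaces should be counted too.

```shell
#!/bin/sh

# count_dates: hypothetical helper, not one of the original scripts.
# Reads an 'ls -l' style ftp listing from the named files (or from
# standard input) and prints each "MM-DD time-or-year" stamp followed
# by the number of files carrying that stamp.
count_dates() {
    awk '
    BEGIN {
        # build month["Jan"]="01" ... month["Dec"]="12" from one string
        n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", name, " ")
        for (i = 1; i <= n; i++)
            month[name[i]] = sprintf("%02d", i)
    }
    NF < 9 { next }     # skips the "total" line and empty lines
    {
        # same layout as the date extractor: month-day, then time or year
        stamp = sprintf("%s-%02d %5s", month[$6], $7, $8)
        count[stamp]++
    }
    END {
        for (s in count)
            printf "%s %d\n", s, count[s]
    }' "$@" | sort
}
```

It could be called as count_dates pkg_snap-alberta, or at the end of a pipeline. A mirror that is complete and up to date should show only a handful of stamps; a stamp with a single file, or a stamp that carries a year instead of a time, is a hint to look closer with 'grep'.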
$Id: book-vbul-html.xsl,v 1.3 2008/12/24 02:59:45 j65nko Exp $