Reformatting ftp listings with awk and perl
Table of contents
- 1 Introduction
- 2 The problem
- 3 Using 'awk'
- 3.1 Associative array or hash table
- 3.2 Formatting with printf
- 3.3 The complete program in 'awk'
- 3.4 Back to the original problem
- 4 A 'perl' approach
Appendices
- A.1 The 'awk' directory listing reformatter
- A.1.1 A pattern matching variation
- A.1.2 The 'awk' date extractor
- B.1 The 'perl' reformatter
1 Introduction
While the 'ls' utility provides a wealth of options to display a directory listing, you are left in the cold if you want to customize the format of a directory listing retrieved from an ftp server.
I needed a listing sorted by date.
The first script is written in 'awk'; the 'perl' script accomplishes the same goal.
2 The problem
Before I install pre-compiled packages, I need to make sure that the ftp mirror has a complete copy of the package repository at ftp.openbsd.org.
Retrieval of such a listing can be done easily from the command line:
Code:
$ echo 'ls /pub/OpenBSD/snapshots/packages/i386 packages_snap' | \
ftp -4ai ftp.calyx.nl
The ls command, as described in ftp(1):
Code:
ls [remote-directory [local-file]]
Print a listing of the contents of a directory on the remote
machine. The listing includes any system-dependent informa-
tion that the server chooses to include; for example, most
UNIX systems will produce output from the command `ls -l'.
If remote-directory is left unspecified, the current working
directory is used. If interactive prompting is on, ftp will
prompt the user to verify that the last argument is indeed
the target local file for receiving ls output. If no local
file is specified, or if local-file is `-', the output is
sent to the terminal.
The file
'packages_snap' is the local file to store the output.
The options:
Code:
-4 Forces ftp to use IPv4 addresses only.
-a Causes ftp to bypass the normal login procedure and use an anony-
mous login instead.
-i Turns off interactive prompting during multiple file transfers.
Although the description mentions only prompts during multiple file transfers, the -i option also turns off the prompt that asks whether local-file is indeed the file to receive the ls output.
The first lines of the retrieved listing:
Code:
$ head -6 packages_snap
total 18002768
-rw-r--r-- 1 276 125 8246 Dec 11 07:35 915resolution-0.5.3.tgz
-rw-r--r-- 1 276 125 121402 Dec 11 07:35 9libs-1.0p4.tgz
-rw-r--r-- 1 276 125 9874 Dec 11 07:35 9menu-1.7.tgz
-rw-r--r-- 1 276 125 22054 Dec 11 07:35 9wm-1.2prep0.tgz
-rw-r--r-- 1 276 125 215250 Dec 11 07:35 AcePerl-1.91p0-opt.tgz
With a local copy of the package directory listing, I need to extract the month, day and time fields and write them to a temporary file. The last step is sorting with 'sort -u' to produce a listing of the unique dates.
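The plan just described can be sketched as a single pipeline, in which the pipe takes the place of the temporary file. This is only a sketch: the function name unique_dates is made up for illustration, and it assumes that every file line in the listing has exactly nine whitespace-separated fields, as in the sample above.

```sh
#!/bin/sh
# unique_dates (hypothetical helper): print the distinct "month day time"
# combinations found in an ls -l style listing read from standard input.
unique_dates() {
    # Lines with nine fields are file entries; fields 6, 7 and 8 hold
    # the month, the day and the time (or year).
    awk 'NF == 9 { print $6, $7, $8 }' | sort -u
}

# Sample input copied from the listing above
unique_dates <<'EOF'
total 18002768
-rw-r--r-- 1 276 125 8246 Dec 11 07:35 915resolution-0.5.3.tgz
-rw-r--r-- 1 276 125 121402 Dec 11 07:35 9libs-1.0p4.tgz
EOF
```

For the two sample entries this prints a single line, Dec 11 07:35.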
3 Using 'awk'
In an interview with the Australian Computer World magazine, Alfred V. Aho, one of the three architects of the 'awk' programming language, summarizes the language as follows:
"AWK is a language for processing files of text. A file is treated as a sequence of records, and by default each line is a record. Each line is broken up into a sequence of fields, so we can think of the first word in a line as the first field, the second word as the second field, and so on. An AWK program is of a sequence of pattern-action statements. AWK reads the input a line at a time. A line is scanned for each pattern in the program, and for each pattern that matches, the associated action is executed."
When 'awk' reads a line, it automatically splits the line into these words or fields and stores them in numbered variables like $1, $2, $3 etc. Take a line from a directory listing as an example:
Code:
-rw-r--r-- 1 276 125 8246 Dec 11 07:35 915resolution-0.5.3.tgz
$1 : -rw-r--r--
$2 : 1
$3 : 276
$4 : 125
$5 : 8246
$6 : Dec
$7 : 11
$8 : 07:35
$9 : 915resolution-0.5.3.tgz
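The field numbering can be checked with a small loop over all fields. This is a sketch rather than part of the original scripts: the function name show_fields is hypothetical, and it relies on the built-in awk variable NF, which holds the number of fields in the current line.

```sh
#!/bin/sh
# show_fields (hypothetical helper): print every field of each input line
# together with its field number.
show_fields() {
    # $i is the i-th field; the "$" in the format string is printed literally
    awk '{ for (i = 1; i <= NF; i++) printf "$%d : %s\n", i, $i }'
}

echo '-rw-r--r-- 1 276 125 8246 Dec 11 07:35 915resolution-0.5.3.tgz' | show_fields
```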
Being interested in the month ($6), day ($7), time ($8), name ($9) and size ($5) fields, a basic
'awk' script would look like this:
Code:
$ awk '{ print $6, $7, $8, $9, $5 }' test-file
Dec 11 07:35 915resolution-0.5.3.tgz 8246
Dec 11 07:35 9libs-1.0p4.tgz 121402
The only thing left is to convert the name of months like Nov and Dec into their ordinals 11 and 12.
3.1 Associative array or hash table
Modern shells do support arrays. For example:
Code:
$ month[1]=Jan
$ month[2]=Feb
$ month[3]=Mar
$ for N in 1 2 3 ; do echo Month $N = ${month[$N]} ; done
Month 1 = Jan
Month 2 = Feb
Month 3 = Mar
Unfortunately these arrays only allow a numeric index. 'awk', however, supports what is called an associative array or hash table. These data structures accept non-numeric keys to look up the corresponding value. Thus our month to number mapping problem can be solved like this:
Code:
month["Jan"]="01"
month["Feb"]="02"
month["Mar"]="03"
A short command line example.
Code:
$ (echo Nov 10; echo Jan 21) | \
awk 'BEGIN { month["Nov"]="11" ; month["Jan"]="01" } { print month[$1], $2 }'
11 10
01 21
The two lines "Nov 10" and "Jan 21" are fed to a short 'awk' script. The 'awk' initialization block BEGIN { month["Nov"]="11" ; month["Jan"]="01" } stores mappings of the text strings "Nov" and "Jan" to "11" and "01" in the hash table 'month'.
The { print month[$1], $2 } block is executed for each line fed to 'awk'.
For the first line, the field 'Nov' is assigned to the variable $1 and the second field '10' to $2. The command print month[$1] is then evaluated as print month["Nov"], which produces the corresponding value "11".
The same procedure is repeated for the "Jan 21" line, resulting in the "01 21" output.
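A lookup with a key that was never stored does not fail in 'awk'; it simply yields an empty string. The sketch below makes that case visible by testing the key with the 'in' operator first. The function name month_number and the "??" marker are assumptions chosen for illustration.

```sh
#!/bin/sh
# month_number (hypothetical helper): translate a month name in the first
# field to its number, and flag first fields that have no mapping.
month_number() {
    awk 'BEGIN { month["Nov"] = "11"; month["Jan"] = "01" }
         {
             if ($1 in month)
                 print month[$1], $2
             else
                 print "??", $2      # no entry for this key
         }'
}

printf 'Nov 10\ntotal 42\n' | month_number
```

The first input line gives 11 10, while the second, which has no month name in its first field, gives ?? 42.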
The complete program so far:
Code:
#!/usr/bin/awk -f
BEGIN {
# associative array mapping month names -> numbers
month["Jan"]="01"
month["Feb"]="02"
month["Mar"]="03"
month["Apr"]="04"
month["May"]="05"
month["Jun"]="06"
month["Jul"]="07"
month["Aug"]="08"
month["Sep"]="09"
month["Oct"]="10"
month["Nov"]="11"
month["Dec"]="12"
}
{
print $6, $7, $8, $9, $5
}
Saved as 'reformat.awk' and made executable with chmod +x reformat.awk, it produces the following:
Code:
$ ls -l | tail -3
-rw-r--r-- 1 j65nko j65nko 52 Dec 9 01:04 q
-rwxr-xr-x 1 j65nko j65nko 330 Dec 16 23:11 reformat.awk
-rw-r--r-- 1 j65nko j65nko 387 Jan 31 2007 tmp
$ ls -l | tail -3 | reformat.awk
Dec 9 01:04 q 52
Dec 16 23:11 reformat.awk 330
Jan 31 2007 tmp 387
Although we extracted the fields we wanted, they do not align nicely.
3.2 Formatting with printf
Both 'awk' and 'sh' provide a printf function similar to the one found in the standard C library.
To shorten the examples, we will first use the shell's printf. See printf(1).
To format an unsigned decimal number we use the %u conversion specification. Between the '%' and 'u' we insert a padding character '0' and the field width '2' so we get %02u.
Code:
$ printf "%02u\n" 1
01
$ printf "%02u\n" 12
12
A test using
'awk':
Code:
$ echo Jan 1 | awk '{ printf "%s %02u \n", $1, $2 }'
Jan 01
The name of the month in $1 is printed with a %s conversion specifier, while the '1' assigned to $2 is formatted according to %02u.
The other problem was that the same field could hold either a time string of 5 positions or a year using only 4 positions. By choosing a field width of 5 and a blank for padding we can solve this last issue.
Code:
$ printf "% 5u\n" 2007
 2007
$ printf "% 5u\n" 12:11
printf: 12:11: not completely converted
   12
The %u conversion tries to convert the time 12:11 to an unsigned integer. This of course fails, because ':' is not a digit. So we specify a string conversion with %s.
Code:
$ printf "% +5s\n" 12:11 2007
12:11
 2007
Although formatting the remaining fields is not needed to solve the sorting problem, we will have a look at the others for completeness' sake.
File name aligned in a field of 40 positions:
Code:
$ printf "%40s\n" "zope-coreblog-1.2p0.tgz"
                 zope-coreblog-1.2p0.tgz
Right justification is the default. For left adjustment we insert a '-' before the field width '40'.
Code:
$ printf "%-40s\n" "zope-coreblog-1.2p0.tgz"
zope-coreblog-1.2p0.tgz
For the file size we use the %u conversion specifier, with '10' as field width.
Code:
$ printf "%10u\n" 169395
    169395
Combining the file name and file size in one statement:
Code:
$ printf "%-40s %10u\n" "zope-coreblog-1.2p0.tgz" "169395"
zope-coreblog-1.2p0.tgz                      169395
3.3 The complete program in 'awk'
Merging all the conversion specifications into one statement, the final
printf for
'awk' is:
Code:
printf "%s-%02u % 5s %-40s %10u\n", month[$6], $7, $8, $9, $5
%s    : prints the number produced by month[$6]
-     : prints a literal '-' to separate month from day
%02u  : day of the month as found in $7
' '   : a space
% 5s  : the time or year from $8 aligned right in field of 5
' '   : a space
%-40s : file name from $9, left aligned in a 40 positions wide field
' '   : a space
%10u  : file size in $5, justified right in a 10 position field
\n    : a newline or linefeed
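The combined format can be tried from the shell before it goes into the 'awk' script. The sketch below is an illustration under a few assumptions: the function name format_entry is made up, the month number is passed in directly instead of being looked up, and %5s is used where the article writes % 5s, because in C the blank flag is only defined for signed numeric conversions.

```sh
#!/bin/sh
# format_entry (hypothetical helper): format one listing entry.
# Arguments: month-number day time-or-year file-name size
# Note: the shell printf reads a numeric argument with a leading 0 as octal,
# so the day should be passed as 9, not as 09.
format_entry() {
    printf '%s-%02u %5s %-40s %10u\n' "$1" "$2" "$3" "$4" "$5"
}

format_entry 12 13 11:56 915resolution-0.5.3.tgz 8251
format_entry 01 31 2007 tmp 387
```

Each output line is 63 characters wide: 5 for the date, 5 for the time or year, 40 for the name, 10 for the size, and three separating spaces.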
The resulting program:
Code:
#!/usr/bin/awk -f
# -rw-r--r-- 1 276 125 8251 Dec 13 11:56 915resolution-0.5.3.tgz
# -rw-r--r-- 1 276 125 121406 Dec 13 11:56 9libs-1.0p4.tgz
# $1 $2 $3 $4 $5 $6 $7 $8 $9
BEGIN {
# associative array mapping month names -> numbers
month["Jan"]="01"
month["Feb"]="02"
month["Mar"]="03"
month["Apr"]="04"
month["May"]="05"
month["Jun"]="06"
month["Jul"]="07"
month["Aug"]="08"
month["Sep"]="09"
month["Oct"]="10"
month["Nov"]="11"
month["Dec"]="12"
}
{
#print $6, $7, $8, $9, $5
printf "%s-%02u % 5s %-40s %10u\n", month[$6], $7, $8, $9, $5
}
By adding the two sample lines and the names of the field variables as comments, we document in a simple way what the program is actually doing.
A test, however, reveals a slight problem.
Code:
$ head -4 pkg_snap-alberta | reformat.awk
-00 0
12-13 11:56 915resolution-0.5.3.tgz 8251
12-13 11:56 9libs-1.0p4.tgz 121406
12-13 11:56 9menu-1.7.tgz 9877
A closer look at the first four lines:
Code:
$ head -4 pkg_snap-alberta
total 18016088
-rw-r--r-- 1 276 125 8251 Dec 13 11:56 915resolution-0.5.3.tgz
-rw-r--r-- 1 276 125 121406 Dec 13 11:56 9libs-1.0p4.tgz
-rw-r--r-- 1 276 125 9877 Dec 13 11:56 9menu-1.7.tgz
The cause is the first line, which shows the total number of blocks.
A simple solution is to ignore lines which do not have 9 fields.
'awk' has a variable called NF, which contains the number of fields and is updated for each line.
Code:
{ if ( NF != 9 ) { next }
printf "%s-%02u % 5s %-40s %10u\n", month[$6], $7, $8, $9, $5
}
So if the number of fields, stored in NF, is not equal to 9, 'awk' skips the remainder of the script and starts processing the next input line.
Code:
$ head -4 pkg_snap-alberta | reformat.awk
12-13 11:56 915resolution-0.5.3.tgz 8251
12-13 11:56 9libs-1.0p4.tgz 121406
12-13 11:56 9menu-1.7.tgz 9877
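The same filter can also be written as an 'awk' pattern in front of the action, instead of an if statement inside it. The following is a minimal sketch, not the article's script: the function name reformat_listing is hypothetical, the month table is built with split() as a shortcut, and %5s is used in place of % 5s.

```sh
#!/bin/sh
# reformat_listing (hypothetical helper): reformat an ls -l style listing
# read from standard input, skipping lines that do not have nine fields.
reformat_listing() {
    awk '
    BEGIN {
        # build the month table from a list instead of twelve assignments
        n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", name, " ")
        for (i = 1; i <= n; i++)
            month[name[i]] = sprintf("%02d", i)
    }
    # the pattern NF == 9 selects the lines; all other lines are skipped
    NF == 9 {
        printf "%s-%02u %5s %-40s %10u\n", month[$6], $7, $8, $9, $5
    }'
}

reformat_listing <<'EOF'
total 18016088
-rw-r--r-- 1 276 125 8251 Dec 13 11:56 915resolution-0.5.3.tgz
-rw-r--r-- 1 276 125 121406 Dec 13 11:56 9libs-1.0p4.tgz
EOF
```

With a pattern, the action only runs for matching lines, so no explicit next is needed.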
Two examples:
Code:
$ reformat.awk < /home/j65nko/packages_snap | sort | head -4
12-13 11:56 915resolution-0.5.3.tgz 8251
12-13 11:56 9libs-1.0p4.tgz 121406
12-13 11:56 9menu-1.7.tgz 9877
12-13 11:56 9wm-1.2prep0.tgz 22061
$ reformat.awk < /home/j65nko/packages_snap | sort | tail -4
12-13 12:03 ztrack-1.0.tgz 8718
12-13 12:03 zziplib-0.13.49.tgz 98528
12-13 12:03 zzuf-0.12.tgz 124942
12-14 04:07 index.txt 117936
3.4 Back to the original problem
Extracting the date information is now simply a matter of printing fields 6, 7 and 8:
Code:
if ( NF != 9 ) { next }
printf "%s-%02u % 5s\n", month[$6], $7, $8
Saving the modified script as 'extract-dates.awk', we can now do the following to see the unique dates of the package directory listing.
Code:
$ extract-dates.awk pkg_snap-alberta | sort -u
12-13 11:56
12-13 11:57
12-13 11:58
12-13 11:59
12-13 12:00
12-13 12:01
12-13 12:02
12-13 12:03
12-14 04:07
The oldest package is dated 13th of Dec 11:56. We see a regular minute increment up to 12:03. But then there is a big gap until the 14th of Dec 04:07.
Using
'grep' it is easy to find the files dated '12-14 04:07':
Code:
$ reformat.awk pkg_snap-alberta | grep '12-14 04:07'
12-14 04:07 index.txt 117936
It is 'index.txt', an index file which is usually generated some hours after a directory update.
Repeating the same procedure for a listing from the ftp.stacken.kth.se mirror:
Code:
$ extract-dates.awk pkg_snap-stacken | sort -u | cat -n
1 12-13 18:56
2 12-13 18:57
3 12-13 18:58
4 12-13 18:59
5 12-13 19:00
6 12-13 19:01
7 12-13 19:02
8 12-13 19:03
9 12-14 11:07
We see the same pattern, albeit with a seven-hour difference, caused by the different time zones of Alberta, Canada and Stockholm, Sweden.
A listing of a mirror in the middle of an update:
Code:
11-08 17:06
11-08 17:07
11-08 17:08
11-08 17:09
11-08 17:10
11-15 19:47
Here the packages of the 8th of November have only partially been replaced by the new ones dated November 15.
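To judge how far such an update has progressed, it can help to count the files per day. The sketch below does this with an associative array used as a counter. The function name files_per_day and the file names in the sample data are made up; only the dates are taken from the listing above.

```sh
#!/bin/sh
# files_per_day (hypothetical helper): count the files per "month day" in an
# ls -l style listing read from standard input.
files_per_day() {
    # count[] is indexed by the string "month day"; the order of a
    # for (key in array) loop is unspecified, hence the sort afterwards
    awk 'NF == 9 { count[$6 " " $7]++ }
         END { for (d in count) print d, count[d] }' | LC_ALL=C sort
}

# Made-up file names, dates as in the listing above
files_per_day <<'EOF'
total 1234
-rw-r--r-- 1 276 125 100 Nov 8 17:06 a-1.0.tgz
-rw-r--r-- 1 276 125 100 Nov 8 17:07 b-1.0.tgz
-rw-r--r-- 1 276 125 100 Nov 15 19:47 c-1.1.tgz
EOF
```

The output is sorted as plain text, so Nov 15 is listed before Nov 8.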
A listing from an ftp site whose system administrator confirmed in an email that the 'rsync' 'delete' option had been forgotten:
Code:
01-07 2008
01-22 2008
01-31 15:17
01-31 15:18
01-31 15:19
01-31 15:20
01-31 15:21
01-31 15:22
01-31 15:23
02-16 22:24
02-16 22:25
02-16 22:26
02-16 22:27
02-16 22:28
02-16 22:29
[snip]
07-23 20:02
07-23 20:03
07-24 10:06
08-20 2007
11-18 2007
12-13 2007
12-29 2007
Upgrading your packages from such a mirror would be a recipe for disaster. Luckily a rather simple script can warn you about such a scenario.
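One possible shape of such a warning script is sketched here. It is not the author's script: the function name check_mirror is hypothetical, the default limit of 2 distinct days is an arbitrary assumption, and 'index.txt' is left out of the count because it is generated later than the packages.

```sh
#!/bin/sh
# check_mirror (hypothetical helper): read a listing on standard input and
# warn when the files carry more distinct month/day stamps than expected.
# Argument: the maximum number of distinct days (assumed default: 2).
check_mirror() {
    limit=${1:-2}
    # seen[] remembers the days already counted; n + 0 prints 0 for no input
    days=$(awk 'NF == 9 && $9 != "index.txt" && !seen[$6 " " $7]++ { n++ }
                END { print n + 0 }')
    if [ "$days" -gt "$limit" ]; then
        echo "WARNING: $days different dates found, mirror may be inconsistent"
        return 1
    fi
    echo "OK: $days different date(s) found"
}

# Made-up file names, dates as in the listing above
check_mirror <<'EOF' || echo "(non-zero exit status)"
-rw-r--r-- 1 276 125 100 Jan 7 2008 a-1.0.tgz
-rw-r--r-- 1 276 125 100 Feb 16 22:24 b-1.0.tgz
-rw-r--r-- 1 276 125 100 Jul 23 20:02 c-1.0.tgz
EOF
```

The exit status makes it possible to stop an upgrade script early, for example with check_mirror < packages_snap || exit 1.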
4 A 'perl' approach
The
'perl' solution also uses an associative array or hash and is very similar to the
'awk' script.
Code:
1 #!/usr/bin/perl -w
2
3 use strict ;
4
5 my %month = qw ( Jan 01 Feb 02 Mar 03 Apr 04 May 05 Jun 06
6 Jul 07 Aug 08 Sep 09 Oct 10 Nov 11 Dec 12 ) ;
7
8 # Format of ftp listing
9 # -rw-r--r-- 1 0 122 109360 Dec 5 13:45 INSTALL.i386
10 # -rw-r--r-- 1 0 122 109360 Jan 5 2007 INSTALL.i386
11 # $perm $blk $user $group $size $m $day $time $name
12
13 sub reformat {
14 my ($perm, $blk, $user, $group, $size, $m, $day, $time, $name) ;
15 my $len = 40 ;
16
17 while ( <> ) {
18 next if /^\s*$/ ; # skip empty or whitespace only lines
19 next if /^total/i ; # skip 'total blocks' line
20 ($perm, $blk, $user, $group, $size, $m, $day, $time, $name) = () ;
21 ($perm, $blk, $user, $group, $size, $m, $day, $time, $name) = split ;
22 printf("%s-%0+2u %+5s %-${len}s %+10u\n", $month{$m}, $day, $time, $name, $size) ;
23 }
24 }
25
26 reformat() ;
We follow the recommended practice of using the -w option, in order to have perl issue warnings about questionable constructs.
The 'use strict' pragma forces the programmer to declare all variables before they are used, and disallows some other error-prone constructs.
Lines 5-6 declare and initialize the associative array or hash '%month'. To print the number of the month of July you would use print $month{Jul}. This is not much different from print month["Jul"] in 'awk'.
Unlike 'awk', 'perl' does not automatically split lines into variables. One advantage is that we can give meaningful names to the fields. We declare them as variables local to the 'reformat' subroutine in line 14.
An alternative would be to use an array, e.g. @fields, for storage. Because perl arrays are indexed from 0, the month, being the sixth field, would then be retrieved with $fields[5].
The variable
'$len' is used in the
'printf' statement to allow easy change of the width of the file name field.
Lines 17-23 contain a loop which reads from the files named on the command line or, if there are none, from standard input. Each line is assigned to the variable $_. This variable, unless overridden, is used as default in many 'perl' constructs.
For example, the skipping of empty or whitespace-only lines in line 18 is actually a shortcut for next if ( $_ =~ /^\s*$/ ) .
A break down of the regular expression:
Code:
/ : starting delimiter
^ : beginning of line
\s : whitespace characters like spaces, tabs etc
* : quantifier indicating zero or more occurrences of the preceding atom '\s'
$ : end of line
/ : closing delimiter
In program line 19, lines starting with the text 'total' are also skipped. Here again
$_, holding the current line read from standard input, is the implicit variable. The 'i' modifier makes the pattern match a case-insensitive one.
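For comparison, the same two filters can be expressed as 'awk' patterns. This is a sketch with a hypothetical function name, skip_noise. It uses [ \t] instead of \s, and tolower() instead of the 'i' modifier, because those two perl features are not part of standard 'awk'.

```sh
#!/bin/sh
# skip_noise (hypothetical helper): drop blank lines and the "total" line,
# and pass all other lines through unchanged.
skip_noise() {
    awk '
    /^[ \t]*$/             { next }   # empty or whitespace-only line
    tolower($0) ~ /^total/ { next }   # "total" line, in any letter case
    { print }'
}

printf 'total 18016088\n\n-rw-r--r-- 1 276 125 9877 Dec 13 11:56 9menu-1.7.tgz\n' | skip_noise
```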
The variables to receive the fields are emptied in line 20.
The splitting of the line into fields is done with the function 'split'. By default, just like in 'awk', this splitting uses whitespace as separator. It is also another instance where $_ is assumed to hold the text to be split.
The 'perl' printf has a syntax similar to the one in 'libc' from the C programming language. A notable difference is that 'perl' interpolates variables in the double-quoted format string, so a variable can be used inside a conversion specification, as is done with the variable $len in line 22.
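A comparable effect can be had in 'awk' by assembling the format string with string concatenation. The sketch below assumes a hypothetical function print_name_size that reads lines consisting of a name and a size; the width is passed in with the -v option and defaults to 40, as in the perl script.

```sh
#!/bin/sh
# print_name_size (hypothetical helper): print "name size" lines with a
# name column of configurable width.
# Argument: width of the name column (default 40).
print_name_size() {
    awk -v len="${1:-40}" '
    BEGIN { fmt = "%-" len "s %10u\n" }   # for len=40: "%-40s %10u\n"
    { printf fmt, $1, $2 }'
}

echo 'zzuf-0.12.tgz 124942' | print_name_size 20
```

With a width of 20 the output line is 31 characters long: 20 for the name, one space, and 10 for the size.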
Appendices
A.1 The 'awk' directory listing reformatter
Code:
#!/usr/bin/awk -f
# -rw-r--r-- 1 276 125 8251 Dec 13 11:56 915resolution-0.5.3.tgz
# -rw-r--r-- 1 276 125 121406 Dec 13 11:56 9libs-1.0p4.tgz
# $1 $2 $3 $4 $5 $6 $7 $8 $9
BEGIN {
# associative array mapping month names -> numbers
month["Jan"]="01"
month["Feb"]="02"
month["Mar"]="03"
month["Apr"]="04"
month["May"]="05"
month["Jun"]="06"
month["Jul"]="07"
month["Aug"]="08"
month["Sep"]="09"
month["Oct"]="10"
month["Nov"]="11"
month["Dec"]="12"
}
{ if ( NF != 9 ) { next }
printf "%s-%02u % 5s %-40s %10u\n", month[$6], $7, $8, $9, $5
}
Because laziness is a programmer's virtue, a short shell script generated the initialization statements for the associative array.
Code:
i=0
for M in Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec ; do
# month["Jan"]="01"
printf "month[\"%s\"]=\"%02u\"\n" $M $((++i))
done
A.1.1 A pattern matching variation
Code:
#!/usr/bin/awk -f
# -rw-r--r-- 1 276 125 8251 Dec 13 11:56 915resolution-0.5.3.tgz
# -rw-r--r-- 1 276 125 121406 Dec 13 11:56 9libs-1.0p4.tgz
# $1 $2 $3 $4 $5 $6 $7 $8 $9
$6 ~ /Jan/ { $6 = "01" }
$6 ~ /Feb/ { $6 = "02" }
$6 ~ /Mar/ { $6 = "03" }
$6 ~ /Apr/ { $6 = "04" }
$6 ~ /May/ { $6 = "05" }
$6 ~ /Jun/ { $6 = "06" }
$6 ~ /Jul/ { $6 = "07" }
$6 ~ /Aug/ { $6 = "08" }
$6 ~ /Sep/ { $6 = "09" }
$6 ~ /Oct/ { $6 = "10" }
$6 ~ /Nov/ { $6 = "11" }
$6 ~ /Dec/ { $6 = "12" }
{
if ( NF != 9 ) { next }
printf "%s-%0+2u %+5s %-40s %+10u\n", $6, $7, $8, $9, $5
}
Instead of an associative array, pattern matching is being used here. Paraphrasing the first line: "If $6 matches the string 'Jan', store the string '01' in $6".
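Note that $6 ~ /Jan/ is a regular expression match, which succeeds whenever the field contains 'Jan' anywhere. For the month field of a listing this makes no difference, but the distinction from an exact comparison can be shown with a small sketch. The function name match_kinds and the input words are made up for illustration.

```sh
#!/bin/sh
# match_kinds (hypothetical helper): report which kind of test matches the
# first field of each input line.
match_kinds() {
    awk '
    $1 ~ /Jan/   { printf "%s: regex match\n", $1 }
    $1 ~ /^Jan$/ { printf "%s: anchored regex match\n", $1 }
    $1 == "Jan"  { printf "%s: exact match\n", $1 }'
}

printf 'Jan\nJanuary\n' | match_kinds
```

'Jan' satisfies all three tests, while 'January' only satisfies the unanchored regular expression.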
The twelve pattern matching statements were generated with a simple shell script.
Code:
i=0
for M in Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec ; do
# $6 ~ /Jan/ { $6 = "01" }
printf " \$6 ~ /%s/ { \$6 = \"%02u\" }\n" $M $((++i))
done
A.1.2 The 'awk' date extractor
Code:
#!/usr/bin/awk -f
# -rw-r--r-- 1 276 125 8251 Dec 13 11:56 915resolution-0.5.3.tgz
# -rw-r--r-- 1 276 125 121406 Dec 13 11:56 9libs-1.0p4.tgz
# $1 $2 $3 $4 $5 $6 $7 $8 $9
BEGIN {
# associative array mapping month names -> numbers
month["Jan"]="01"
month["Feb"]="02"
month["Mar"]="03"
month["Apr"]="04"
month["May"]="05"
month["Jun"]="06"
month["Jul"]="07"
month["Aug"]="08"
month["Sep"]="09"
month["Oct"]="10"
month["Nov"]="11"
month["Dec"]="12"
}
{
if ( NF != 9 ) { next }
printf "%s-%02u % 5s\n", month[$6], $7, $8
}
B.1 The 'perl' reformatter
Code:
#!/usr/bin/perl -w
use strict ;
my %month = qw ( Jan 01 Feb 02 Mar 03 Apr 04 May 05 Jun 06
Jul 07 Aug 08 Sep 09 Oct 10 Nov 11 Dec 12 ) ;
# Format of ftp listing
# -rw-r--r-- 1 0 122 109360 Dec 5 13:45 INSTALL.i386
# -rw-r--r-- 1 0 122 109360 Jan 5 2007 INSTALL.i386
# $perm $blk $user $group $size $m $day $time $name
sub reformat {
my ($perm, $blk, $user, $group, $size, $m, $day, $time, $name) ;
my $len = 40 ;
while ( <> ) {
next if /^\s*$/ ; # skip empty or whitespace only lines
next if /^total/i ; # skip 'total blocks' line
($perm, $blk, $user, $group, $size, $m, $day, $time, $name) = () ;
($perm, $blk, $user, $group, $size, $m, $day, $time, $name) = split ;
printf("%s-%0+2u %+5s %-${len}s %+10u\n", $month{$m}, $day, $time, $name, $size) ;
}
}
reformat() ;
$Id: Reformatting.xml,v 1.4 2008/12/23 22:13:06 j65nko Exp $