Reformatting ftp listings with awk and perl

J65nko · #1 24th December 2008

Table of contents

1 Introduction
2 The problem
3 Using 'awk'
3.1 Associative array or hash table
3.2 Formatting with printf
3.3 The complete program in 'awk'
3.4 Back to the original problem
4 A 'perl' approach

Appendices

A.1 The 'awk' directory listing reformatter
A.1.1 A pattern matching variation
A.1.2 The 'awk' date extractor
B.1 The 'perl' reformatter

1 Introduction

While the 'ls' utility provides a wealth of options to display a directory listing, you are left in the cold if you want to customize the format of a directory listing retrieved from an ftp server.

I needed a listing sorted by date.

The first script is written in 'awk'; the 'perl' script accomplishes the same goal.

2 The problem

Before I install pre-compiled packages, I need to make sure that the ftp mirror has a complete copy of the package repository at ftp.openbsd.org.

Retrieval of such a listing can be done easily from the command line:

Code:

$ echo 'ls /pub/OpenBSD/snapshots/packages/i386 packages_snap' | \
ftp -4ai ftp.calyx.nl

The 'ls' command as described in ftp(1):

Code:

ls [remote-directory [local-file]]
	Print a listing of the contents of a directory on the remote
	machine.  The listing includes any system-dependent informa-
	tion that the server chooses to include; for example, most
	UNIX systems will produce output from the command `ls -l'.
	If remote-directory is left unspecified, the current working
	directory is used.  If interactive prompting is on, ftp will
	prompt the user to verify that the last argument is indeed
	the target local file for receiving ls output.  If no local
	file is specified, or if local-file is `-', the output is
	sent to the terminal.

The file 'packages_snap' is the local file to store the output.

The options:

Code:

-4      Forces ftp to use IPv4 addresses only.
-a      Causes ftp to bypass the normal login procedure and use an anony-
        mous login instead.
-i      Turns off interactive prompting during multiple file transfers.

Although the manual page only mentions multiple file transfer prompts, the -i option also turns off the prompt asking whether local-file is indeed the file to receive the ls output.

Code:

$ head -6  packages_snap
total 18002768
-rw-r--r--  1 276  125       8246 Dec 11 07:35 915resolution-0.5.3.tgz
-rw-r--r--  1 276  125     121402 Dec 11 07:35 9libs-1.0p4.tgz
-rw-r--r--  1 276  125       9874 Dec 11 07:35 9menu-1.7.tgz
-rw-r--r--  1 276  125      22054 Dec 11 07:35 9wm-1.2prep0.tgz
-rw-r--r--  1 276  125     215250 Dec 11 07:35 AcePerl-1.91p0-opt.tgz

Having a local copy of the package directory listing, I then need to extract the month, day and time fields and write them to a temporary file. The last step is sorting with 'sort -u' to produce a listing of the unique dates.

3 Using 'awk'

In an interview with the Australian Computer World magazine, Alfred V. Aho, one of the three architects of the 'awk' programming language, summarizes the language as follows:

"AWK is a language for processing files of text. A file is treated as a sequence of records, and by default each line is a record. Each line is broken up into a sequence of fields, so we can think of the first word in a line as the first field, the second word as the second field, and so on. An AWK program is a sequence of pattern-action statements. AWK reads the input a line at a time. A line is scanned for each pattern in the program, and for each pattern that matches, the associated action is executed."

When 'awk' reads a line, these words or fields are automatically split and stored in numbered variables like $1, $2, $3 etc. Take a line from a directory listing as an example:

Code:

-rw-r--r--  1 276  125       8246 Dec 11 07:35 915resolution-0.5.3.tgz

$1	:  -rw-r--r--
$2	:  1
$3	:  276
$4	:  125
$5	:  8246
$6	:  Dec
$7	:  11
$8	:  07:35
$9	:  915resolution-0.5.3.tgz
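To check the split yourself, a small sketch like the following prints each field together with its number. It relies on NF, a built-in 'awk' variable holding the number of fields of the current line; the shell variable names 'line' and 'fields' are only chosen for this illustration.

```shell
#!/bin/sh
# Print every field of one sample listing line together with its number.
line='-rw-r--r--  1 276  125       8246 Dec 11 07:35 915resolution-0.5.3.tgz'

# Loop from field 1 up to NF (the number of fields in the line).
fields=$(printf '%s\n' "$line" |
    awk '{ for (i = 1; i <= NF; i++) printf "$%d : %s\n", i, $i }')

printf '%s\n' "$fields"
```

Runs of blanks between the columns count as a single separator, so the sample line yields exactly nine fields.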

Since we are interested in the month ($6), day ($7), time ($8), name ($9) and size ($5) fields, a basic 'awk' script would look like this:

Code:

$ awk '{ print $6, $7, $8, $9, $5 }' test-file

Dec 11 07:35 915resolution-0.5.3.tgz 8246
Dec 11 07:35 9libs-1.0p4.tgz 121402

The only thing left is to convert the name of months like Nov and Dec into their ordinals 11 and 12.

3.1 Associative array or hash table

Modern shells do support arrays. For example:

Code:

$ month[1]=Jan
$ month[2]=Feb
$ month[3]=Mar
$ for N in 1 2 3 ; do echo Month $N = ${month[$N]} ; done
Month 1 = Jan
Month 2 = Feb
Month 3 = Mar

Unfortunately these arrays only allow a numeric index. 'awk', on the other hand, supports what is called an associative array or hash table, a data structure that accepts non-numeric keys to look up the corresponding value. Thus our month to number mapping problem can be solved like this:

Code:

month["Jan"]="01"
month["Feb"]="02"
month["Mar"]="03"

A short command line example.

Code:

$ (echo Nov 10; echo Jan 21) | \
awk 'BEGIN { month["Nov"]="11" ; month["Jan"]="01" } { print month[$1], $2 }' 
11 10
01 21

The two lines "Nov 10" and "Jan 21" are fed to a short 'awk' script. The 'awk' initialization block BEGIN { month["Nov"]="11" ; month["Jan"]="01" } defines and stores mappings of the text strings "Nov" and "Jan" to "11" and "01" in the hash table 'month'.

The { print month[$1], $2 } block will be executed for each line fed to 'awk'.

For the first line, the field 'Nov' will be assigned to the variable $1 and the second field '10' to $2. The command print month[$1] will be expanded to print month["Nov"], which will produce the corresponding value "11".

The same procedure will be repeated for the "Jan 21" line, resulting in the "01 21" output.
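One property of the hash table is worth knowing: looking up a key that was never stored yields an empty string rather than an error. The sketch below, with 'Xyz' as a made-up month name, uses the 'in' operator to test for a key before using it.

```shell
#!/bin/sh
# Look up one known and one unknown month name in a two-entry hash table.
out=$(printf 'Nov 10\nXyz 21\n' |
    awk 'BEGIN { month["Nov"] = "11"; month["Jan"] = "01" }
         {
             # "in" tests whether the key exists without creating it
             if ($1 in month) {
                 print month[$1], $2
             } else {
                 print "unknown month: " $1
             }
         }')

printf '%s\n' "$out"
```

Without the test, the second line would simply be printed with an empty month.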

The complete program so far:

Code:

#!/usr/bin/awk -f 

BEGIN {
# associative array mapping month names -> numbers 
month["Jan"]="01"
month["Feb"]="02"
month["Mar"]="03"
month["Apr"]="04"
month["May"]="05"
month["Jun"]="06"
month["Jul"]="07"
month["Aug"]="08"
month["Sep"]="09"
month["Oct"]="10"
month["Nov"]="11"
month["Dec"]="12"
}

{
 print $6, $7, $8, $9, $5 
}

After saving this as 'reformat.awk' and running chmod +x reformat.awk, it produces the following:

Code:

$ ls -l | tail -3
 -rw-r--r--  1 j65nko  j65nko      52 Dec  9 01:04 q
 -rwxr-xr-x  1 j65nko  j65nko     330 Dec 16 23:11 reformat.awk
 -rw-r--r--  1 j65nko  j65nko     387 Jan 31  2007 tmp

$ ls -l | tail -3 | reformat.awk
Dec 9 01:04 q 52
Dec 16 23:11 reformat.awk 330
Jan 31 2007 tmp 387

Although we extracted the fields we wanted, they do not align nicely.

3.2 Formatting with printf

Both 'awk' and 'sh' know a printf function similar to the one found in the standard C library.

To shorten the examples, we will first use the shell's printf. See printf(1).

To format an unsigned decimal number we use the %u conversion specification. Between the '%' and 'u' we insert a padding character '0' and the field width '2', so we get %02u.

Code:

$ printf "%02u\n" 1 
01
$ printf "%02u\n" 12
12

A test using 'awk':

Code:

$ echo Jan  1 | awk '{ printf "%s %02u  \n", $1, $2  }'
Jan 01

The name of the month in $1 is printed with a %s conversion specifier, while the '1' assigned to $2 is formatted according to %02u.

The other problem is that the same field can hold either a time string of 5 positions or a year using only 4 positions. By choosing a field width of 5 and a blank for padding we can solve this last issue.

Code:

$ printf "% 5u\n" 2007       
 2007
$ printf "% 5u\n" 12:11
printf: 12:11: not completely converted
   12

The %u tries to convert the time 12:11 to an unsigned integer. This of course fails, because ':' is not a digit. So we specify a string conversion with %s.

Code:

$ printf "% +5s\n" 12:11 2007
12:11
 2007

Although formatting the remaining fields is not needed in order to solve the sorting problem, we will have a look at the others, just for completeness' sake.

File name aligned in a field of 40 positions:

Code:

$ printf "%40s\n" "zope-coreblog-1.2p0.tgz"
                 zope-coreblog-1.2p0.tgz

Justification to the right is the default. For left adjustment we insert a '-' before the field width '40'.

Code:

$ printf "%-40s\n" "zope-coreblog-1.2p0.tgz"
zope-coreblog-1.2p0.tgz

For the file size we use the %u conversion specifier, with '10' as field width.

Code:

$ printf "%10u\n" 169395
    169395

Combining the file name and file size in one statement:

Code:

$ printf "%-40s %10u\n" "zope-coreblog-1.2p0.tgz" "169395"
zope-coreblog-1.2p0.tgz                      169395

3.3 The complete program in 'awk'

Merging all the conversion specifications into one statement, the final printf for 'awk' is:

Code:

printf "%s-%02u % 5s %-40s %10u\n", month[$6], $7, $8, $9,  $5

%s	: prints the number produced by month[$6]
-	: prints a literal '-' to separate month from day
%02u	: day of the month as found in $7
	: a space
% 5s	: the time or year from $8 aligned right in field of 5
	: a space
%-40s	: file name from $9, left aligned in a 40 positions wide field 
	: a space
%10u	: file size in $5, justified right in a 10 position field
\n	: a newline or linefeed
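As a quick check, the same conversion specifications can be tried with the shell's printf on hand-picked values. This is only a sketch: the month is given directly as the number 12, since the shell has no month[] table here, and %5s is written without the blank flag because padding with blanks is already the default.

```shell
#!/bin/sh
# Apply the combined format to hand-picked values (month already numeric).
row=$(printf '%s-%02u %5s %-40s %10u\n' 12 13 11:56 915resolution-0.5.3.tgz 8251)

printf '%s\n' "$row"
# Total width: 5 (date) + 1 + 5 (time) + 1 + 40 (name) + 1 + 10 (size)
printf 'line length: %d\n' "${#row}"
```

Every output line has the same length, which is what makes the columns line up.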

The resulting program:

Code:

#!/usr/bin/awk -f

# -rw-r--r--  1 276  125       8251 Dec 13 11:56 915resolution-0.5.3.tgz
# -rw-r--r--  1 276  125     121406 Dec 13 11:56 9libs-1.0p4.tgz
#    $1      $2  $3   $4       $5    $6 $7  $8       $9 

BEGIN {
# associative array mapping month names -> numbers 
month["Jan"]="01"
month["Feb"]="02"
month["Mar"]="03"
month["Apr"]="04"
month["May"]="05"
month["Jun"]="06"
month["Jul"]="07"
month["Aug"]="08"
month["Sep"]="09"
month["Oct"]="10"
month["Nov"]="11"
month["Dec"]="12"
}

{ 
 #print $6, $7, $8, $9, $5 
 printf "%s-%02u % 5s %-40s %10u\n", month[$6], $7, $8, $9,  $5
}

By adding the two sample lines and the $ field variable assignments as comments, we have documented in a simple way what the program is actually doing.
A test, however, reveals a slight problem.

Code:

$ head -4 pkg_snap-alberta | reformat.awk
-00                                                         0
12-13 11:56 915resolution-0.5.3.tgz                        8251
12-13 11:56 9libs-1.0p4.tgz                              121406
12-13 11:56 9menu-1.7.tgz                                  9877

A closer look at the first four lines:

Code:

$ head -4 pkg_snap-alberta
total 18016088
-rw-r--r--  1 276  125       8251 Dec 13 11:56 915resolution-0.5.3.tgz
-rw-r--r--  1 276  125     121406 Dec 13 11:56 9libs-1.0p4.tgz
-rw-r--r--  1 276  125       9877 Dec 13 11:56 9menu-1.7.tgz

The cause is the first line showing the total number of blocks.

A simple solution is to just ignore lines which do not have 9 fields. 'awk' has a variable called NF, which contains the number of fields and is updated for each line.

Code:

{  if ( NF != 9 ) { next } 
   printf "%s-%02u % 5s %-40s %10u\n", month[$6], $7, $8, $9,  $5 
   }

So if the number of fields, stored in NF, is not equal to 9, 'awk' will skip the remainder of the script and start processing the next input line.

Code:

$ head -4 pkg_snap-alberta | reformat.awk
12-13 11:56 915resolution-0.5.3.tgz                        8251
12-13 11:56 9libs-1.0p4.tgz                              121406
12-13 11:56 9menu-1.7.tgz                                  9877
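The NF test can also be moved out of the action and used as a pattern, so that the printf only runs for lines with exactly 9 fields. A minimal sketch, with the month table reduced to 'Dec' for this sample and %5s written without the blank flag:

```shell
#!/bin/sh
# Same filter written as a pattern: the action only runs for 9-field lines.
listing='total 18016088
-rw-r--r--  1 276  125       8251 Dec 13 11:56 915resolution-0.5.3.tgz
-rw-r--r--  1 276  125     121406 Dec 13 11:56 9libs-1.0p4.tgz'

out=$(printf '%s\n' "$listing" |
    awk 'BEGIN { month["Dec"] = "12" }   # only the month needed for this sample
         NF == 9 { printf "%s-%02u %5s %-40s %10u\n", month[$6], $7, $8, $9, $5 }')

printf '%s\n' "$out"
```

Either form skips every line that does not split into 9 fields, which also means that a file name containing a space would be dropped.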

Two examples:

Code:

$ reformat.awk < /home/j65nko/packages_snap | sort | head  -4
12-13 11:56 915resolution-0.5.3.tgz                        8251
12-13 11:56 9libs-1.0p4.tgz                              121406
12-13 11:56 9menu-1.7.tgz                                  9877
12-13 11:56 9wm-1.2prep0.tgz                              22061

$ reformat.awk < /home/j65nko/packages_snap | sort | tail -4
12-13 12:03 ztrack-1.0.tgz                                 8718
12-13 12:03 zziplib-0.13.49.tgz                           98528
12-13 12:03 zzuf-0.12.tgz                                124942
12-14 04:07 index.txt                                    117936

3.4 Back to the original problem

Extracting the date information is now simply a matter of printing fields 6, 7 and 8:

Code:

if ( NF != 9 ) { next }
printf "%s-%02u % 5s\n", month[$6], $7, $8

Saving the modified script as 'extract-dates.awk', we can now do the following to see the unique dates of the package directory listings.

Code:

$ extract-dates.awk pkg_snap-alberta | sort -u
12-13 11:56
12-13 11:57
12-13 11:58
12-13 11:59
12-13 12:00
12-13 12:01
12-13 12:02
12-13 12:03
12-14 04:07

The oldest package is dated 13th of Dec 11:56. We see a regular minute increment up to 12:03. But then there is a big gap to the 14th of Dec 04:07.

Using 'grep' it is easy to find the files dated '12-14 04:07':

Code:

$ reformat.awk pkg_snap-alberta | grep '12-14 04:07'
12-14 04:07 index.txt                                    117936

It is 'index.txt', an index file which is usually generated some hours after a directory update.

Repeating the same procedure for a listing from the ftp.stacken.kth.se mirror:

Code:

$ extract-dates.awk pkg_snap-stacken  | sort -u | cat -n
     1  12-13 18:56
     2  12-13 18:57
     3  12-13 18:58
     4  12-13 18:59
     5  12-13 19:00
     6  12-13 19:01
     7  12-13 19:02
     8  12-13 19:03
     9  12-14 11:07

We see the same pattern, albeit with a seven-hour difference, caused by the different time zones of Alberta, Canada and Stockholm, Sweden.

A listing of a mirror in the middle of an update:

Code:

11-08 17:06
11-08 17:07
11-08 17:08
11-08 17:09
11-08 17:10
11-15 19:47

Here the packages of the 8th of November have only partially been replaced by the new ones dated November 15.

An ftp site whose system administrator confirmed in an email that the 'rsync' 'delete' option had been forgotten:

Code:

01-07  2008
01-22  2008
01-31 15:17
01-31 15:18
01-31 15:19
01-31 15:20
01-31 15:21
01-31 15:22
01-31 15:23
02-16 22:24
02-16 22:25
02-16 22:26
02-16 22:27
02-16 22:28
02-16 22:29
[snip]
07-23 20:02
07-23 20:03
07-24 10:06
08-20  2007
11-18  2007
12-13  2007
12-29  2007

Upgrading your packages from such a mirror would be a recipe for disaster. Luckily a rather simple script can warn you of such a scenario.
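Such a warning script could look roughly like the sketch below. It is not a script from this guide: the function name count_days, the threshold MAXDAYS and the sample listing are assumptions made for this illustration. It counts the distinct month/day combinations in a listing and complains when there are more than expected for a mirror that was synced in one run.

```shell
#!/bin/sh
# Hypothetical check: warn when a listing contains more distinct days
# than expected. MAXDAYS is an assumed threshold, adjust to taste.
MAXDAYS=2

count_days() {
    # Reads an 'ls -l' style listing on stdin and prints the number of
    # distinct "month day" combinations among the 9-field lines.
    awk 'NF == 9 { seen[$6 " " $7] = 1 }
         END { n = 0; for (d in seen) n++; print n }'
}

# Made-up sample data in the format of the listings shown earlier.
sample='total 4
-rw-r--r--  1 276  125  8251 Nov  8 17:06 a.tgz
-rw-r--r--  1 276  125  9877 Nov  8 17:10 b.tgz
-rw-r--r--  1 276  125  9877 Nov 15 19:47 c.tgz
-rw-r--r--  1 276  125  1234 Jan  7  2008 d.tgz'

days=$(printf '%s\n' "$sample" | count_days)

if [ "$days" -gt "$MAXDAYS" ]; then
    echo "WARNING: $days distinct dates in listing, mirror may be stale or mid-update"
else
    echo "OK: $days distinct dates"
fi
```

A threshold of 2 leaves room for the 'index.txt' file, which is generated some hours after the packages and may therefore fall on the next day.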

4 A 'perl' approach

The 'perl' solution also uses an associative array or hash and is very similar to the 'awk' script.

Code:

     1  #!/usr/bin/perl -w
     2  
     3  use strict ;
     4  
     5  my %month = qw ( Jan 01 Feb 02 Mar 03 Apr 04 May 05 Jun 06 
     6                   Jul 07 Aug 08 Sep 09 Oct 10 Nov 11 Dec 12 ) ;
     7  
     8  # Format of ftp listing 
     9  # -rw-r--r--  1    0     122     109360 Dec  5    13:45 INSTALL.i386
    10  # -rw-r--r--  1    0     122     109360 Jan  5     2007 INSTALL.i386
    11  #   $perm    $blk $user $group   $size  $m  $day  $time   $name
    12  
    13  sub reformat {
    14      my ($perm, $blk, $user, $group, $size, $m, $day, $time, $name) ;
    15      my $len = 40 ;
    16  
    17      while ( <> ) {
    18          next if /^\s*$/ ;    # skip empty or whitespace only lines
    19          next if /^total/i ;  # skip 'total blocks' line
    20          ($perm, $blk, $user, $group, $size, $m, $day, $time, $name) = () ;
    21          ($perm, $blk, $user, $group, $size, $m, $day, $time, $name) = split ;
    22          printf("%s-%0+2u %+5s %-${len}s %+10u\n", $month{$m}, $day, $time, $name, $size) ;
    23      }
    24  }
    25  
    26  reformat() ;

We follow the recommended practice of using the -w option, in order to have perl issue warnings about questionable constructs.

The 'use strict' pragma forces the programmer to declare all variables before using them, and it rejects symbolic references and bareword identifiers that are not known subroutine names.

Lines 5-6 declare and initialize the associative array or hash '%month'. To print the number of the month July you would use print $month{Jul}, which is not much different from print month["Jul"] in 'awk'.

Unlike 'awk', 'perl' does not automatically split lines into variables. One advantage is that we can give meaningful names to the fields. We declare them as local variables of the 'reformat' subroutine in line 14.

An alternative would be to use an array, e.g. @fields, for storage. Because 'perl' arrays are indexed from zero, the month field would then be retrieved with $fields[5].

Code:

my @fields = split ;

The variable '$len' is used in the 'printf' statement to allow easy change of the width of the file name field.

Lines 17-23 contain a loop which reads the input line by line, either from the files named on the command line or, if there are none, from standard input. Each line is assigned to the variable $_. This variable, unless overridden, is used as the default in many 'perl' constructs.

For example, the skipping of empty or whitespace-only lines as done in line 18 is actually a shortcut for next if ( $_ =~ /^\s*$/ ) .

A break down of the regular expression:

Code:

/	: starting delimiter
^	: beginning of line
\s	: whitespace characters like spaces, tabs etc
*	: quantifier indicating zero or more occurrences of the preceding atom '\s' 
$	: end of line
/	: closing delimiter
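For comparison, the same filter can be sketched in 'awk'. The bracket expression [ \t] is an assumption standing in for \s; it only covers spaces and tabs.

```shell
#!/bin/sh
# Drop empty and whitespace-only lines, keep the rest.
# Input: "first", an empty line, a line of spaces, a line with a tab, "last".
kept=$(printf 'first\n\n   \n\t\nlast\n' |
    awk '/^[ \t]*$/ { next } { print }')

printf '%s\n' "$kept"
```

Because '*' also allows zero occurrences, the completely empty line is matched and skipped just like the lines holding only blanks or a tab.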

In program line 19, lines starting with the text 'total' are also skipped. Here again $_, holding the current line read from standard input, is the implicit variable. The 'i' modifier makes the pattern match a case-insensitive one.

The variables to receive the fields are emptied in line 20.

The splitting of the line into fields is done in line 21 with the function 'split'. By default, just like in 'awk', this splitting uses whitespace as separator. It is also another instance where $_ is assumed to hold the text to be split.

The 'perl' printf has a syntax similar to the one in 'libc', the standard C library. A convenient difference is that the format is an ordinary double-quoted 'perl' string, so a variable can be interpolated into a conversion specifier, as is done with the variable $len in line 22.

Appendices

A.1 The 'awk' directory listing reformatter

Code:

#!/usr/bin/awk -f

# -rw-r--r--  1 276  125       8251 Dec 13 11:56 915resolution-0.5.3.tgz
# -rw-r--r--  1 276  125     121406 Dec 13 11:56 9libs-1.0p4.tgz
#    $1      $2  $3   $4       $5    $6 $7  $8       $9

BEGIN {
# associative array mapping month names -> numbers 
month["Jan"]="01"
month["Feb"]="02"
month["Mar"]="03"
month["Apr"]="04"
month["May"]="05"
month["Jun"]="06"
month["Jul"]="07"
month["Aug"]="08"
month["Sep"]="09"
month["Oct"]="10"
month["Nov"]="11"
month["Dec"]="12"
}

{  if ( NF != 9 ) { next } 
   printf "%s-%02u % 5s %-40s %10u\n", month[$6], $7, $8, $9,  $5
}

Because laziness is a programmer's virtue, a short shell script generated the initialization statements for the associative array.

Code:

i=0

for M in Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec ; do
    #        month["Jan"]="01"
    printf "month[\"%s\"]=\"%02u\"\n" $M $((++i))
done

A.1.1 A pattern matching variation

Code:

#!/usr/bin/awk -f

# -rw-r--r--  1 276  125       8251 Dec 13 11:56 915resolution-0.5.3.tgz
# -rw-r--r--  1 276  125     121406 Dec 13 11:56 9libs-1.0p4.tgz
#    $1      $2  $3   $4       $5    $6 $7  $8       $9 

  $6 ~ /Jan/  { $6 = "01" }
  $6 ~ /Feb/  { $6 = "02" }
  $6 ~ /Mar/  { $6 = "03" }
  $6 ~ /Apr/  { $6 = "04" }
  $6 ~ /May/  { $6 = "05" }
  $6 ~ /Jun/  { $6 = "06" }
  $6 ~ /Jul/  { $6 = "07" }
  $6 ~ /Aug/  { $6 = "08" }
  $6 ~ /Sep/  { $6 = "09" }
  $6 ~ /Oct/  { $6 = "10" }
  $6 ~ /Nov/  { $6 = "11" }
  $6 ~ /Dec/  { $6 = "12" }

{
  if ( NF != 9 ) { next }
  printf "%s-%0+2u %+5s %-40s %+10u\n", $6, $7, $8, $9,  $5
}

Instead of an associative array, pattern matching is being used here. Paraphrasing the first line: "If $6 matches the string 'Jan', store the string '01' in $6".
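Note that ~ performs a regular expression match, so /Jan/ matches any field that contains 'Jan'. For the three-letter month names in these listings that makes no difference, but the sketch below, fed with the made-up input 'January', shows how it differs from an exact comparison with ==.

```shell
#!/bin/sh
# Compare a regular expression match with an exact string comparison.
out=$(printf 'Jan\nJanuary\n' |
    awk '{
             regex = ($1 ~ /Jan/)  ? "yes" : "no"   # matches anywhere in the field
             exact = ($1 == "Jan") ? "yes" : "no"   # whole field must be equal
             print $1 ": regex=" regex " exact=" exact
         }')

printf '%s\n' "$out"
```

An anchored pattern such as /^Jan$/ would behave like the exact comparison.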

The twelve pattern matching statements were generated with a simple shell script.

Code:

i=0

for M in Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec ; do
            #  $6 ~ /Jan/  { $6 = "01" }
    printf "  \$6 ~ /%s/  { \$6 = \"%02u\" }\n" $M $((++i))
done

A.1.2 The 'awk' date extractor

Code:

#!/usr/bin/awk -f

# -rw-r--r--  1 276  125       8251 Dec 13 11:56 915resolution-0.5.3.tgz
# -rw-r--r--  1 276  125     121406 Dec 13 11:56 9libs-1.0p4.tgz
#    $1      $2  $3   $4       $5    $6 $7  $8       $9 

BEGIN {
# associative array mapping month names -> numbers 
month["Jan"]="01"
month["Feb"]="02"
month["Mar"]="03"
month["Apr"]="04"
month["May"]="05"
month["Jun"]="06"
month["Jul"]="07"
month["Aug"]="08"
month["Sep"]="09"
month["Oct"]="10"
month["Nov"]="11"
month["Dec"]="12"
}

{ 

if ( NF != 9 ) { next }
printf "%s-%02u % 5s\n", month[$6], $7, $8

}

B.1 The 'perl' reformatter

Code:

#!/usr/bin/perl -w

use strict ;

my %month = qw ( Jan 01 Feb 02 Mar 03 Apr 04 May 05 Jun 06 
                 Jul 07 Aug 08 Sep 09 Oct 10 Nov 11 Dec 12 ) ;

# Format of ftp listing 
# -rw-r--r--  1    0     122     109360 Dec  5    13:45 INSTALL.i386
# -rw-r--r--  1    0     122     109360 Jan  5     2007 INSTALL.i386
#   $perm    $blk $user $group   $size  $m  $day  $time   $name

sub reformat {
    my ($perm, $blk, $user, $group, $size, $m, $day, $time, $name) ;
    my $len = 40 ;

    while ( <> ) {
        next if /^\s*$/ ;    # skip empty or whitespace only lines
        next if /^total/i ;  # skip 'total blocks' line
        ($perm, $blk, $user, $group, $size, $m, $day, $time, $name) = () ;
        ($perm, $blk, $user, $group, $size, $m, $day, $time, $name) = split ;
        printf("%s-%0+2u %+5s %-${len}s %+10u\n", $month{$m}, $day, $time, $name, $size) ;
    }
}

reformat() ;

$Id: Reformatting.xml,v 1.4 2008/12/23 22:13:06 j65nko Exp $
$Id: book-vbul-html.xsl,v 1.3 2008/12/24 02:59:45 j65nko Exp $
