DaemonForums - View Single Post

J65nko · #1 **(View Single Post)** 12th December 2011

In a database import of over 200,000 records, I noticed that a length of 30 chars was not long enough to hold all email addresses.

With assistance of a short awk script I could quickly determine the optimal length:

Code:

$ awk -f field_length.awk import_file.csv

111628 065031051@rrrrrr.com
322626 0594500515@wwwwww.co.za
[snip]
361682 ｈｕｉｌｉｎ１９
７８@wrwrwrwr.ｃ
;ｏｍ
xxxxxxxxx@hotmail.com
146

This unbelievable longest mail address probably was caused by either an UTF-16 issue or data corruption.

The awk script:

Code:

BEGIN {
SIZE = 0 ;
FS = "," ;
}

# field 1 is the primary key
# field 2 is the email address
{
if (length($2) > SIZE ) {
   print $1, $2 ;
   SIZE = length($2) ;
   }
}

END {
print $2 ;
print SIZE ;
}

BTW sed(1) confirmed this:

Code:

$ sed -ne '/361682/p' masterlist_comma.csv 
361682 ｈｕｉｌｉｎ１９
７８@wrwrwrwr.ｃ
;ｏｍ

After editing out the offending line the result was:

Code:

 $ awk -f field_length.awk import_file2.csv
[snip]
58

So a field width of 60 seemed safe