DaemonForums  

Go Back   DaemonForums > Miscellaneous > Guides

Guides All Guides and HOWTO's.

Reply
 
Thread Tools Display Modes
  #1   (View Single Post)  
Old 12th December 2011
J65nko J65nko is offline
Administrator
 
Join Date: May 2008
Location: Budel - the Netherlands
Posts: 4,128
Default Tools to assist in database import

In a database import of over 200,000 records, I noticed that a length of 30 chars was not long enough to hold all email addresses.

With assistance of a short awk script I could quickly determine the optimal length:

Code:
$ awk -f field_length.awk import_file.csv

111628 065031051@rrrrrr.com
322626 0594500515@wwwwww.co.za
[snip]
361682 huilin19
78@wrwrwrwr.c
;om
xxxxxxxxx@hotmail.com
146
This unbelievable longest mail address probably was caused by either an UTF-16 issue or data corruption.

The awk script:
Code:
BEGIN {
SIZE = 0 ;
FS = "," ;
}

# field 1 is the primary key
# field 2 is the email address
{
if (length($2) > SIZE ) {
   print $1, $2 ;
   SIZE = length($2) ;
   }
}

END {
print $2 ;
print SIZE ;
}
BTW sed(1) confirmed this:
Code:
$ sed -ne '/361682/p' masterlist_comma.csv 
361682 huilin19
78@wrwrwrwr.c
;om
After editing out the offending line the result was:
Code:
 $ awk -f field_length.awk import_file2.csv
[snip]
58
So a field width of 60 seemed safe
__________________
You don't need to be a genius to debug a pf.conf firewall ruleset, you just need the guts to run tcpdump

Last edited by J65nko; 12th December 2011 at 06:17 AM.
Reply With Quote
  #2   (View Single Post)  
Old 12th December 2011
J65nko J65nko is offline
Administrator
 
Join Date: May 2008
Location: Budel - the Netherlands
Posts: 4,128
Default

In tomorrows episode a Perl script to count the number of comma's, and thus the fields, in an .csv file.
__________________
You don't need to be a genius to debug a pf.conf firewall ruleset, you just need the guts to run tcpdump
Reply With Quote
  #3   (View Single Post)  
Old 12th December 2011
ocicat ocicat is offline
Administrator
 
Join Date: Apr 2008
Posts: 3,318
Default

Quote:
Originally Posted by J65nko View Post
...a Perl script to count the number of comma's...
Consider the following (assuming comma are not found in the data...):
Code:
#!/bin/env perl

use strict;
use warnings;

# array of test strings
my @a = (',', ',,', ',,,');

my $max = 0;
foreach my $line (@a) {
    my @matches = ($line =~ /(\,)/g);
    $max = @matches if @matches > $max;
}

print "max #occurrences = $max\n";
TIMTOWTDI.


Last edited by ocicat; 12th December 2011 at 07:01 AM.
Reply With Quote
  #4   (View Single Post)  
Old 12th December 2011
J65nko J65nko is offline
Administrator
 
Join Date: May 2008
Location: Budel - the Netherlands
Posts: 4,128
Default

The issue I had on hand was, that the import file was over 200,000 records. So I needed to know exactly which record/line did not have 56 fields.

I took advantage of the split function which in scalar context returns the number of fields resulting from the split:

Code:
#!/usr/bin/perl

use warnings ;
use strict ;

my @temp ;
my $nr  ;

while (<>)  {
  chomp ;
  $nr = (@temp = split /,/) ; 
  print "$nr $temp[0]\n" ; 
}
A sample run:
Code:
$ head -10 masterlist_comma.csv | ./split.pl

56 id
56 100625
56 100626
56 100627
56 100628
56 100629
56 100630
56 100631
56 100632
56 100633
The first field 'id' is the primary index (unique) so I choose to print that too, and thus enabling me to locate any culprits.

I checked these culprits by first redirecting to file fields_count.txt and then use grep(1)

Code:
$ grep -v '^56 ' fields_count.txt
$
No culprits in the first 10 records There also were not any in the other 200,000.

PS why grep -v?
Code:
    -v      Selected lines are those not matching any of the specified
             patterns.
__________________
You don't need to be a genius to debug a pf.conf firewall ruleset, you just need the guts to run tcpdump
Reply With Quote
Reply

Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

Forum Jump

Similar Threads
Thread Thread Starter Forum Replies Last Post
MySQL database import not helpful J65nko General software and network 2 10th December 2011 02:45 AM
When is a database not so relational? J65nko News 1 23rd May 2011 05:38 AM
zfs import terryd FreeBSD General 1 23rd November 2008 09:26 PM
import physical freeBSD into VMWARE (ESX) server as a vServer ccc FreeBSD General 6 3rd October 2008 07:04 AM
PHP database interfaces TerryP Programming 6 11th September 2008 01:03 PM


All times are GMT. The time now is 04:06 PM.


Powered by vBulletin® Version 3.8.4
Copyright ©2000 - 2024, Jelsoft Enterprises Ltd.
Content copyright © 2007-2010, the authors
Daemon image copyright ©1988, Marshall Kirk McKusick