Recently I had to convert several text documents to XML.
To make sure that no otherwise empty lines contained just spaces and/or tabs, I wrote the following small Perl script called 'xlblanks'.
Code:
#!/usr/bin/perl
use warnings;
use strict;
use diagnostics;
# --- delete spaces and tabs from otherwise empty lines
my $total = 0;
my $line_nr;
my @nrs;
while (<>) {
    ++$line_nr;
    if (
        s/
            ^           # at the beginning of the line
            [\t\ ]+     # one or more tabs or blanks
            $           # followed by the end of the line
        //x             # replaced by nothing
    ) {
        ++$total;
        push @nrs, $line_nr;
    }
    print;
}
print STDERR "\n$0: Number of lines found with only tabs or blanks: $total\n";
$, = '-' ;
print STDERR "$0: The line numbers: ", @nrs , "\n\n";
A small sample file shows no visible blanks or tabs on otherwise empty lines:
Code:
FreeBSD

DragonFlyBSD

NetBSD

OpenBSD
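For anyone who wants to reproduce this, here is a minimal sketch that recreates a comparable sample file with 'printf'. The helper name make_sample is made up for this example, and the exact whitespace on line 3 is an assumption (a single space); the post only establishes that the line holds blanks or tabs.

```shell
# Recreate a sample file resembling blanklines.txt (hypothetical reconstruction).
# Assumption: line 3 holds a single space; lines 5 and 7 follow the
# 'cat -net' listing (tab + space, and a lone tab).
make_sample() {
    printf '\nFreeBSD\n \nDragonFlyBSD\n\t \nNetBSD \n\t\nOpenBSD\n\n\n' > "$1"
}

make_sample blanklines.txt
```

Note that line 6 ("NetBSD" plus a trailing space) is not an otherwise empty line, so neither the Perl script nor the 'sed' command touches it.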
Running the script:
Code:
$ ./xlblanks blanklines.txt
FreeBSD

DragonFlyBSD

NetBSD

OpenBSD
./xlblanks: Number of lines found with only tabs or blanks: 3
./xlblanks: The line numbers: -3-5-7-
Displaying the file with 'cat' confirmed these results:
Code:
$ cat -net blanklines.txt
1 $
2 FreeBSD$
3 $
4 DragonFlyBSD$
5 ^I $
6 NetBSD $
7 ^I$
8 OpenBSD$
9 $
10 $
The two lines reporting the results are sent to STDERR, which makes it possible to create a 'clean' version by redirecting standard output to a file:
Code:
$ ./xlblanks blanklines.txt >clean.txt
./xlblanks: Number of lines found with only tabs or blanks: 3
./xlblanks: The line numbers: -3-5-7-
$ cat -net clean.txt
1 $
2 FreeBSD$
3 $
4 DragonFlyBSD$
5 $
6 NetBSD $
7 $
8 OpenBSD$
9 $
10 $
The line-number report, which is sent to 'stderr' (file descriptor 2), can be redirected to a file with:
Code:
$ ./xlblanks blanklines.txt >clean.txt 2> culprits.txt
$ cat culprits.txt
./xlblanks: Number of lines found with only tabs or blanks: 3
./xlblanks: The line numbers: -3-5-7-
In case you wonder why the line numbers needed to be reported:
The original master files are maintained in MS Word format, so knowing the line numbers made it easy to find and eliminate those irritating, useless blanks and tabs.
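If only the line numbers are wanted, without producing a cleaned copy, something like the following should do. This is a rough sketch using standard 'grep' and 'cut', not part of xlblanks, and the function name blank_line_numbers is made up for the example.

```shell
# Print the numbers of lines that consist solely of spaces and/or tabs,
# one number per line. blank_line_numbers is a hypothetical helper name.
blank_line_numbers() {
    # grep -n prefixes each match with "NUMBER:"; cut keeps only the number.
    grep -n -E '^[[:blank:]]+$' "$1" | cut -d: -f1
}

# Usage: blank_line_numbers blanklines.txt
```

Truly empty lines are not listed, because the pattern requires at least one blank or tab.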
An equivalent 'sed' one-liner, without the line-number reporting:
Code:
$ sed -Ee 's/^[[:blank:]]+$//g' blanklines.txt | cat -net
1 $
2 FreeBSD$
3 $
4 DragonFlyBSD$
5 $
6 NetBSD $
7 $
8 OpenBSD$
9 $
10 $
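To get both the cleanup and the report without Perl, 'awk' is one option. The sketch below is only an approximation of what xlblanks does: the name clean_and_report and the wording of the report are my own choices for this example, and it assumes an awk that accepts "/dev/stderr" as an output target.

```shell
# Blank out whitespace-only lines on stdout and report their numbers on stderr.
# clean_and_report is a hypothetical name; the report text is simplified.
clean_and_report() {
    awk '
        /^[ \t]+$/ {
            # Remember the line number, joined with "-" between entries.
            nrs = nrs (nrs == "" ? "" : "-") NR
            total++
            $0 = ""
        }
        { print }
        END {
            printf "lines with only tabs or blanks: %d\n", total > "/dev/stderr"
            if (nrs != "") print "line numbers: " nrs > "/dev/stderr"
        }
    ' "$1"
}

# Usage: clean_and_report blanklines.txt > clean.txt 2> culprits.txt
```

Unlike the Perl version, the numbers are joined without a leading or trailing dash.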