Old 9th April 2009
gosha
Spam Deminer
 
Join Date: Jun 2008
Location: China
Posts: 256
perl sorting non ascii chars SOLVED

I know, you want me to study and work it out by myself.
Actually, I've finally found a good tutorial page on this; I simply had not searched Google with the right keywords before. Here's the link: http://interglacial.com/~sburke/tpj/as_html/tpj14.html

In my personal case, I have two extra letters to sort: š and ū.
I've made this test file:
Code:
abc
aab
bbc
mmn
lmn
aaa
ššš
sss
zzz
ccc
ggg
uuu
šas
saš
cab
uuū
ūuu
ūūū
Here's the code:
Code:
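use strict;
use warnings;
use utf8;                             # the source itself contains š and ū
binmode(STDOUT, ':encoding(UTF-8)');  # print the sorted words as UTF-8
# Read the file as UTF-8 so that š and ū arrive as single characters
open(my $fh, '<:encoding(UTF-8)', 'absolute-path-to-file')  or  die "Failed to read file: $!";
my @not_sorted = <$fh>;
close($fh);
sub normalize {
   my $in = $_[0];
   $in = lc($in);
   $in =~ tr<aeiouū>
   <aeiouu>; # fold ū onto u: it is a variant of u, not a separate letter
   $in =~ tr<abcdefghijklmnopqrsštuvwxyz>
   <\x01-\x1B>; # hexadecimal codes 1..27 tell Perl there are 27 letters to sort, with š right after s
   return $in;
}
# Compare the normalized keys first; on a tie, fall back to the raw strings
my @sorted = sort { normalize($a) cmp normalize($b) or $a cmp $b } @not_sorted;
print @sorted;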
I still don't completely understand why ū sorts in the proper order without being treated as an extra letter, the way š has to be, but I eventually will. Anyway, it gives the expected result:
Code:
aaa
aab
abc
bbc
cab
ccc
ggg
lmn
mmn
saš
sss
šas
ššš
uuu
uuū
ūuu
ūūū
zzz
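A possible explanation for the ū question, as a small sketch rather than a definitive answer: if the strings are handled as characters (hence use utf8 here), ū is folded onto u in the normalized key, so "uuu", "uuū" and "ūuu" compare as equal there, and only the "or $a cmp $b" fallback orders them, putting plain u (the lower code point) first. š, on the other hand, gets its own code between s and t, so it is a letter of its own. The name primary_key and the in-memory word list are just made up for this illustration:

```perl
use strict;
use warnings;
use utf8;   # the literals below contain š and ū

# Hypothetical helper: fold ū onto u, then give each of the 27 letters a code 1..27
sub primary_key {
    my ($word) = @_;
    my $key = lc $word;
    $key =~ tr/ū/u/;                                     # ū is a variant of u
    $key =~ tr/abcdefghijklmnopqrsštuvwxyz/\x01-\x1B/;   # š gets its own slot after s
    return $key;
}

my @words  = ('ūuu', 'uuū', 'uuu', 'šas', 'sss', 'tab');

# First level: normalized key. Second level (ties only): the raw strings.
my @sorted = sort { primary_key($a) cmp primary_key($b) or $a cmp $b } @words;

binmode(STDOUT, ':encoding(UTF-8)');
print "$_\n" for @sorted;
```

The three u-words share one key, so their order comes only from the second comparison, while šas lands between sss and tab because of its own code.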
Hope it will be helpful to someone.