Old 7th April 2009
gosha
sorting non ascii chars

OK guys, I've started reading tutorials and all. I've also found a tool which should help with this sorting of mine: Unicode::Collate (from CPAN).
Now I have this test file:
Code:
abc
aab
bbc
mmn
lmn
aaa
ššš
sss
zzz
If I sort it, I get this:
Code:
$ sort test
aaa
aab
abc
bbc
lmn
mmn
sss
zzz
ššš
If I sort it with this Perl script, which I worked out from the usage notes of Unicode::Collate, I get this (and it's really slow!):
Code:
aaa
ššš
aab
abc
bbc
lmn
mmn
sss
zzz
As you can see, the "ššš" is no longer after "zzz", which is already an improvement, but it should come right after "sss".
Do I have to explicitly tell Perl where to put them? How?
Here's the script (don't laugh too loud):
Code:
use strict;
use warnings;
use Unicode::Collate;

my %tailoring;  # left empty, so the default collation is used
my $Collator = Unicode::Collate->new(%tailoring);
open(my $names_fh, '<', 'path-to-my-file') or die "Failed to read file: $!";
my @not_sorted = <$names_fh>;  # read the entire file into the array
close($names_fh);
my @sorted = $Collator->sort(@not_sorted);
print @sorted;
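A possible explanation, offered as a guess rather than a confirmed diagnosis: the script above reads the file as raw bytes, so Unicode::Collate never sees "š" as a single character and may weigh its two UTF-8 bytes as separate Latin-1 characters. The sketch below assumes the file is UTF-8 encoded; the helper name sort_utf8_lines is made up for illustration and is not part of Unicode::Collate.

Code:
```perl
use strict;
use warnings;
use Unicode::Collate;

# Hypothetical helper: read a file, decode it, and return collated lines.
# Assumption: the file is UTF-8 encoded text with one entry per line.
sub sort_utf8_lines {
    my ($path) = @_;
    # Decode bytes into characters while reading
    open(my $fh, '<:encoding(UTF-8)', $path)
        or die "Failed to read $path: $!";
    chomp(my @lines = <$fh>);
    close($fh);
    # No tailoring: default Unicode collation order
    return Unicode::Collate->new->sort(@lines);
}

# Usage: perl script.pl path-to-my-file
if (@ARGV) {
    # Encode characters back to UTF-8 on output
    binmode(STDOUT, ':encoding(UTF-8)');
    print "$_\n" for sort_utf8_lines($ARGV[0]);
}
```

With decoded input, the default collation should place "ššš" between "sss" and "zzz", because "š" then differs from "s" only by its accent.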
This is the synopsis of Unicode::Collate, but I'm not grasping it very well yet:
Code:
  use Unicode::Collate;

  #construct
  $Collator = Unicode::Collate->new(%tailoring);

  #sort
  @sorted = $Collator->sort(@not_sorted);

  #compare
  $result = $Collator->cmp($a, $b); # returns 1, 0, or -1.

  # If %tailoring is false (i.e. empty),
  # $Collator should do the default collation.
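To make the synopsis more concrete, here is a minimal sketch of cmp on in-memory strings. It writes "š" as the escape \x{161} so that the result does not depend on how the script file itself is saved. The variable names are illustrative only; level is one of the tailoring options documented by Unicode::Collate.

Code:
```perl
use strict;
use warnings;
use Unicode::Collate;

my $s_caron = "\x{161}" x 3;   # "ššš" (U+0161 is s with caron)

# Empty tailoring: default collation
my $Collator = Unicode::Collate->new();

# cmp behaves like Perl's built-in cmp, but uses collation order
my $s_vs_caron = $Collator->cmp("sss", $s_caron);   # -1: "sss" comes first
my $caron_vs_z = $Collator->cmp($s_caron, "zzz");   # -1: "ššš" before "zzz"
my $identical  = $Collator->cmp("abc", "abc");      #  0: same string

# Tailoring example: level => 1 compares base letters only,
# so the accent is ignored and the two strings compare equal
my $base_only = Unicode::Collate->new(level => 1)->cmp("sss", $s_caron);

print "$s_vs_caron $caron_vs_z $identical $base_only\n";   # prints "-1 -1 0 0"
```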