sorting special characters

gosha · #1 · 20th March 2009

Hello everybody!
Here I am again with my formatting and sorting problems.

How do I tell sort to put "š" just after "s" and not at the end after "z"?
Or is there a better way than using sort?

ocicat · #2 · 20th March 2009

Quote:

Originally Posted by gosha

How do I tell sort to put "š" just after "s" and not at the end after "z"?

According to the manpage for sort(1), it only sorts lexicographically. The only other knob is to ignore case.

Quote:

Or is there a better way than using sort?

Use awk(1), perl(1), or some other scripting language which will allow writing your own custom sorting routine.

The standard tools allow for standard usage. Anything beyond this is better done through more sophisticated options.

gosha · #3 · 20th March 2009

Thanks, I guess it's time to start awk and perl

ocicat · #4 · 20th March 2009

Quote:

Originally Posted by gosha

Thanks, I guess it's time to start awk and perl

Given many of your recent questions, either language (as well as Python...) can do the job as there is overlap in their functionalities. What you should do next is look at a little of each & determine which seems more intuitive in terms of syntax, usage, script construction, etc.

Recognize that if you are simply wanting custom sorting, then awk(1) may very well be your best choice (for now...). However, if you continue down this path of wanting custom scripts for this or that need, then you should begin assessing which language meets your more long term goals & go with the best choice. It takes time & effort to mount the learning curve of any language, & continually flipping from one choice to the next is counterproductive.

jggimi · #5 · 20th March 2009

Isn't it wonderful how XKCD always has something applicable?

perl: http://xkcd.com/208/
python: http://xkcd.com/353/

ocicat · #6 · 20th March 2009

Final comment (on this subject...).

I am not aware of any awk-specific mailing lists or help sites, but then, I have never had need of one myself, so I haven't done extensive searching.

However, if you choose Perl and/or Python, consider the following.

For Perl:
- The Perl Cookbook is indispensable:
  
  http://oreilly.com/catalog/9780596003135/
  
  Other books, such as Programming Perl, are good resources:
  
  http://oreilly.com/catalog/9780596000271/
  
  One of the better beginning Perl books is Learning Perl:
  
  http://oreilly.com/catalog/9780596520106/
- No connoisseur of Perl should progress without knowledge of the Perl Monks Website:
  
  http://perlmonks.org/

For Python:
- As for books, Learning Python is a good choice for newcomers:
  
  http://oreilly.com/catalog/9780596513986/
  
  Note that Dive into Python is available online:
  
  http://www.diveintopython.org/
- http://python.org/ has lots of documentation -- even oriented to the newbie.
- The python-help@ mailing list can be of great use to newcomers. Information on subscribing can be found at the following link:
  
  http://www.python.org/community/lists/
- A site previously mentioned:
  
  http://www.ibm.com/developerworks/
  
  ...happens to have a number of good Python articles written by some influential members of the Python community.

Although this partial/simple book list is O'Reilly-centric, O'Reilly cornered the market when it comes to Perl titles. Other good non-O'Reilly titles exist, but when starting out with the language, staying with the animal books is a reasonable choice.

As for Python, O'Reilly has some good titles, but they did not capture the Python book market as they did with Perl. Python came out after Perl, & the industry was at a different point in its maturation. These may be contributing factors as to the difference.

gosha · #7 · 20th March 2009

Thanks a lot for your suggestions.
I think right now I might first use awk, which seems from the outside "smaller" and "simpler", but then I'll have to learn at least Perl. In fact, yesterday I found a conversion tool (Encode::HanConvert) which I will need very often to convert Simplified Chinese characters to Traditional ones and vice versa. This tool is in Perl, so I guess it has all I need. As far as Python goes, I presently cannot understand the difference between the two, so maybe with time I will.

gosha · #8 · 20th March 2009

jggimi, the comics are really nice

ocicat · #9 · 20th March 2009

Quote:

Originally Posted by gosha

As far as Python goes, I presently cannot understand the difference between the two, so maybe with time I will.

From the perspective of the English speaking hordes, Python's syntax is more "English"-like without the plethora of special characters & special nuisances required by other languages (specifically Perl). Some find this minimized amount of computer science cruft makes Python easier to write than other languages modeled on C (like Perl). Personally, I don't have such misgivings about Perl, but I know many that do.

How this "ease of use" translates to those speaking Chinese is unknown to me. Maybe the simplicity doesn't translate at all.

As for the goals of both languages, they are very similar, but Perl comes from a heritage inheriting the syntax & mindset of both shell & C programming. Python doesn't duplicate this lineage.

And for what it is worth, awk also inherits various idiosyncrasies from both shell & C programming. awk has a lot of power & served as a prominent scripting language alternative until Perl (& later Python...) arrived on the scene.

gosha · #10 · 20th March 2009

Well, I'm neither English nor Chinese mother tongue, so the "Englishness" does not make a big difference to me. Maybe with time I might learn all three languages, but now I'll go first for awk and then Perl, and if its syntax is similar to shell and C, it will also help me understand Unix better, I think.

drl · #11 · 21st March 2009

Hi.

Quote:

Originally Posted by ocicat

Final comment (on this subject...).

I am not aware of any awk-specific mailing lists or help sites, but then, I have never had need of one myself, so I haven't done extensive searching.

...

There is a lot of information at http://awk.info/

I use awk mostly for field-related, single-shot programs. If I needed advice, I would ask at http://www.unix.com/shell-programming-scripting/ -- that's a hot-bed of awk questions and answers. I have seen some very complex and creative solutions there, as well as gentle answers for novice users. As usual, it is in one's best interest to try to solve a problem first, then -- as necessary -- post sample input, desired results, and actual results.

That forum is also good for perl questions.

Best wishes ... cheers, drl

Carpetsmoker · #12 · 21st March 2009

Quote:

Originally Posted by gosha

Well, I'm neither English nor Chinese mother tongue, so the "Englishness" does not make a big difference to me. Maybe with time I might learn all three languages, but now I'll go first for awk and then Perl, and if its syntax is similar to shell and C, it will also help me understand Unix better, I think.

This is not what ocicat meant; he meant that Python is more like a natural language (ANY language) and has less syntax. For example, Python doesn't require a semicolon (;) at the end of each statement, and it doesn't require curly braces ({ }) and parentheses ( () ) in many places where most other languages do, and so forth.

This is very different from other languages, which sometimes require excessive parentheses (*cough* lisp *cough*).
The syntax of many languages seems to be designed so that the parser/compiler can easily read and understand the language; Python's syntax is designed so that it is easier for humans to read and understand ... This may make the compiler slightly harder to write, but you only write a compiler once, and you write code many times.

gosha · #13 · 21st March 2009

I see, thank you for the explanation.
In the meantime, if anyone could direct me to the relevant part of awk or perl I should study first to solve my sorting problem, I'd be very grateful (could not find it on google).

IdOp · #14 · 21st March 2009

awk is great; the syntax is a lot like C, only less finicky about declarations. So if you know C you can get started quickly (and perhaps vice versa).

But long ago in my first brushes with awk, I was very confused and bogged down in the command-line syntax, patterns and pre-defined variables. The big picture was missing, and it really didn't start to click until I realized a simple analogy that made it clear.

So here's my mini-contribution to awk 101. (For those who know awk, allow me the leniency of over-simplification in describing this analogy.) In a language like C, the functions have names. The code within the function block gets executed when the function is called by name, either from another such function or from main().

The analogy is that awk is like this, except the "functions" don't have a name: instead they have a pattern associated with them. The code in a "function" block gets executed when the pattern matches (part of) an input-data line.

To me, that's awk in a nutshell, the rest is details.

(Of course, the "functions" are called "action statements" and awk does have named functions of its own just like in C.)

Happy awking!
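
The pattern/action analogy above can be mimicked in a few lines of Perl. This is only a rough illustration of the idea, not how awk itself is implemented; the rule list and the `run_rules` helper are made up for the example.

```perl
use strict;
use warnings;

# Each rule pairs a pattern with an action, much like awk's  /pattern/ { action }.
my @rules = (
    [ qr/^s/, sub { "starts with s: $_[0]" } ],
    [ qr/zz/, sub { "contains zz: $_[0]" } ],
);

# Mimics awk's implicit loop: for every input line, run each rule whose pattern matches.
sub run_rules {
    my ($rules, @lines) = @_;
    my @out;
    for my $line (@lines) {
        for my $rule (@$rules) {
            my ($pattern, $action) = @$rule;
            push @out, $action->($line) if $line =~ $pattern;
        }
    }
    return @out;
}

print "$_\n" for run_rules(\@rules, qw(abc sss zzz));
```

Here "abc" matches no rule and produces nothing, while "sss" and "zzz" each trigger one action. In awk the loop is built in, so only the rules need to be written.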

gosha · #15 · 7th April 2009

OK guys, I started to read tutorials and all. I've also found a tool which should help with this sorting of mine: Unicode::Collate (from CPAN).
Now I have this test file:

Code:

abc
aab
bbc
mmn
lmn
aaa
ššš
sss
zzz

If I sort it I get this:

Code:

$ sort test
aaa
aab
abc
bbc
lmn
mmn
sss
zzz
ššš

If I sort it with this Perl script, which I worked out from the usage notes of Unicode::Collate, I get this (and it's really slow!):

Code:

aaa
ššš
aab
abc
bbc
lmn
mmn
sss
zzz

As you see, the "ššš" are not after "z" which is already an improvement, but they should be right after "s".
Do I have to explicitly tell Perl where to put them? How?
Here's the script (don't laugh too loud):

Code:

use Unicode::Collate;
$Collator = Unicode::Collate->new(%tailoring);
open (NAMES_FILE, "< path-to-my-file")  or  die "Failed to read file : $! ";
my @not_sorted = <NAMES_FILE>;  # read entire file in the array
@sorted  = $Collator->sort(@not_sorted);
print @sorted;
close (NAMES_FILE);

This is the synopsis of Unicode::Collate, but I'm not grasping it very well yet:

Code:

  use Unicode::Collate;

  #construct
  $Collator = Unicode::Collate->new(%tailoring);

  #sort
  @sorted = $Collator->sort(@not_sorted);

  #compare
  $result = $Collator->cmp($a, $b); # returns 1, 0, or -1.

  # If %tailoring is false (i.e. empty),
  # $Collator should do the default collation.
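
A likely reason for the odd placement (an assumption, since the thread does not say how the test file is encoded): the script reads the file as raw bytes, so a UTF-8 "š" reaches the collator as two separate Latin-1 characters instead of one letter. The sketch below feeds Unicode::Collate decoded strings instead; the `collate_words` helper is a name made up for the example.

```perl
use strict;
use warnings;
use utf8;                             # the literals below are UTF-8 text
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';   # encode characters again on output

# Sort a list of already-decoded strings with the default Unicode collation.
sub collate_words {
    my @words = @_;
    return Unicode::Collate->new->sort(@words);
}

# For a real file, decoding happens at open time, e.g.
#   open my $fh, '<:encoding(UTF-8)', $path or die "Failed to read $path: $!";
my @not_sorted = qw(abc aab bbc mmn lmn aaa ššš sss zzz);
print "$_\n" for collate_words(@not_sorted);
```

With decoded input, "ššš" lands right after "sss". Note that the default collation treats š as an accented s rather than a letter of its own, so a word like "šas" would still sort before "sss"; placing š strictly between s and t needs tailoring or a hand-written sort key.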

gosha · #16 · 9th April 2009

I know, you want me to study and work it out by myself.
Actually I've finally found a good tutorial page on this, I simply did not search with the right key before on google. Here's the link: http://interglacial.com/~sburke/tpj/as_html/tpj14.html

In my personal case, I have two extra letters to sort: š and ū.
I've made this test file:

Code:

abc
aab
bbc
mmn
lmn
aaa
ššš
sss
zzz
ccc
ggg
uuu
šas
saš
cab
uuū
ūuu
ūūū

Here's the code:

Code:

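use strict;
use warnings;
use utf8;   # assumption: this script and the data file are both saved as UTF-8
binmode STDOUT, ':encoding(UTF-8)';
open (my $file, '<:encoding(UTF-8)', "absolute-path-to-file")  or  die "Failed to read file : $! ";
chomp(my @not_sorted = <$file>);   # drop newlines so they cannot collide with a letter code
close ($file);
sub normalize {
   my $in = $_[0];
   $in = lc($in);
   $in =~ tr<aeiouū><aeiouu>;   # fold ū onto u; the other vowels map to themselves
   $in =~ tr<abcdefghijklmnopqrsštuvwxyz>
   <\x01-\x1B>;   # map the 27 letters (š right after s) to the codes 1..27
   return $in;
}
# compare the normalized keys first; fall back to the raw strings to break ties
my @sorted  = sort{ normalize($a) cmp normalize($b) or $a cmp $b} @not_sorted;
print "$_\n" for @sorted;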

I still don't completely understand why ū can be sorted in the proper order without treating it as an extra letter, as has to be done for š, but I eventually will. Anyway, it gives the expected result:

Code:

aaa
aab
abc
bbc
cab
ccc
ggg
lmn
mmn
saš
sss
šas
ššš
uuu
uuū
ūuu
ūūū
zzz

Hope it will be helpful to someone.
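
On the open question about ū: in the script above, the first `tr` folds ū onto u, so words that differ only in that letter get identical keys, and the `or $a cmp $b` fallback then orders them by the raw strings. The sketch below prints the numeric keys for a few words to make this visible. It assumes UTF-8 source text, and `primary_key` is a name chosen for the example.

```perl
use strict;
use warnings;
use utf8;

binmode STDOUT, ':encoding(UTF-8)';

# Fold ū onto u, then rank each letter in a 27-letter alphabet where š follows s.
sub primary_key {
    my ($word) = @_;
    my $key = lc $word;
    $key =~ tr/ū/u/;
    $key =~ tr/abcdefghijklmnopqrsštuvwxyz/\x01-\x1B/;
    return $key;
}

# Show each word next to the letter ranks that make up its key.
for my $word (qw(uuu uuū sss šas)) {
    printf "%s -> %s\n", $word, join ' ', map { ord } split //, primary_key($word);
}
```

"uuu" and "uuū" both come out as 22 22 22, so only the fallback comparison separates them, whereas "šas" starts with rank 20 and therefore sorts after every word beginning with s (rank 19).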
