
UTF-8 sorting works differently than GNU sort?

Open gripedthumbtacks opened this issue 9 years ago • 1 comments

I am noticing that UTF-8 input files sort much differently in gz-sort than in GNU sort. Is this a known issue, and is there a plan to make gz-sort a drop-in replacement for GNU sort with respect to UTF-8 input? Thanks!

Note: I am attaching a sample input file.

$ cat utf1000.txt | sort | md5sum
e48750df42a4b31030f63d7b61ab2bc7 -

$ cat utf1000.txt | gz-sort | md5sum
eb25fdf69e602183470f7377e0864b62 -

utf1000.txt

gripedthumbtacks, Sep 10 '16 03:09

gz-sort does not support UTF-8 at this time. It only supports LANG=C, which is a simple byte-wise sort. Proper UTF-8 sorting really slows things down too. I would probably merge a patch for optional support though.

keenerd, Oct 09 '17 16:10
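
For reference, a minimal sketch of the byte-wise ordering described in the reply, assuming GNU sort and a POSIX shell. Setting LC_ALL=C makes GNU sort compare raw bytes instead of using locale collation, so uppercase ASCII sorts before lowercase and multi-byte UTF-8 characters sort after all ASCII. The four-line sample is hypothetical and is not the attached utf1000.txt.

```shell
# Hypothetical sample: "b", "é" (UTF-8 bytes 0xC3 0xA9), "A", "a".
sample=$(printf 'b\n\303\251\nA\na\n')

# LC_ALL=C forces a plain byte-wise comparison, the ordering
# the reply says gz-sort uses.
bytewise=$(printf '%s\n' "$sample" | LC_ALL=C sort)
printf '%s\n' "$bytewise"

# To try the same on the reported file (assumes utf1000.txt is present):
#   LC_ALL=C sort utf1000.txt | md5sum
```

If the difference in the report comes only from locale collation, running GNU sort with LC_ALL=C on the same input should bring its output in line with a byte-wise sort. That has not been verified against the attached file here.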