gz-sort
UTF-8 sorting works differently than GNU sort?
I am noticing that UTF-8 input files sort very differently in gz-sort than in GNU sort. Is this a known issue, and is there a plan to make gz-sort a drop-in replacement for GNU sort with respect to UTF-8 input? Thanks!
Note: I am attaching a sample input file.
$ cat utf1000.txt | sort | md5sum
e48750df42a4b31030f63d7b61ab2bc7  -
$ cat utf1000.txt | gz-sort | md5sum
eb25fdf69e602183470f7377e0864b62  -
gz-sort does not support UTF-8 at this time. It only supports LANG=C, which is a simple byte-wise sort. Proper UTF-8 sorting really slows things down too. I would probably merge a patch for optional support though.
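For comparison, GNU sort can be asked for the same byte-wise ordering by forcing the C locale. The snippet below is only a sketch of that idea: the four sample words are made up for illustration and are not taken from the attached utf1000.txt.

```shell
# Hypothetical sample data; not the utf1000.txt attached to this issue.
sample=$(printf 'zebra\nécole\nApple\napple\n')

# LC_ALL=C makes GNU sort compare raw bytes and ignore locale collation rules.
bytewise=$(printf '%s\n' "$sample" | LC_ALL=C sort)

# Uppercase ASCII sorts before lowercase ASCII, and the multi-byte "é"
# (bytes 0xC3 0xA9) sorts after every ASCII letter.
printf '%s\n' "$bytewise"
```

Running `LC_ALL=C sort` on the real input, instead of plain `sort`, should give the ordering that a byte-wise sort is expected to produce, which is the fairer thing to compare against gz-sort.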