
Sorting is not locale based or unicode aware

Open fer22f opened this issue 5 years ago • 9 comments

My locale is pt_BR.utf-8. Running ls gives the following output, which sorts accented names correctly:

$ ls -l
drwxr-xr-x  2 fernando fernando   4096 jun 26 17:02 'área de trabalho'
drwxr-xr-x  3 fernando fernando   4096 jun 26 15:28  documentários
drwxr-xr-x  2 fernando fernando   4096 jul  6 17:40  documentos
drwxr-xr-x 11 fernando fernando   4096 ago 28 09:29  downloads
...

However, exa doesn't:

$ exa -l
drwxr-xr-x    - fernando  6 jul 17:40 documentos
drwxr-xr-x    - fernando 26 jun 15:28 documentários
drwxr-xr-x    - fernando 28 ago  9:29 downloads
...
drwxr-xr-x    - fernando 26 jun 17:02 área de trabalho

Another example is how ls handles punctuation:

$ ls -l
-rw-r--r-- 1 fernando fernando  773 ago 21 00:30 Reactive-Extensions-Examples.md
-rw-r--r-- 1 fernando fernando  169 ago 18 14:27 _sidebar.md
-rw-r--r-- 1 fernando fernando 1425 ago 17 20:10 Summary-of-Simplicity-Matters.md

While exa gives a different ordering:

$ exa -l
.rw-r--r--  169 fernando 18 ago 14:27 _sidebar.md
.rw-r--r--  773 fernando 21 ago  0:30 Reactive-Extensions-Examples.md
.rw-r--r-- 1,4k fernando 17 ago 20:10 Summary-of-Simplicity-Matters.md

This is actually different from what I was expecting, as ls -v sorts _ at the end:

$ ls -lv
-rw-r--r-- 1 fernando fernando  773 ago 21 00:30 Reactive-Extensions-Examples.md
-rw-r--r-- 1 fernando fernando 1425 ago 17 20:10 Summary-of-Simplicity-Matters.md
-rw-r--r-- 1 fernando fernando  169 ago 18 14:27 _sidebar.md

(it seems to me that exa ignores case differently here; while ls uppercases everything, exa downcases everything).
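To make the case-folding guess concrete, here is a minimal Rust sketch. It does not show what exa or ls actually do internally; it only models the hypothesis above. The helper names `sorted_lowercased` and `sorted_uppercased` are made up for illustration. In ASCII, `_` (0x5F) sits between the uppercase letters (0x41-0x5A) and the lowercase letters (0x61-0x7A), so the direction of case folding decides whether `_sidebar.md` lands first or last:

```rust
// Hypothetical helpers, not taken from exa or coreutils.

/// Sort names by a lowercased key: '_' (0x5F) compares below 'r' (0x72).
fn sorted_lowercased(names: &[&str]) -> Vec<String> {
    let mut v: Vec<String> = names.iter().map(|s| s.to_string()).collect();
    v.sort_by_key(|s| s.to_lowercase());
    v
}

/// Sort names by an uppercased key: '_' (0x5F) compares above 'S' (0x53).
fn sorted_uppercased(names: &[&str]) -> Vec<String> {
    let mut v: Vec<String> = names.iter().map(|s| s.to_string()).collect();
    v.sort_by_key(|s| s.to_uppercase());
    v
}

fn main() {
    let names = [
        "Reactive-Extensions-Examples.md",
        "_sidebar.md",
        "Summary-of-Simplicity-Matters.md",
    ];
    // Same order as the `exa -l` listing above.
    println!("{:?}", sorted_lowercased(&names));
    // Same order as the `ls -lv` listing above.
    println!("{:?}", sorted_uppercased(&names));
}
```

For these three names, a plain byte-wise comparison without any case folding gives the same result as the uppercased key, since they all start with an uppercase letter or `_`.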

fer22f avatar Aug 29 '18 02:08 fer22f

There is no easy way to sort according to a locale in Rust. Sorting is simply handled by Natord.

I’d like to see some collation library in pure Rust or good bindings to ICU, but until then I don’t think we can do anything about it.
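For anyone wondering why the accented names end up where they do, here is a small std-only sketch. It is not exa's code and it is not real collation. The first function sorts by Rust's default string ordering (byte-wise over UTF-8, which is equivalent to code point order), and any non-ASCII letter such as `á` (U+00E1) sorts after every ASCII letter. The second function uses a hypothetical `fold_accent` table that covers only a few Portuguese letters, just to show what a locale-like order would look like for this one example. The names are assumed to be in precomposed (NFC) form.

```rust
/// Hypothetical helper: maps a few precomposed accented letters to a base
/// letter. A real collator (e.g. ICU) does far more than this.
fn fold_accent(c: char) -> char {
    match c {
        '\u{e0}' | '\u{e1}' | '\u{e2}' | '\u{e3}' => 'a', // à á â ã
        '\u{e9}' | '\u{ea}' => 'e',                       // é ê
        '\u{ed}' => 'i',                                  // í
        '\u{f3}' | '\u{f4}' | '\u{f5}' => 'o',            // ó ô õ
        '\u{fa}' => 'u',                                  // ú
        '\u{e7}' => 'c',                                  // ç
        other => other,
    }
}

/// Sort key: lowercase every char, then strip the accents listed above.
fn folded_key(name: &str) -> String {
    name.chars()
        .flat_map(|c| c.to_lowercase())
        .map(fold_accent)
        .collect()
}

/// Default `str` ordering: code point order, no locale involved.
fn sort_by_code_point(names: &[&str]) -> Vec<String> {
    let mut v: Vec<String> = names.iter().map(|s| s.to_string()).collect();
    v.sort();
    v
}

/// Crude approximation of a locale-aware order via the folded key.
fn sort_by_folded_key(names: &[&str]) -> Vec<String> {
    let mut v: Vec<String> = names.iter().map(|s| s.to_string()).collect();
    v.sort_by_key(|s| folded_key(s));
    v
}

fn main() {
    // "\u{e1}" is 'á', written as an escape to guarantee the precomposed form.
    let names = [
        "documentos",
        "document\u{e1}rios",
        "downloads",
        "\u{e1}rea de trabalho",
    ];
    println!("{:?}", sort_by_code_point(&names)); // order of the exa listing
    println!("{:?}", sort_by_folded_key(&names)); // order of the ls listing
}
```

The folded key happens to reproduce the ls order for these four names, but it breaks down quickly: it has no tie-breaking between accented and unaccented variants, ignores decomposed characters, and knows nothing about language-specific rules. That is why a proper collation library is needed.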

ariasuni avatar Dec 16 '18 05:12 ariasuni

oof, as someone who also uses a non-C locale this unfortunately doesn't make exa a suitable ls replacement yet

DanScharon avatar Sep 04 '20 09:09 DanScharon

ICU4X has been announced and could be the solution to this problem. It will probably be a long while until a production-ready version is released, though.

ariasuni avatar Oct 28 '20 21:10 ariasuni

Note that this bug has been reported to Debian as well: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=950862

ariasuni avatar Apr 18 '21 03:04 ariasuni

(I know this is a closed issue, but figured I'd toss a quick update in here for anyone looking through this later on.)

ICU4X 0.4 was released 2021-11-01, and the current roadmap and requirements doc project a 1.0 release around Q2 of 2022.

nestor-custodio avatar Dec 21 '21 13:12 nestor-custodio

It’s not a closed issue at all. I would be thrilled to integrate ICU4X into exa, if it matches our needs. Unfortunately, it seems that there’s still no support for anything related to collation, so even if we use it for other things (like datetime), this issue won’t be solved anytime soon, I’m afraid.

Edit: ah, I see that a Collator component is on their roadmap for 0.5, not sure when it’ll happen but yeah hopefully we can use that sometime in 2022 :crossed_fingers:

ariasuni avatar Dec 27 '21 13:12 ariasuni

ICU4X 0.6 was released and 1.0 is in beta. The changelog doesn't mention collation, but the source repository seems to include references to collation. Do you know if it's usable by now?

hydrargyrum avatar Sep 19 '22 18:09 hydrargyrum

The ICU4X Collator component work keeps getting bumped and didn't make it into the 0.6 release. It's currently slated for 1.0, but the current 1.0-beta only includes a partial implementation, so we'll just have to see.

nestor-custodio avatar Sep 22 '22 14:09 nestor-custodio