helix icon indicating copy to clipboard operation
helix copied to clipboard

Add 'alphanumeric sort' commands

Open seb-bl opened this issue 3 years ago • 6 comments

Implements #4153 with the crate mentioned in the discussion of the issue.

This adds 2 new commands, sortn and rsortn, corresponding to the existing ones (sort and rsort), and they sort according to https://docs.rs/alphanumeric-sort/1.4.4/alphanumeric_sort/fn.compare_str.html

seb-bl avatar Oct 15 '22 11:10 seb-bl

It's a short (~100L) block of code within the library and the library includes a bunch of other functions I don't think will be used in the future so I think we should vendor this block https://github.com/magiclen/alphanumeric-sort/blob/bea0284409f68407dd25b432fd901b62f6e944e7/src/lib.rs#L95-L199

the-mikedavis avatar Oct 15 '22 16:10 the-mikedavis

I think that https://github.com/Aloso/lexical-sort might be better since it claims to support unicode.

Edit: Actually, this is probably good enough.

kirawi avatar Oct 15 '22 17:10 kirawi

I removed the dependency, and added a function to compare the strings.

When I copied the source, I realized the handling of long numbers was incorrect (it accumulates it in a f64, thus rounds it). I looked at the other crate, and it had a similar problem (accumulating the number in a u64).

So I implemented my own function. It orders numbers with leading zeroes like the crate alphanumeric-sort, but uses the slice containing the digits instead of a f64 for the comparison

seb-bl avatar Oct 15 '22 22:10 seb-bl

Just to note, I think that using a u64 would've been okay. len() returns a usize which is equivalent to u64 on 64bit systems. Assuming that each number is equivalent to a byte, then it could handle 16,384 petabytes of data.

kirawi avatar Oct 15 '22 22:10 kirawi

By the way, make sure to run cargo xtask docgen and commit the changes to make CI pass.

kirawi avatar Oct 15 '22 22:10 kirawi

Thanks for the hints 🙂

I think I have not explained the problem with the u64 correctly. When encountering a number,lexical-sort parses the number into a u64, digit by digit, it multiplies the number by 10 and add the current digit (at this line). So for example "18446744073709551616" (=2^64) would cause an overflow. Numbers this long are not very common, but may still happen

seb-bl avatar Oct 15 '22 22:10 seb-bl

I was looking into this kind of thing for a personal project and I don't think that alphanumeric-sort provides true lexicographic ordering. For that, we'd want the Unicode Collation algorithm. I'm currently a little less than halfway into my naive implementation of the algorithm that would be fully compliant so I may post a PR for it in the future. Alternatively, we could use an external crate like https://crates.io/crates/feruca

kirawi avatar May 30 '24 13:05 kirawi