helix
helix copied to clipboard
Add 'alphanumeric sort' commands
Implements #4153 with the crate mentioned in the discussion of the issue.
This adds 2 new commands, sortn and rsortn, corresponding to the existing ones (sort and rsort), and they sort according to https://docs.rs/alphanumeric-sort/1.4.4/alphanumeric_sort/fn.compare_str.html
It's a short (~100L) block of code within the library and the library includes a bunch of other functions I don't think will be used in the future so I think we should vendor this block https://github.com/magiclen/alphanumeric-sort/blob/bea0284409f68407dd25b432fd901b62f6e944e7/src/lib.rs#L95-L199
I think that https://github.com/Aloso/lexical-sort might be better since it claims to support unicode.
Edit: Actually, this is probably good enough.
I removed the dependency, and added a function to compare the strings.
When I copied the source, I realized the handling of long numbers was incorrect (it accumulates it in a f64, thus rounds it). I looked at the other crate, and it had a similar problem (accumulating the number in a u64).
So I implemented my own function. It orders numbers with leading zeroes like the crate alphanumeric-sort, but uses the slice containing the digits instead of a f64 for the comparison
Just to note, I think that using a u64 would've been okay. len() returns a usize which is equivalent to u64 on 64bit systems. Assuming that each number is equivalent to a byte, then it could handle 16,384 petabytes of data.
By the way, make sure to run cargo xtask docgen and commit the changes to make CI pass.
Thanks for the hints 🙂
I think I have not explained the problem with the u64 correctly. When encountering a number,lexical-sort parses the number into a u64, digit by digit, it multiplies the number by 10 and add the current digit (at this line). So for example "18446744073709551616" (=2^64) would cause an overflow. Numbers this long are not very common, but may still happen
I was looking into this kind of thing for a personal project and I don't think that alphanumeric-sort provides true lexicographic ordering. For that, we'd want the Unicode Collation algorithm. I'm currently a little less than halfway into my naive implementation of the algorithm that would be fully compliant so I may post a PR for it in the future. Alternatively, we could use an external crate like https://crates.io/crates/feruca