Should ns.PATH be on by default?
Hey there,
Thanks for the great package!
I was wondering why ns.PATH isn't on by default when using natsorted.
It would seem intuitive that a list of paths would be naturally sorted, and actually had to debug to realize I need to use os_sorted or natsort(files, alg=PATH).
Is this intentional?
If not, I'd be happy to submit a PR
Yes, this is intentional, for two reasons.
- The
ns.PATHalgotithm is significantly slower than the default algorithm, both in the initial application of thekeyand also in the sorting itself. - There are many different definitions of "natural sort", and what seems like the obvious definition to one person is a counter-intuitive definition for another. When
natsortwas first releasedns.FLOATwas the default sorting algorithm and this caused lots of issues, and people created competing libraries because they thoughtnatsortwas buggy (instead of taking the time to understand what was going on like you have, sigh...).- I now believe the best course of action is to use the most "minimal" definition of a natural sort as the defautl, one that will cause the least amount of surprise for the most number of users, but then provide expanded definitions (like
ns.PATH) when needed.
- I now believe the best course of action is to use the most "minimal" definition of a natural sort as the defautl, one that will cause the least amount of surprise for the most number of users, but then provide expanded definitions (like
Thanks for the clarifcation Seth!
Do you have perhaps an example of where this logic is counterintuitive?
I had a list of paths that looks like /a/b/1/something, /a/b/2/something, etc, which I expected would correctly be sorted with nat-sort, but /a/b/10/something followed /a/b/1/something unfortunately
Hmm... I'm confused. "10" should follow "1", and also for the example values you provided the default sorting and ns.PATH provide the same results.
% python
>>> x = ["/a/b/1/something", "/a/b/2/something", "/a/b/10/something"]
>>> sorted(x)
['/a/b/1/something', '/a/b/10/something', '/a/b/2/something']
>>> import natsort
>>> natsort.natsorted(x)
['/a/b/1/something', '/a/b/2/something', '/a/b/10/something']
>>> natsort.natsorted(x, alg=natsort.PATH)
['/a/b/1/something', '/a/b/2/something', '/a/b/10/something']
>>> natsort.__version__
'8.1.0'
Perhaps you could elaborate more on the problem you actually encountered?
Also, to be clear, ns.PATH was added to support the problem described here: https://github.com/SethMMorton/natsort/issues/3. Unless you have data that looks like this then ns.PATH shouldn't actually be needed.
Interesting, it seems the issue lies with using Path objects for sorting:
>>> from natsort import natsorted, PATH
>>> from pathlib import Path
>>> x = ["/a/b/1/something", "/a/b/2/something", "/a/b/10/something"]
>>> natsorted(x)
['/a/b/1/something', '/a/b/2/something', '/a/b/10/something']
>>> natsorted(Path(f) for f in x)
[WindowsPath('/a/b/1/something'), WindowsPath('/a/b/10/something'), WindowsPath('/a/b/2/something')]
>>> natsorted((Path(f) for f in x), alg=PATH)
[WindowsPath('/a/b/1/something'), WindowsPath('/a/b/2/something'), WindowsPath('/a/b/10/something')]
This matches my case since I was sorting the output of d.iterdir() directly. I misattributed the cause.
Perhaps pathlib.Path should be special-cased to be cast to a string during key generation?
This would be useful in my case at least, and to me sounds more intuitive than the current behaviour. WDYT?
Ah yes. This is expected but surprising behavior (see https://github.com/SethMMorton/natsort/issues/16 for someone else suggesting something similar). The suggestion in that issue was to auto-use ns.PATH if Path objects are found, but that won't work (trust me on this).
However, auto-converting PurePath objects to str if ns.PATH is not given would be fine, so long as it does not significantly degrade performance. If you want to submit a PR to do that that would be great.
<unrelated rant that is not your fault> I really feel like the decision to make the pathlib.Path object not a subclass of str was a bad design decision that has caused more trouble than was worth. Here is just one of the ramifications. </unrelated rant that is not your fault>
Closed by #153