Natural sorting option for file_server file `browse` file listings
Issue Details
For files with regions of numbers and other characters, it's intuitive to have the numbers sorted by increasing numeric value rather than by ASCII code. (Jeff Atwood has a nice article here: https://blog.codinghorror.com/sorting-for-humans-natural-sort-order/)
Current behavior:
[chandler@oscar:~/projects/caddy/cmd/caddy]$ ls test
foo1 foo10 foo100 foo101 foo199 foo2 foo200 foo201 foo3 foo99
[chandler@oscar:~/projects/caddy/cmd/caddy]$ go run . file-server --root test --browse --listen :8000
Desired behavior:
[chandler@oscar:~/projects/caddy/cmd/caddy]$ ls -v test
foo1 foo2 foo3 foo10 foo99 foo100 foo101 foo199 foo200 foo201
[chandler@oscar:~/projects/caddy/cmd/caddy]$ go run . file-server --root test --browse --listen :8000
With the idea that code can be more precise than description, here's a likely-not-production-ready patch that accomplishes natural sorting. I'd be happy to PR in any variant of this if that's desirable.
I stole the code without modification from https://github.com/fvbommel/sortorder/blob/main/natsort.go (MIT licensed), as a quick skim suggested it would do what we wanted and be more performant than other implementations (e.g. https://github.com/facette/natsort/blob/master/natsort.go, https://github.com/skarademir/naturalsort/blob/master/naturalsort.go) that used regular expressions for chunk splitting. https://github.com/maruel/natural/blob/main/natsort.go may also be comparable in performance though I haven't reviewed it for correctness, and I haven't done any performance testing on any of these options, though a benchmark claims the specific one I included may be within a factor of two on performance vs strings.Sort.
Potential patch:
diff --git a/modules/caddyhttp/fileserver/browsetplcontext.go b/modules/caddyhttp/fileserver/browsetplcontext.go
index b9489c6a..339eea21 100644
--- a/modules/caddyhttp/fileserver/browsetplcontext.go
+++ b/modules/caddyhttp/fileserver/browsetplcontext.go
@@ -325,11 +325,72 @@ type (
byTime browseTemplateContext
)
+func isDigit(b byte) bool { return '0' <= b && b <= '9' }
+
+// naturalLess compares two strings using natural ordering. This means that e.g.
+// "abc2" < "abc12".
+//
+// Non-digit sequences and numbers are compared separately. The former are
+// compared bytewise, while digits are compared numerically (except that
+// the number of leading zeros is used as a tie-breaker, so e.g. "2" < "02")
+//
+// Limitation: only ASCII digits (0-9) are considered.
+func naturalLess(str1, str2 string) bool {
+ idx1, idx2 := 0, 0
+ for idx1 < len(str1) && idx2 < len(str2) {
+ c1, c2 := str1[idx1], str2[idx2]
+ dig1, dig2 := isDigit(c1), isDigit(c2)
+ switch {
+ case dig1 != dig2: // Digits before other characters.
+ return dig1 // True if LHS is a digit, false if the RHS is one.
+ case !dig1: // && !dig2, because dig1 == dig2
+ // UTF-8 compares bytewise-lexicographically, no need to decode
+ // codepoints.
+ if c1 != c2 {
+ return c1 < c2
+ }
+ idx1++
+ idx2++
+ default: // Digits
+ // Eat zeros.
+ for ; idx1 < len(str1) && str1[idx1] == '0'; idx1++ {
+ }
+ for ; idx2 < len(str2) && str2[idx2] == '0'; idx2++ {
+ }
+ // Eat all digits.
+ nonZero1, nonZero2 := idx1, idx2
+ for ; idx1 < len(str1) && isDigit(str1[idx1]); idx1++ {
+ }
+ for ; idx2 < len(str2) && isDigit(str2[idx2]); idx2++ {
+ }
+ // If lengths of numbers with non-zero prefix differ, the shorter
+ // one is less.
+ if len1, len2 := idx1-nonZero1, idx2-nonZero2; len1 != len2 {
+ return len1 < len2
+ }
+ // If they're equally long, string comparison is correct.
+ if nr1, nr2 := str1[nonZero1:idx1], str2[nonZero2:idx2]; nr1 != nr2 {
+ return nr1 < nr2
+ }
+ // Otherwise, the one with less zeros is less.
+ // Because everything up to the number is equal, comparing the index
+ // after the zeros is sufficient.
+ if nonZero1 != nonZero2 {
+ return nonZero1 < nonZero2
+ }
+ }
+ // They're identical so far, so continue comparing.
+ }
+ // So far they are identical. At least one is ended. If the other continues,
+ // it sorts last.
+ return len(str1) < len(str2)
+}
+
func (l byName) Len() int { return len(l.Items) }
func (l byName) Swap(i, j int) { l.Items[i], l.Items[j] = l.Items[j], l.Items[i] }
func (l byName) Less(i, j int) bool {
- return strings.ToLower(l.Items[i].Name) < strings.ToLower(l.Items[j].Name)
+ return naturalLess(strings.ToLower(l.Items[i].Name), strings.ToLower(l.Items[j].Name))
}
func (l byNameDirFirst) Len() int { return len(l.Items) }
@@ -338,7 +399,7 @@ func (l byNameDirFirst) Swap(i, j int) { l.Items[i], l.Items[j] = l.Items[j], l.
func (l byNameDirFirst) Less(i, j int) bool {
// sort by name if both are dir or file
if l.Items[i].IsDir == l.Items[j].IsDir {
- return strings.ToLower(l.Items[i].Name) < strings.ToLower(l.Items[j].Name)
+ return naturalLess(strings.ToLower(l.Items[i].Name), strings.ToLower(l.Items[j].Name))
}
// sort dir ahead of file
return l.Items[i].IsDir
Assistance Disclosure
AI not used
If AI was used, describe the extent to which it was used.
No response
Can I work on this??
It looks like @ChandlerSwift already made an implementation?
"made an implementation" may be a bit strong—I will say that I have a hack that appears to do what I need! 🙂 This is my first time making any changes to Caddy, so I figured I'd solicit some feedback before I'd consider any of this mergeable.
A few specific concerns I have:
- Performance: What kind of testing should I be doing to make sure that this doesn't offer any (substantial) performance regressions? Assuming this does add some runtime cost, how much is tolerable? (may depend on the next point)
- Configurability: Should this behavior be the default, or should it be an option to be enabled? If so, should the administrator configure it (say, in the Caddyfile), or should this be exposed to the person viewing the HTML output, as e.g. sort order is?
- Licensing: The snippet I pulled in is MIT licensed; I think that means that if I include the requisite notice we're safe (since sublicensing is allowed, specifically), but I do want to confirm before pulling in additional code.
- Code organization: I haven't read enough of Caddy's code to get a feel for how code should be organized; should I break this off into a separate file or something?
- Testing: Should I be adding any tests along with these changes?
If this is something that looks like it might be accepted into Caddy, I'd also be happy to open up a PR if that will make discussion easier! I just figured I'd lead with an issue, per CONTRIBUTING.md: https://github.com/caddyserver/caddy/blob/806fef85bedfc3fe178560afe6212c8ae90d3ebe/.github/CONTRIBUTING.md?plain=1#L32
Performance: What kind of testing should I be doing to make sure that this doesn't offer any (substantial) performance regressions? Assuming this does add some runtime cost, how much is tolerable? (may depend on the next point)
I actually use https://pkg.go.dev/github.com/maruel/natural, which you mentioned above, in another project already, and so far it seems to be correct and efficient.
Configurability: Should this behavior be the default, or should it be an option to be enabled? If so, should the administrator configure it (say, in the Caddyfile), or should this be exposed to the person viewing the HTML output, as e.g. sort order is?
I think it's okay if it's the default. We can add configurability if it's requested. But I don't know if anyone cares at this point.
Maybe only the HTML output should be natural sorted though.
Licensing: The snippet I pulled in is MIT licensed; I think that means that if I include the requisite notice we're safe (since sublicensing is allowed, specifically), but I do want to confirm before pulling in additional code.
Yes I think so; so is the maruel library.
Code organization: I haven't read enough of Caddy's code to get a feel for how code should be organized; should I break this off into a separate file or something?
Wouldn't hurt, yeah.
Testing: Should I be adding any tests along with these changes?
We always approve of writing tests! But, you don't need to test the library code.
Please feel free to submit a PR at this point and we'll review it!
I've tested the implementation for my own web/fileserver.
With normal version number style filenames, things look good, but when numbers are real decimal numbers, it doesn't quite work. I think it is "how it should be" in that the algorithm assumes more digits means larger numbers. But in the context of a real decimal number it is not correct.
Version number sort means:
- 1.0.1
- 1.0.2
- 1.0.11
- 1.1.0
- 1.2.0
Decimal number sort should be:
- 0.100
- 0.10001
- 0.1001
- 0.101
- 0.123
Perhaps there ought to be a Caddyfile option for specifying what type of sort that should be applied for specific directory trees?
Good example:
https://mirrors.tnonline.net/btrfs/btrfs-progs/sources/kernel.org/
Bad example:
https://mirrors.tnonline.net/misc/img-compare/001_Juvenile_leopard_in_the_Serengeti_National_Park_Photo_by_Giles_Laurent/640x427/
Oof... yeah, this is one thing that I don't like about natural sort, and which is why I have hesitated to merge the associated PR... the algorithm is ambiguous, like it's not clear how to sort properly in certain situations.