`ls`: Investigate printing performance
In seq, we found that directly calling stdout.write_all(str.as_bytes())? is quite a bit faster than going through the formatting machinery to do the same operation: write!(stdout, "{str}")?. https://github.com/uutils/coreutils/pull/7562
ls uses a lot of write!(..., "{}", ..)?; patterns. Would be nice to know if switching to write_all would improve performance.
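To make the comparison concrete, here is a minimal sketch of the two patterns side by side. The function names are hypothetical and only for illustration; they are not taken from ls or seq. Both produce identical bytes; the only difference is whether the string goes through the `fmt` machinery first.

```rust
use std::io::{self, Write};

/// Goes through the formatting machinery: `write!` expands to
/// `out.write_fmt(format_args!(...))`.
/// (Hypothetical helper name, for illustration only.)
fn print_with_format<W: Write>(out: &mut W, s: &str) -> io::Result<()> {
    write!(out, "{s}")
}

/// Hands the raw bytes of `s` to the writer directly, skipping `fmt`.
/// (Hypothetical helper name, for illustration only.)
fn print_with_write_all<W: Write>(out: &mut W, s: &str) -> io::Result<()> {
    out.write_all(s.as_bytes())
}
```

This only applies where the argument is already a string (or bytes). Calls that format numbers or apply width/precision would still need `write!` or some hand-written replacement.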
Had a very quick look at samply output, and printing dominates more of the runtime in long format outputs (e.g. ls -l), so maybe it'd be good to start investigating that use case.
For reference, we're currently about 1.37x slower than GNU coreutils:
$ cargo build -r -p uu_ls && taskset -c 0 hyperfine --warmup 3 -L ls target/release/ls,ls "{ls} -lR .git"
Finished `release` profile [optimized] target(s) in 0.12s
Benchmark 1: target/release/ls -lR .git
Time (mean ± σ): 32.5 ms ± 0.8 ms [User: 18.7 ms, System: 13.5 ms]
Range (min … max): 31.2 ms … 34.4 ms 84 runs
Benchmark 2: ls -lR .git
Time (mean ± σ): 23.7 ms ± 1.5 ms [User: 11.3 ms, System: 12.0 ms]
Range (min … max): 23.0 ms … 38.6 ms 114 runs
Summary
ls -lR .git ran
1.37 ± 0.10 times faster than target/release/ls -lR .git
Are you going to work on it? :)
I think I'm still deep into printf/seq issues for a while ;-) Happy if somebody else gets to it.
I tried to benchmark it, and running hyperfine --warmup 3 --min-runs 1000 --max-runs 10000 -L ls target/release/ls,../clean_coreutils/target/release/ls "{ls} -lR .git" gave me very inconclusive results.
Inconsistently, the current version was up to 8-16% quicker than a version in which I changed all write!(out, ...) calls to stdout().write_all(...) calls.
I haven't tried the write!() calls that don't go to stdout, since I first wanted to check this change in isolation. Maybe you could shed some light on whether I did something wrong, or whether this is just a dead end?
Ha, that's a bit surprising! Do you have your branch pushed somewhere for us to have a look? (no need to make a PR)
I had apparently deleted or stashed the changes somewhere, but I redid them quickly:
https://github.com/cerdelen/coreutils/tree/ls_printing_performance
Testing it again on a different machine (Ubuntu 22.04.5 LTS) now shows little to no difference (between 0 and 3%), and which version is quicker varies.
On my macOS M1 machine the differences vary quite a lot, still up to 20%, but also in both directions.
I take this to mean that I can't use the laptop for benchmarking, as the system may not be "quiet" enough.
But even on the Ubuntu PC the difference doesn't seem to be stable.
Ah, that's huge variability indeed. I reran the code on my machine. It's also a laptop, but with an Intel chip that has different types of cores. You need to be careful with thermals (so I wouldn't do that many runs: I'd just leave --min-runs alone and let hyperfine sample over a few seconds as it does by default), and I use taskset -c 0 to force the code to run on a specific type of core.
First thing, you shouldn't call stdout() everywhere; you should use the BufWriter that is passed as a parameter to the existing write! calls.
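A rough sketch of that point, with hypothetical function names (this is not the actual ls code): the buffered writer is created once and threaded through as a parameter, so each small write lands in the in-memory buffer. Calling `io::stdout()` at every call site would instead take the stdout lock and go through std's line-buffered handle for every fragment.

```rust
use std::io::{self, BufWriter, Write};

/// Hypothetical stand-in for a per-entry print function.
/// It writes to whatever writer the caller hands in.
fn display_name<W: Write>(out: &mut W, name: &str) -> io::Result<()> {
    out.write_all(name.as_bytes())?;
    out.write_all(b"\n")
}

/// Hypothetical driver: wraps stdout in a single `BufWriter` and
/// reuses it for every entry, flushing once at the end.
fn list_all(names: &[&str]) -> io::Result<()> {
    let mut out = BufWriter::new(io::stdout());
    for name in names {
        display_name(&mut out, name)?;
    }
    out.flush()
}
```

Taking `W: Write` rather than a concrete `BufWriter<Stdout>` is a design choice for the sketch: it keeps the function testable against a `Vec<u8>`.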
Second thing, even after changing that, I see little difference (1-2% maybe):
taskset -c 0 hyperfine --warmup 3 -L ls target/release/ls,./ls-main "{ls} -lR .git"
Benchmark 1: target/release/ls -lR .git
Time (mean ± σ): 38.3 ms ± 1.1 ms [User: 23.2 ms, System: 14.7 ms]
Range (min … max): 37.8 ms … 47.2 ms 76 runs
Benchmark 2: ./ls-main -lR .git
Time (mean ± σ): 39.1 ms ± 1.9 ms [User: 23.2 ms, System: 15.2 ms]
Range (min … max): 38.4 ms … 55.2 ms 74 runs
Summary
target/release/ls -lR .git ran
1.02 ± 0.06 times faster than ./ls-main -lR .git
I didn't look at samply, so I don't know if you optimized the right calls, missed some critical ones, etc...
I'd like to take this up.
Looked a bit at this. https://share.firefox.dev/431D6Ov
So starting with display_item_long:
- ~We write everything to `output_display`, then to `output`. Is that really worth it as `out` is a buffered writer anyway?~ Oh I see, we do rely on the size of the vector.
- There are a lot of `write!` calls in there that can be converted to `write_all`
- Would be good investigating `pad_left` and `pad_right`, maybe there's some optimizations there.
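One possible direction for the padding and scratch-buffer points, as a sketch only. All names below are hypothetical; this assumes padding is currently done by formatting into a fresh `String`, which may not match the real `pad_left`/`pad_right`. The helpers push spaces straight into a reusable `Vec<u8>`, and the line is then copied to the writer with one `write_all`, so the vector length is still available afterwards.

```rust
use std::io::{self, Write};

/// Right-aligns `s` in a field of `width` characters by appending to `buf`.
/// Hypothetical helper; width is counted in `char`s, like `{:>width$}`.
fn pad_left_into(buf: &mut Vec<u8>, s: &str, width: usize) {
    let len = s.chars().count();
    buf.extend(std::iter::repeat(b' ').take(width.saturating_sub(len)));
    buf.extend_from_slice(s.as_bytes());
}

/// Left-aligns `s` in a field of `width` characters by appending to `buf`.
/// Hypothetical helper; width is counted in `char`s, like `{:<width$}`.
fn pad_right_into(buf: &mut Vec<u8>, s: &str, width: usize) {
    let len = s.chars().count();
    buf.extend_from_slice(s.as_bytes());
    buf.extend(std::iter::repeat(b' ').take(width.saturating_sub(len)));
}

/// Builds one line in a reusable scratch buffer, writes it with a single
/// `write_all`, and returns the number of bytes in the line.
/// The fixed width of 8 is an arbitrary value for the example.
fn write_line<W: Write>(
    out: &mut W,
    scratch: &mut Vec<u8>,
    size: &str,
    name: &str,
) -> io::Result<usize> {
    scratch.clear(); // keeps the allocation from the previous line
    pad_left_into(scratch, size, 8);
    scratch.push(b' ');
    scratch.extend_from_slice(name.as_bytes());
    scratch.push(b'\n');
    out.write_all(scratch.as_slice())?;
    Ok(scratch.len())
}
```

Whether this is actually faster than the existing helpers would need to be measured; counting `char`s is also not the same as display width for wide characters.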
I'll do some experiments...
Played a bit with this... There's a bunch of low(-ish) hanging fruit (not just in printing, but also in terms of avoiding computations...). My fixes aren't really clean, I need to learn more about Rust... https://github.com/drinkcat/coreutils/commits/ls-opt/
Getting within 6% of GNU coreutils:
cargo build -r -p uu_ls && taskset -c 0 hyperfine --warmup 3 -L ls ls,target/release/ls,./ls-main "{ls} -lR .git"
Benchmark 1: ls -lR .git
Time (mean ± σ): 9.2 ms ± 0.9 ms [User: 4.2 ms, System: 4.8 ms]
Range (min … max): 8.9 ms … 23.4 ms 281 runs
Benchmark 2: target/release/ls -lR .git
Time (mean ± σ): 9.8 ms ± 0.3 ms [User: 4.5 ms, System: 5.1 ms]
Range (min … max): 9.4 ms … 11.0 ms 260 runs
Benchmark 3: ./ls-main -lR .git
Time (mean ± σ): 12.6 ms ± 1.1 ms [User: 7.3 ms, System: 5.0 ms]
Range (min … max): 12.0 ms … 28.0 ms 213 runs
Summary
ls -lR .git ran
1.06 ± 0.11 times faster than target/release/ls -lR .git
1.36 ± 0.18 times faster than ./ls-main -lR .git