coreutils `ls`: Investigate printing performance

In seq, we found that directly calling stdout.write_all(str.as_bytes())? is quite a bit faster than doing the same operation through the formatting machinery: write!(stdout, "{str}")?. https://github.com/uutils/coreutils/pull/7562

ls uses a lot of write!(..., "{}", ..)?; patterns. It would be nice to know whether switching to write_all would improve performance.
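To make the two patterns concrete, here is a minimal sketch that writes the same string both ways into an in-memory buffer (a stand-in for the real output writer in ls). The helper names emit_formatted and emit_direct are hypothetical. Both produce identical bytes; the difference is only in how they get there: write! builds fmt::Arguments and dispatches through Display, while write_all copies the bytes straight into the writer.

```rust
use std::io::{self, Write};

/// Formatted path (hypothetical helper): goes through `fmt::Arguments` and `Display for str`.
fn emit_formatted<W: Write>(out: &mut W, s: &str) -> io::Result<()> {
    write!(out, "{s}")
}

/// Direct path (hypothetical helper): hands the raw bytes straight to the writer.
fn emit_direct<W: Write>(out: &mut W, s: &str) -> io::Result<()> {
    out.write_all(s.as_bytes())
}

fn main() -> io::Result<()> {
    let mut a: Vec<u8> = Vec::new();
    let mut b: Vec<u8> = Vec::new();
    emit_formatted(&mut a, "drwxr-xr-x")?;
    emit_direct(&mut b, "drwxr-xr-x")?;
    // Same bytes either way; only the cost of producing them differs.
    assert_eq!(a, b);
    println!("{}", String::from_utf8_lossy(&a));
    Ok(())
}
```

Whether the direct path is measurably faster for ls depends on how hot these calls are, which is what the benchmarks below try to establish.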

I had a very quick look at the samply output, and printing dominates more of the runtime in long-format output (e.g. ls -l), so it might be good to start investigating that use case.

For reference, we're doing a bit worse than coreutils:

$ cargo build -r -p uu_ls && taskset -c 0 hyperfine --warmup 3 -L ls target/release/ls,ls "{ls} -lR .git"
    Finished `release` profile [optimized] target(s) in 0.12s
Benchmark 1: target/release/ls -lR .git
  Time (mean ± σ):      32.5 ms ±   0.8 ms    [User: 18.7 ms, System: 13.5 ms]
  Range (min … max):    31.2 ms …  34.4 ms    84 runs
 
Benchmark 2: ls -lR .git
  Time (mean ± σ):      23.7 ms ±   1.5 ms    [User: 11.3 ms, System: 12.0 ms]
  Range (min … max):    23.0 ms …  38.6 ms    114 runs

Summary
  ls -lR .git ran
    1.37 ± 0.10 times faster than target/release/ls -lR .git

Mar 24 '25 19:03 drinkcat

Are you going to work on it? :)

Mar 24 '25 20:03 sylvestre

I think I'm still deep into printf/seq issues for a while ,-) Happy if somebody else gets to it.

Mar 24 '25 20:03 drinkcat

I tried to benchmark it, and running hyperfine --warmup 3 --min-runs 1000 --max-runs 10000 -L ls target/release/ls,../clean_coreutils/target/release/ls "{ls} -lR .git" gave me very inconclusive results.

Inconsistently, the current version was up to 8-16% quicker than a version in which I changed all write!(out, ...) calls to stdout().write_all(...) calls.

I haven't tried the write!() calls that don't write to stdout, since I first wanted to check just these changes. But maybe you could shed some light on whether I did something wrong, or whether this is just a dead end?

Mar 26 '25 13:03 cerdelen

Ha, that's a bit surprising! Do you have your branch pushed somewhere for us to have a look? (no need to make a PR)

Mar 26 '25 17:03 drinkcat

I had apparently deleted or stashed the changes somewhere, but I redid them quickly:

https://github.com/cerdelen/coreutils/tree/ls_printing_performance

Testing it again on a different machine (Ubuntu 22.04.5 LTS) now shows little to no difference (between 0 and 3%), and which version is quicker varies.

On my macOS M1 machine the differences vary quite a lot, still up to 20%, but in both directions.

I take this to mean that I can't use the laptop for benchmarking at all, maybe because the system isn't "quiet" enough.

But even on the Ubuntu PC the effect of the change doesn't seem to be stable.

Mar 29 '25 13:03 cerdelen

Ah, that's huge variability indeed. I reran the code on my machine, which is also a laptop, but with an Intel chip that has different types of cores. You need to be careful with thermals, so I wouldn't do that many runs: I'd just leave --min-runs alone and let hyperfine sample over a few seconds, as it does by default. I also use taskset -c 0 to force the code to run on a specific type of core.

First, you shouldn't call stdout() everywhere; you should use the BufWriter that is already passed as a parameter to the existing write! calls.
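A minimal sketch of that point, assuming a helper that receives the output writer as a parameter (display_names and its arguments are hypothetical, not the actual ls code). Calling std::io::stdout() inside the helper bypasses the BufWriter: each call goes through the shared global Stdout handle, taking its lock every time, and because those bytes skip the BufWriter they can also come out before data still sitting in its buffer. Writing to the parameter keeps everything in one buffer that is flushed in large chunks.

```rust
use std::io::{self, BufWriter, Write};

/// Hypothetical helper: writes one entry per line to whatever writer it is given.
fn display_names<W: Write>(out: &mut W, names: &[&str]) -> io::Result<()> {
    for name in names {
        // Avoid: io::stdout().write_all(name.as_bytes())?;
        // That bypasses `out`, so the bytes skip the BufWriter entirely.
        out.write_all(name.as_bytes())?;
        out.write_all(b"\n")?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // One buffered writer for the whole program, passed down by reference.
    let mut out = BufWriter::new(io::stdout().lock());
    display_names(&mut out, &["Cargo.toml", "src", "target"])?;
    out.flush()
}
```

Because the helper is generic over Write, it can also be exercised against a plain Vec<u8> in tests.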

Second, even after changing that, I see little difference (maybe 1-2%):

taskset -c 0 hyperfine --warmup 3 -L ls target/release/ls,./ls-main "{ls} -lR .git"
Benchmark 1: target/release/ls -lR .git
  Time (mean ± σ):      38.3 ms ±   1.1 ms    [User: 23.2 ms, System: 14.7 ms]
  Range (min … max):    37.8 ms …  47.2 ms    76 runs
  
Benchmark 2: ./ls-main -lR .git
  Time (mean ± σ):      39.1 ms ±   1.9 ms    [User: 23.2 ms, System: 15.2 ms]
  Range (min … max):    38.4 ms …  55.2 ms    74 runs
  
Summary
  target/release/ls -lR .git ran
    1.02 ± 0.06 times faster than ./ls-main -lR .git
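To read the "1.02 ± 0.06" line: the ratio is the slower mean divided by the faster mean, and the ± term is consistent with standard first-order error propagation, ratio × sqrt((σ_slow/μ_slow)² + (σ_fast/μ_fast)²). That this is exactly how hyperfine computes it is an assumption, but it reproduces the figures above. A small sketch (relative_speed is a hypothetical name):

```rust
/// Ratio of two benchmark means and its propagated uncertainty.
/// Assumes independent measurements and first-order (Gaussian) error propagation.
fn relative_speed(slow_mean: f64, slow_sd: f64, fast_mean: f64, fast_sd: f64) -> (f64, f64) {
    let ratio = slow_mean / fast_mean;
    let rel_err = ((slow_sd / slow_mean).powi(2) + (fast_sd / fast_mean).powi(2)).sqrt();
    (ratio, ratio * rel_err)
}

fn main() {
    // Means and standard deviations (ms) from the run above: ./ls-main vs target/release/ls.
    let (ratio, err) = relative_speed(39.1, 1.9, 38.3, 1.1);
    // Prints "1.02 ± 0.06".
    println!("{ratio:.2} ± {err:.2}");
}
```

Since the 2% gap is well inside one standard deviation of the ratio, this run cannot distinguish the two builds.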

I didn't look at samply, so I don't know if you optimized the right calls, missed some critical ones, etc...

Mar 29 '25 14:03 drinkcat

I'd like to take this up.

Mar 31 '25 19:03 kiran-4444

Looked a bit at this. https://share.firefox.dev/431D6Ov

So starting with display_item_long:

- ~We write everything to output_display, then to output. Is that really worth it, as out is a buffered writer anyway?~ Oh I see, we do rely on the size of the vector.
- There are a lot of write! calls in there that could be converted to write_all.
- It would be good to investigate pad_left and pad_right; maybe there are some optimizations there.
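One direction to try for pad_left/pad_right, sketched under an assumption: if they return a freshly allocated padded String (e.g. built with format! and a width specifier) that is then copied into the output, that intermediate allocation can be skipped by writing the padding and the text straight into the Vec<u8> that collects the line (like output_display). The names write_pad_left and write_pad_right are hypothetical; the width is counted in chars to mirror format!'s `{:>width$}` behaviour.

```rust
use std::io::{self, Write};

/// Hypothetical allocation-free variant of a `format!("{s:>width$}")`-style pad_left:
/// appends right-aligned text directly to `buf`.
fn write_pad_left(buf: &mut Vec<u8>, s: &str, width: usize) {
    // Count chars, not bytes, to match `format!` width semantics.
    let len = s.chars().count();
    buf.extend(std::iter::repeat(b' ').take(width.saturating_sub(len)));
    buf.extend_from_slice(s.as_bytes());
}

/// Hypothetical counterpart for left-aligned (`{s:<width$}`) output.
fn write_pad_right(buf: &mut Vec<u8>, s: &str, width: usize) {
    let len = s.chars().count();
    buf.extend_from_slice(s.as_bytes());
    buf.extend(std::iter::repeat(b' ').take(width.saturating_sub(len)));
}

fn main() -> io::Result<()> {
    // Build one line of long-format output in a reusable buffer...
    let mut line = Vec::new();
    write_pad_right(&mut line, "user", 8);
    write_pad_left(&mut line, "4096", 6);
    line.push(b'\n');
    // ...then hand it to the output writer in a single call.
    let mut out = io::stdout().lock();
    out.write_all(&line)?;
    out.flush()
}
```

If the existing helpers already avoid allocation, the gain here would be small; profiling is the way to tell.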

I'll do some experiments...

Apr 19 '25 16:04 drinkcat

Played a bit with this... There's a bunch of low(-ish) hanging fruit (not just in printing, but also in avoiding some computations...). My fixes aren't really clean; I need to learn more about Rust... https://github.com/drinkcat/coreutils/commits/ls-opt/

Getting within 6% of GNU coreutils:

cargo build -r -p uu_ls && taskset -c 0 hyperfine --warmup 3 -L ls ls,target/release/ls,./ls-main "{ls} -lR .git"
Benchmark 1: ls -lR .git
  Time (mean ± σ):       9.2 ms ±   0.9 ms    [User: 4.2 ms, System: 4.8 ms]
  Range (min … max):     8.9 ms …  23.4 ms    281 runs
  
Benchmark 2: target/release/ls -lR .git
  Time (mean ± σ):       9.8 ms ±   0.3 ms    [User: 4.5 ms, System: 5.1 ms]
  Range (min … max):     9.4 ms …  11.0 ms    260 runs
 
Benchmark 3: ./ls-main -lR .git
  Time (mean ± σ):      12.6 ms ±   1.1 ms    [User: 7.3 ms, System: 5.0 ms]
  Range (min … max):    12.0 ms …  28.0 ms    213 runs
  
Summary
  ls -lR .git ran
    1.06 ± 0.11 times faster than target/release/ls -lR .git
    1.36 ± 0.18 times faster than ./ls-main -lR .git

Apr 19 '25 21:04 drinkcat