fd
3x~10x performance regression between 7.2.0 and >=7.3.0 on large folders
Noticed that some fd commands run much slower (about 10x) after I upgraded my local fd from 6.2.0 to the newest 8.3.2, so I did a quick version bisect.
It looks like the regression is between 7.2.0 and 7.3.0; every version I've tested after 7.3.0 (7.4.0, 7.5.0, 8.0.0, 8.1.1, 8.3.2) runs at about the same speed as 7.3.0.
Reproduce script:
set -e
wget -q https://github.com/sharkdp/fd/releases/download/v7.2.0/fd-v7.2.0-x86_64-unknown-linux-musl.tar.gz
tar -xf fd-v7.2.0-x86_64-unknown-linux-musl.tar.gz
wget -q https://github.com/sharkdp/fd/releases/download/v7.3.0/fd-v7.3.0-x86_64-unknown-linux-musl.tar.gz
tar -xf fd-v7.3.0-x86_64-unknown-linux-musl.tar.gz
hyperfine --version
hyperfine \
--warmup 5 \
'./fd-v7.2.0-x86_64-unknown-linux-musl/fd ".*camera_hal.*" ~/chromiumos/src' \
'./fd-v7.3.0-x86_64-unknown-linux-musl/fd ".*camera_hal.*" ~/chromiumos/src'
(I'm using the Chrome OS source tree as an example here, but I can reproduce a similar regression on other large source trees, for example the Linux source tree.)
Result:
- On a VPS without SSD, with 24 cores/96 hyperthreads:
hyperfine 1.11.0
Benchmark #1: ./fd-v7.2.0-x86_64-unknown-linux-musl/fd ".*camera_hal.*" ~/chromiumos/src
Time (mean ± σ): 2.468 s ± 0.058 s [User: 90.829 s, System: 115.032 s]
Range (min … max): 2.402 s … 2.555 s 10 runs
Benchmark #2: ./fd-v7.3.0-x86_64-unknown-linux-musl/fd ".*camera_hal.*" ~/chromiumos/src
Time (mean ± σ): 25.529 s ± 0.328 s [User: 222.856 s, System: 1924.844 s]
Range (min … max): 24.980 s … 26.091 s 10 runs
Summary
'./fd-v7.2.0-x86_64-unknown-linux-musl/fd ".*camera_hal.*" ~/chromiumos/src' ran
10.34 ± 0.28 times faster than './fd-v7.3.0-x86_64-unknown-linux-musl/fd ".*camera_hal.*" ~/chromiumos/src'
- On my local laptop with SSD, with 4 cores/8 hyperthreads:
hyperfine 1.13.0
Benchmark 1: ./fd-v7.2.0-x86_64-unknown-linux-musl/fd ".*camera_hal.*" ~/chromiumos/src
Time (mean ± σ): 2.348 s ± 0.101 s [User: 10.347 s, System: 6.298 s]
Range (min … max): 2.237 s … 2.527 s 10 runs
Benchmark 2: ./fd-v7.3.0-x86_64-unknown-linux-musl/fd ".*camera_hal.*" ~/chromiumos/src
Time (mean ± σ): 6.882 s ± 0.090 s [User: 44.010 s, System: 6.813 s]
Range (min … max): 6.783 s … 7.065 s 10 runs
Summary
'./fd-v7.2.0-x86_64-unknown-linux-musl/fd ".*camera_hal.*" ~/chromiumos/src' ran
2.93 ± 0.13 times faster than './fd-v7.3.0-x86_64-unknown-linux-musl/fd ".*camera_hal.*" ~/chromiumos/src'
Also tried adding --color=never and the results are similar. From the changelog, the only other suspect is the --exec-batch command?
Happy to provide additional testing / debug info if needed.
I can reproduce that here, but with -j1 the performance is the same. I think this is https://github.com/sharkdp/fd/issues/710, and the cause is just the musl version being upgraded as a result of Rust being updated. Or maybe this is around when Rust stopped using jemalloc by default.
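The effect of the thread count can be checked by benchmarking both releases at several -j values. This is a minimal sketch, assuming the installed hyperfine accepts more than one --parameter-list (it then runs every combination); compare_threads is a hypothetical helper name, and the paths follow the reproduce script above.

```shell
# Benchmark fd 7.2.0 vs 7.3.0 (musl builds) at the given thread counts.
# $1 is a comma-separated list of values for fd's -j option, e.g. "1,8".
# Assumes the tarballs from the reproduce script are already unpacked
# in the current directory.
compare_threads() {
    hyperfine --warmup 5 \
        --parameter-list ver 7.2.0,7.3.0 \
        --parameter-list threads "$1" \
        './fd-v{ver}-x86_64-unknown-linux-musl/fd -j{threads} ".*camera_hal.*" ~/chromiumos/src'
}

# Example (not run automatically):
# compare_threads 1,8
```

If the gap between the two versions disappears at -j1 and grows with the thread count, that points toward contention in a shared resource such as the allocator rather than the directory traversal itself.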
See also
- https://andygrove.io/2020/05/why-musl-extremely-slow/
- https://github.com/BurntSushi/ripgrep/issues/1268
Tested with the gnu version instead of musl, and verified that this is specific to musl.
- On a VPS without SSD, with 24 cores/96 hyperthreads:
hyperfine 1.11.0
Benchmark #1: ./fd-v7.2.0-x86_64-unknown-linux-gnu/fd ".*camera_hal.*" ~/chromiumos/src
Time (mean ± σ): 2.439 s ± 0.096 s [User: 99.311 s, System: 109.548 s]
Range (min … max): 2.347 s … 2.679 s 10 runs
Benchmark #2: ./fd-v7.3.0-x86_64-unknown-linux-gnu/fd ".*camera_hal.*" ~/chromiumos/src
Time (mean ± σ): 2.947 s ± 0.065 s [User: 138.492 s, System: 49.916 s]
Range (min … max): 2.851 s … 3.046 s 10 runs
Summary
'./fd-v7.2.0-x86_64-unknown-linux-gnu/fd ".*camera_hal.*" ~/chromiumos/src' ran
1.21 ± 0.05 times faster than './fd-v7.3.0-x86_64-unknown-linux-gnu/fd ".*camera_hal.*" ~/chromiumos/src'
There's still a slowdown of ~1.2x, which is probably caused by Rust no longer using jemalloc by default, as you said, with jemalloc being faster than glibc malloc in this use case?
I think this is covered by #710 anyway, so feel free to close this as a duplicate.
Thank you for reporting this anyway!
See also: https://dev.to/sharkdp/an-unexpected-performance-regression-11ai
Back then, the performance regression was between 7.0 and 7.1, so that doesn't quite fit with your results. You can easily check if a particular fd executable uses jemalloc by doing something like
strings <fd-executable> | grep jemalloc
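To run the same check over several downloaded binaries at once, something like the following could be used. It is only a sketch: uses_jemalloc and report_allocators are hypothetical helper names, and it greps the binary directly instead of piping through strings, which should give the same yes/no answer for a plain ASCII pattern.

```shell
# Succeed if the given file contains the byte string "jemalloc".
# grep -q stops at the first match and also works on binary input.
uses_jemalloc() {
    LC_ALL=C grep -q -- jemalloc "$1"
}

# Print one line per executable saying whether "jemalloc" was found.
report_allocators() {
    for bin in "$@"; do
        if uses_jemalloc "$bin"; then
            echo "jemalloc      $bin"
        else
            echo "no jemalloc   $bin"
        fi
    done
}

# Example (not run automatically), assuming the release tarballs
# were unpacked into the current directory:
# report_allocators ./fd-v*-x86_64-unknown-linux-*/fd
```

A match only shows that jemalloc symbols or strings are present in the file, so treat it as a quick heuristic rather than proof of which allocator is active at runtime.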
Did a quick grep on binaries downloaded from https://github.com/sharkdp/fd/releases:
Using jemalloc:
- fd-v7.2.0-x86_64-unknown-linux-musl
- fd-v7.2.0-x86_64-unknown-linux-gnu
- fd-v7.4.0-x86_64-unknown-linux-gnu
- fd-v8.3.2-x86_64-unknown-linux-gnu
Not using jemalloc:
- fd-v7.3.0-x86_64-unknown-linux-musl
- fd-v7.4.0-x86_64-unknown-linux-musl
- fd-v8.0.0-x86_64-unknown-linux-musl
- fd-v8.2.1-x86_64-unknown-linux-musl
- fd-v8.3.2-x86_64-unknown-linux-musl
- fd-v7.3.0-x86_64-unknown-linux-gnu
It looks like the patch to use jemalloc in 7.4.0 is not applied to the musl build (which is also stated in the 7.4.0 release notes).
Also tried building musl + jemalloc on the master branch (c577b0838b2e) with cross build --target=x86_64-unknown-linux-musl (https://github.com/gnzlbg/jemallocator/issues/124#issuecomment-486561511), and the performance is much better than the non-jemalloc version:
Benchmark #1: ~/temp/fd-musl-no-jemalloc ".*camera_hal.*" ~/chromiumos/src
Time (mean ± σ): 18.901 s ± 0.281 s [User: 166.882 s, System: 1532.500 s]
Range (min … max): 18.467 s … 19.252 s 10 runs
Benchmark #2: ~/temp/fd-musl-jemalloc ".*camera_hal.*" ~/chromiumos/src
Time (mean ± σ): 4.614 s ± 0.570 s [User: 26.295 s, System: 361.069 s]
Range (min … max): 3.435 s … 5.445 s 10 runs
Summary
'~/temp/fd-musl-jemalloc ".*camera_hal.*" ~/chromiumos/src' ran
4.10 ± 0.51 times faster than '~/temp/fd-musl-no-jemalloc ".*camera_hal.*" ~/chromiumos/src'
So it might be worthwhile to enable jemalloc for the musl build too. (From a quick glance at the GitHub Action, the musl version is already built with cross, so there shouldn't be any build issue.)
It's still slower than 7.2.0, but that's likely #599.