3x~10x Performance regression between 7.2.0 and >7.3.0 on large folder

Open peter50216 opened this issue 2 years ago • 5 comments

I noticed that some fd commands run much slower (up to 10x) after I upgraded my local fd from 6.2.0 to the newest release, 8.3.2, so I did a quick version bisect.

The regression appears between 7.2.0 and 7.3.0. Every later version I tested (7.4.0, 7.5.0, 8.0.0, 8.1.1, 8.3.2) runs at about the same speed as 7.3.0.

Reproduce script:

set -e

wget -q https://github.com/sharkdp/fd/releases/download/v7.2.0/fd-v7.2.0-x86_64-unknown-linux-musl.tar.gz
tar -xf fd-v7.2.0-x86_64-unknown-linux-musl.tar.gz

wget -q https://github.com/sharkdp/fd/releases/download/v7.3.0/fd-v7.3.0-x86_64-unknown-linux-musl.tar.gz
tar -xf fd-v7.3.0-x86_64-unknown-linux-musl.tar.gz

hyperfine --version
hyperfine \
  --warmup 5 \
  './fd-v7.2.0-x86_64-unknown-linux-musl/fd ".*camera_hal.*" ~/chromiumos/src' \
  './fd-v7.3.0-x86_64-unknown-linux-musl/fd ".*camera_hal.*" ~/chromiumos/src'

(I'm using the Chrome OS source tree as an example here, but I can reproduce a similar regression on other large source trees, for example the Linux source tree.)
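To repeat the same benchmark across more versions or libc targets, the download steps of the script above could be wrapped in a few helper functions. This is only a sketch: the function names (fd_release_name, fd_release_url, fetch_fd) are made up for illustration, and it assumes every release uses the same tarball naming as the two URLs in the script.

```shell
#!/usr/bin/env bash
# Sketch: helpers for fetching several fd releases for a version bisect.
# Assumption: releases are named
#   fd-v<version>-x86_64-unknown-linux-<libc>.tar.gz
# as in the two URLs used in the reproduce script.

# Print the tarball/directory name (without extension) for a version and libc.
fd_release_name() {
  local version="$1" libc="${2:-musl}"
  printf 'fd-v%s-x86_64-unknown-linux-%s\n' "$version" "$libc"
}

# Print the download URL for a version and libc (default: musl).
fd_release_url() {
  local version="$1" libc="${2:-musl}"
  printf 'https://github.com/sharkdp/fd/releases/download/v%s/%s.tar.gz\n' \
    "$version" "$(fd_release_name "$version" "$libc")"
}

# Download and unpack one release into the current directory.
fetch_fd() {
  local version="$1" libc="${2:-musl}"
  wget -q "$(fd_release_url "$version" "$libc")"
  tar -xf "$(fd_release_name "$version" "$libc").tar.gz"
}
```

For example, `for v in 7.2.0 7.3.0 7.4.0; do fetch_fd "$v" musl; done` would unpack three releases, and each `./$(fd_release_name "$v" musl)/fd` can then be passed to hyperfine as in the script above.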

Result:

  • On a VPS without SSD, with 24 cores/96 hyperthreads:
hyperfine 1.11.0
Benchmark #1: ./fd-v7.2.0-x86_64-unknown-linux-musl/fd ".*camera_hal.*" ~/chromiumos/src
  Time (mean ± σ):      2.468 s ±  0.058 s    [User: 90.829 s, System: 115.032 s]
  Range (min … max):    2.402 s …  2.555 s    10 runs
 
Benchmark #2: ./fd-v7.3.0-x86_64-unknown-linux-musl/fd ".*camera_hal.*" ~/chromiumos/src
  Time (mean ± σ):     25.529 s ±  0.328 s    [User: 222.856 s, System: 1924.844 s]
  Range (min … max):   24.980 s … 26.091 s    10 runs
 
Summary
  './fd-v7.2.0-x86_64-unknown-linux-musl/fd ".*camera_hal.*" ~/chromiumos/src' ran
   10.34 ± 0.28 times faster than './fd-v7.3.0-x86_64-unknown-linux-musl/fd ".*camera_hal.*" ~/chromiumos/src'
  • On my local laptop with SSD, with 4 cores/8 hyperthreads:
hyperfine 1.13.0
Benchmark 1: ./fd-v7.2.0-x86_64-unknown-linux-musl/fd ".*camera_hal.*" ~/chromiumos/src
  Time (mean ± σ):      2.348 s ±  0.101 s    [User: 10.347 s, System: 6.298 s]
  Range (min … max):    2.237 s …  2.527 s    10 runs
 
Benchmark 2: ./fd-v7.3.0-x86_64-unknown-linux-musl/fd ".*camera_hal.*" ~/chromiumos/src
  Time (mean ± σ):      6.882 s ±  0.090 s    [User: 44.010 s, System: 6.813 s]
  Range (min … max):    6.783 s …  7.065 s    10 runs

Summary
  './fd-v7.2.0-x86_64-unknown-linux-musl/fd ".*camera_hal.*" ~/chromiumos/src' ran
    2.93 ± 0.13 times faster than './fd-v7.3.0-x86_64-unknown-linux-musl/fd ".*camera_hal.*" ~/chromiumos/src'

I also tried adding --color=never, and the results are similar. From the changelog, the only other suspect is the --exec-batch command?

Happy to provide additional testing / debug info if needed.

peter50216 avatar Mar 08 '22 04:03 peter50216

I can reproduce that here, but with -j1 the performance is the same. I think this is https://github.com/sharkdp/fd/issues/710, and the cause is just the musl version being upgraded as a result of Rust being updated. Or maybe this is around when Rust stopped using jemalloc by default.

See also

  • https://andygrove.io/2020/05/why-musl-extremely-slow/
  • https://github.com/BurntSushi/ripgrep/issues/1268
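The single-threaded comparison mentioned above could be run roughly like this. It is a sketch that reuses the pattern and path from the original report; the helper names (fd_bench_cmd, run_thread_comparison) are hypothetical, and it assumes both release tarballs are already unpacked in the current directory.

```shell
#!/usr/bin/env bash
# Sketch: rerun the benchmark from the report with a fixed thread count.
# Assumptions: hyperfine is installed, and both fd release directories
# are unpacked in the current directory.

# Print the fd command line to benchmark for a release directory and
# a thread count (passed to fd via -j).
fd_bench_cmd() {
  local dir="$1" threads="$2"
  printf '%s/fd -j%s ".*camera_hal.*" ~/chromiumos/src\n' "./$dir" "$threads"
}

# Benchmark 7.2.0 against 7.3.0 (musl builds) with the given thread count.
run_thread_comparison() {
  local threads="$1"
  hyperfine --warmup 5 \
    "$(fd_bench_cmd fd-v7.2.0-x86_64-unknown-linux-musl "$threads")" \
    "$(fd_bench_cmd fd-v7.3.0-x86_64-unknown-linux-musl "$threads")"
}
```

Calling `run_thread_comparison 1` gives the -j1 case. Running it again with a higher count shows whether the gap grows with the number of threads.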

tavianator avatar Mar 08 '22 15:03 tavianator

Tested with the gnu version instead of musl, and verified that this is specific to musl.

  • On a VPS without SSD, with 24 cores/96 hyperthreads:
hyperfine 1.11.0
Benchmark #1: ./fd-v7.2.0-x86_64-unknown-linux-gnu/fd ".*camera_hal.*" ~/chromiumos/src
  Time (mean ± σ):      2.439 s ±  0.096 s    [User: 99.311 s, System: 109.548 s]
  Range (min … max):    2.347 s …  2.679 s    10 runs
 
Benchmark #2: ./fd-v7.3.0-x86_64-unknown-linux-gnu/fd ".*camera_hal.*" ~/chromiumos/src
  Time (mean ± σ):      2.947 s ±  0.065 s    [User: 138.492 s, System: 49.916 s]
  Range (min … max):    2.851 s …  3.046 s    10 runs
 
Summary
  './fd-v7.2.0-x86_64-unknown-linux-gnu/fd ".*camera_hal.*" ~/chromiumos/src' ran
    1.21 ± 0.05 times faster than './fd-v7.3.0-x86_64-unknown-linux-gnu/fd ".*camera_hal.*" ~/chromiumos/src'

There's still a slowdown of ~1.2x, which is probably caused by Rust no longer using jemalloc by default, as you said, with jemalloc being faster than glibc malloc in this use case?

I think this is covered by #710 anyway, so feel free to close this as a duplicate.

peter50216 avatar Mar 09 '22 03:03 peter50216

Thank you for reporting this anyway!

See also: https://dev.to/sharkdp/an-unexpected-performance-regression-11ai

Back then, the performance regression was between 7.0 and 7.1, so that doesn't quite fit with your results. You can easily check if a particular fd executable uses jemalloc by doing something like

strings <fd-executable> | grep jemalloc

sharkdp avatar Mar 09 '22 07:03 sharkdp

Did a quick grep on binaries downloaded from https://github.com/sharkdp/fd/releases:

Using jemalloc:

  • fd-v7.2.0-x86_64-unknown-linux-musl
  • fd-v7.2.0-x86_64-unknown-linux-gnu
  • fd-v7.4.0-x86_64-unknown-linux-gnu
  • fd-v8.3.2-x86_64-unknown-linux-gnu

Not using jemalloc:

  • fd-v7.3.0-x86_64-unknown-linux-musl
  • fd-v7.4.0-x86_64-unknown-linux-musl
  • fd-v8.0.0-x86_64-unknown-linux-musl
  • fd-v8.2.1-x86_64-unknown-linux-musl
  • fd-v8.3.2-x86_64-unknown-linux-musl
  • fd-v7.3.0-x86_64-unknown-linux-gnu

Looks like the patch to use jemalloc in 7.4.0 is not applied to the musl build (which is also stated in the 7.4.0 release notes).
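A check like the one above could be scripted over many binaries at once. This sketch uses roughly the same heuristic as the strings | grep command suggested earlier in the thread (does the file contain the text "jemalloc"), but with grep -a so that the strings tool is not needed. The function names are made up for illustration, and a match is only a hint that jemalloc is linked in, not proof.

```shell
#!/usr/bin/env bash
# Sketch: report which binaries contain the string "jemalloc".
# Heuristic only: a match suggests, but does not prove, that the
# binary was built with jemalloc.

# Succeed if the file contains the byte sequence "jemalloc".
# -a makes grep treat the binary as text; -q suppresses output.
uses_jemalloc() {
  grep -a -q jemalloc "$1"
}

# Print one line per binary, labeled by whether the string was found.
report_jemalloc() {
  local bin
  for bin in "$@"; do
    if uses_jemalloc "$bin"; then
      printf 'jemalloc      %s\n' "$bin"
    else
      printf 'no jemalloc   %s\n' "$bin"
    fi
  done
}
```

With the release tarballs unpacked side by side, `report_jemalloc fd-v*/fd` would produce a listing like the two groups above.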

peter50216 avatar Mar 09 '22 08:03 peter50216

Also tried building musl + jemalloc on the master branch (c577b0838b2e) with cross build --target=x86_64-unknown-linux-musl (https://github.com/gnzlbg/jemallocator/issues/124#issuecomment-486561511); the performance is much better than the non-jemalloc version:

Benchmark #1: ~/temp/fd-musl-no-jemalloc ".*camera_hal.*" ~/chromiumos/src
  Time (mean ± σ):     18.901 s ±  0.281 s    [User: 166.882 s, System: 1532.500 s]
  Range (min … max):   18.467 s … 19.252 s    10 runs
 
Benchmark #2: ~/temp/fd-musl-jemalloc ".*camera_hal.*" ~/chromiumos/src
  Time (mean ± σ):      4.614 s ±  0.570 s    [User: 26.295 s, System: 361.069 s]
  Range (min … max):    3.435 s …  5.445 s    10 runs
 
Summary
  '~/temp/fd-musl-jemalloc ".*camera_hal.*" ~/chromiumos/src' ran
    4.10 ± 0.51 times faster than '~/temp/fd-musl-no-jemalloc ".*camera_hal.*" ~/chromiumos/src'

So it might be worthwhile to enable jemalloc for the musl build too. (From a quick glance at the GitHub Actions workflow, the musl version is already built with cross, so there shouldn't be any build issue.)

It's still slower than 7.2.0, but that's likely #599.

peter50216 avatar Mar 09 '22 08:03 peter50216