
fd is much slower when run with multiple threads

Open aDotInTheVoid opened this issue 1 year ago • 26 comments

When not using -j1, fd takes thousands of times longer.

$ git clone https://github.com/sharkdp/fd.git
$ cd fd/
$ hyperfine -w 1 "fd" "fd -j1" -N
Benchmark 1: fd
  Time (mean ± σ):      3.601 s ±  1.014 s    [User: 0.008 s, System: 0.001 s]
  Range (min … max):    3.280 s …  6.487 s    10 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: fd -j1
  Time (mean ± σ):       3.0 ms ±   0.4 ms    [User: 2.3 ms, System: 0.0 ms]
  Range (min … max):     2.4 ms …   8.3 ms    792 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  'fd -j1' ran
 1212.48 ± 372.27 times faster than 'fd'
$ uname -a
Linux Ashtabula 5.10.102.1-microsoft-standard-WSL2 #1 SMP Wed Mar 2 00:30:59 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

aDotInTheVoid avatar Oct 08 '22 23:10 aDotInTheVoid

Did you build fd yourself or install a pre-built version?

tavianator avatar Oct 10 '22 15:10 tavianator

Installed from source with cargo

$ which fd
/home/nixon/.cargo/bin/fd
$ fd --version
fd 8.4.0

aDotInTheVoid avatar Oct 10 '22 23:10 aDotInTheVoid

Did you build it with the release profile?

tmccombs avatar Oct 11 '22 04:10 tmccombs

I got similar results. I downloaded fd from the Void repositories, which I believe distribute release-profile builds.

  fd >>> hyperfine -w 5 --prepare 'echo 3 | sudo tee /proc/sys/vm/drop_caches' "fd" "fd -j1" -N
Benchmark 1: fd
  Time (mean ± σ):       4.9 ms ±   0.7 ms    [User: 2.0 ms, System: 3.7 ms]
  Range (min … max):     4.1 ms …   7.0 ms    416 runs
 
Benchmark 2: fd -j1
  Time (mean ± σ):       3.7 ms ±   0.2 ms    [User: 1.7 ms, System: 2.2 ms]
  Range (min … max):     3.3 ms …   5.0 ms    629 runs
 
Summary
  'fd -j1' ran
    1.33 ± 0.20 times faster than 'fd'

Animeshz avatar Oct 13 '22 10:10 Animeshz

Did you build it with the release profile?

Yes, cargo install builds with the release profile by default

Full Log
$ cargo install -f fd-find
    Updating crates.io index
  Installing fd-find v8.4.0
   Compiling libc v0.2.135
   Compiling autocfg v1.1.0
   Compiling cfg-if v1.0.0
   Compiling bitflags v1.3.2
   Compiling memchr v2.5.0
   Compiling io-lifetimes v0.7.3
   Compiling log v0.4.17
   Compiling os_str_bytes v6.3.0
   Compiling rustix v0.35.11
   Compiling hashbrown v0.12.3
   Compiling once_cell v1.15.0
   Compiling linux-raw-sys v0.0.46
   Compiling cc v1.0.73
   Compiling termcolor v1.1.3
   Compiling textwrap v0.15.1
   Compiling strsim v0.10.0
   Compiling fs_extra v1.2.0
   Compiling regex-syntax v0.6.27
   Compiling crossbeam-utils v0.8.12
   Compiling version_check v0.9.4
   Compiling lazy_static v1.4.0
   Compiling fnv v1.0.7
   Compiling anyhow v1.0.65
   Compiling same-file v1.0.6
   Compiling ansi_term v0.12.1
   Compiling iana-time-zone v0.1.51
   Compiling humantime v2.1.0
   Compiling normpath v0.3.2
   Compiling clap_lex v0.2.4
   Compiling indexmap v1.9.1
   Compiling num-traits v0.2.15
   Compiling num-integer v0.1.45
   Compiling thread_local v1.1.4
   Compiling walkdir v2.3.2
   Compiling lscolors v0.10.0
   Compiling jemalloc-sys v0.5.2+5.3.0-patched
   Compiling aho-corasick v0.7.19
   Compiling bstr v0.2.17
   Compiling atty v0.2.14
   Compiling clap v3.2.22
   Compiling regex v1.6.0
   Compiling nix v0.25.0
   Compiling nix v0.24.2
   Compiling time v0.1.44
   Compiling dirs-sys-next v0.1.2
   Compiling users v0.11.0
   Compiling num_cpus v1.13.1
   Compiling dirs-next v2.0.0
   Compiling argmax v0.3.1
   Compiling globset v0.4.9
   Compiling chrono v0.4.22
   Compiling ctrlc v3.2.3
   Compiling ignore v0.4.18
   Compiling terminal_size v0.2.1
   Compiling clap_complete v3.2.5
   Compiling fd-find v8.4.0
   Compiling jemallocator v0.5.0
    Finished release [optimized] target(s) in 5m 46s
   Replacing /home/nixon/.cargo/bin/fd
    Replaced package `fd-find v8.4.0` with `fd-find v8.4.0` (executable `fd`)
$ git clone https://github.com/sharkdp/fd.git
Cloning into 'fd'...
remote: Enumerating objects: 5159, done.
remote: Counting objects: 100% (20/20), done.
remote: Compressing objects: 100% (18/18), done.
remote: Total 5159 (delta 6), reused 12 (delta 2), pack-reused 5139
Receiving objects: 100% (5159/5159), 1.43 MiB | 2.26 MiB/s, done.
Resolving deltas: 100% (3426/3426), done.
$ cd fd
$ hyperfine -w 1 "fd" "fd -j1" -N
Benchmark 1: fd
  Time (mean ± σ):      2.057 s ±  0.956 s    [User: 0.014 s, System: 0.000 s]
  Range (min … max):    0.741 s …  3.739 s    10 runs

Benchmark 2: fd -j1
  Time (mean ± σ):       6.8 ms ±   1.1 ms    [User: 5.5 ms, System: 0.0 ms]
  Range (min … max):     4.4 ms …  16.9 ms    409 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  'fd -j1' ran
  303.82 ± 149.51 times faster than 'fd'

I got similar results,

No, you haven't. Yours is only 1.3x slower, whereas for some reason for me it's over 100x slower.

aDotInTheVoid avatar Oct 13 '22 16:10 aDotInTheVoid

I'm guessing this has something to do with WSL. Maybe futex() triggers a hypercall or something? Can you paste the output of

$ strace -cf fd >/dev/null
$ strace -cf fd -j1 >/dev/null

tavianator avatar Oct 25 '22 23:10 tavianator

Also what filesystem is this running in?

tavianator avatar Oct 25 '22 23:10 tavianator

$ strace -cf fd -j1 >/dev/null
strace: Process 23061 attached
strace: Process 23062 attached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
  0.00    0.000000           0        15           read
  0.00    0.000000           0         1           write
  0.00    0.000000           0         2           open
  0.00    0.000000           0        19           close
  0.00    0.000000           0         1           poll
  0.00    0.000000           0         1           lseek
  0.00    0.000000           0        35           mmap
  0.00    0.000000           0        12           mprotect
  0.00    0.000000           0        15           munmap
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         6           rt_sigaction
  0.00    0.000000           0         9           rt_sigprocmask
  0.00    0.000000           0         1         1 ioctl
  0.00    0.000000           0         4           pread64
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         9           madvise
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         2           getcwd
  0.00    0.000000           0         1         1 readlink
  0.00    0.000000           0         9           sigaltstack
  0.00    0.000000           0         2         1 arch_prctl
  0.00    0.000000           0        40           futex
  0.00    0.000000           0         4           sched_getaffinity
  0.00    0.000000           0        18           getdents64
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0        18         1 openat
  0.00    0.000000           0        14           newfstatat
  0.00    0.000000           0         3           set_robust_list
  0.00    0.000000           0         2           prlimit64
  0.00    0.000000           0         3           getrandom
  0.00    0.000000           0        75        69 statx
  0.00    0.000000           0         3           rseq
  0.00    0.000000           0         2           clone3
------ ----------- ----------- --------- --------- ------------------
100.00    0.000000           0       332        74 total
$ strace -cf fd  >/dev/null
strace: Process 24756 attached
strace: Process 24757 attached
strace: Process 24758 attached
strace: Process 24759 attached
strace: Process 24760 attached
strace: Process 24761 attached
strace: Process 24762 attached
strace: Process 24763 attached
strace: Process 24764 attached
strace: Process 24765 attached
strace: Process 24766 attached
strace: Process 24767 attached
strace: Process 24768 attached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
  0.00    0.000000           0        19           read
  0.00    0.000000           0         1           write
  0.00    0.000000           0         2           open
  0.00    0.000000           0        22           close
  0.00    0.000000           0         1           poll
  0.00    0.000000           0         2           lseek
  0.00    0.000000           0        69           mmap
  0.00    0.000000           0        45           mprotect
  0.00    0.000000           0        36           munmap
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         6           rt_sigaction
  0.00    0.000000           0        53           rt_sigprocmask
  0.00    0.000000           0         1         1 ioctl
  0.00    0.000000           0         4           pread64
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0        20           madvise
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         2           getcwd
  0.00    0.000000           0         1         1 readlink
  0.00    0.000000           0        42           sigaltstack
  0.00    0.000000           0         2         1 arch_prctl
  0.00    0.000000           0        40         1 futex
  0.00    0.000000           0        17           sched_getaffinity
  0.00    0.000000           0        18           getdents64
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0        12           clock_nanosleep
  0.00    0.000000           0        21         1 openat
  0.00    0.000000           0        14           newfstatat
  0.00    0.000000           0        14           set_robust_list
  0.00    0.000000           0         2           prlimit64
  0.00    0.000000           0         3           getrandom
  0.00    0.000000           0        76        69 statx
  0.00    0.000000           0        14           rseq
  0.00    0.000000           0        13           clone3
------ ----------- ----------- --------- --------- ------------------
100.00    0.000000           0       578        75 total

Filesystem is ext4

nixon@Ashtabula:~/tmp/fd$ pwd
/home/nixon/tmp/fd
nixon@Ashtabula:~/tmp/fd$ df -T
Filesystem     Type    1K-blocks      Used Available Use% Mounted on
/dev/sdc       ext4    263174212  66091960 183644096  27% /
none           tmpfs     3890260       128   3890132   1% /mnt/wslg
none           tmpfs     3890260         4   3890256   1% /mnt/wsl
tools          9p      499221500 477214304  22007196  96% /init
none           tmpfs     3890260         4   3890256   1% /run
none           tmpfs     3890260         0   3890260   0% /run/lock
none           tmpfs     3890260         0   3890260   0% /run/shm
none           tmpfs     3890260         0   3890260   0% /run/user
tmpfs          tmpfs     3890260         0   3890260   0% /sys/fs/cgroup
drivers        9p      499221500 477214304  22007196  96% /usr/lib/wsl/drivers
lib            9p      499221500 477214304  22007196  96% /usr/lib/wsl/lib
none           overlay   3890260      2168   3888092   1% /mnt/wslg/versions.txt
none           overlay   3890260      2168   3888092   1% /mnt/wslg/doc
drvfs          9p      499221500 477214304  22007196  96% /mnt/c

aDotInTheVoid avatar Oct 26 '22 08:10 aDotInTheVoid

@aDotInTheVoid Please use perf for performance recordings as strace has significant overhead.

WSL2 has severe and known performance issues, for example this one is specific for the filesystem: https://github.com/microsoft/WSL/issues/4197

Only threading alone has up to 5x performance penalties https://github.com/dotnet/runtime/issues/42994

Moreover, WSL2 is a full VM, so you will never get performance close to a native Linux kernel: https://learn.microsoft.com/en-us/windows/wsl/compare-versions. For that, something like KVM would be needed, but such hypervisors are only available as proprietary products on Windows.

matu3ba avatar Nov 11 '22 15:11 matu3ba

WSL2 has severe and known performance issues, for example this one is specific for the filesystem: microsoft/WSL#4197

That's why I asked what filesystem was being used. Accessing Windows files over 9p is slow in WSL2, but the OP is accessing Linux files in an ext4 filesystem. Since this is just the regular Linux ext4 implementation, it should be just about as fast as native Linux (except for the actual I/O).

Only threading alone has up to 5x performance penalties dotnet/runtime#42994

That looks potentially relevant. Thread-local storage performs poorly on WSL2 for some reason.

Moreover, WSL2 is a full VM, so you will never get performance close to a native Linux kernel: https://learn.microsoft.com/en-us/windows/wsl/compare-versions. For that, something like KVM would be needed, but such hypervisors are only available as proprietary products on Windows.

WSL2 uses the "Virtual Machine Platform", a subset of Hyper-V which is "something like KVM". It should be close to native performance for things that don't need to cross the hypervisor boundary often.

tavianator avatar Nov 11 '22 15:11 tavianator

@aDotInTheVoid Please use perf for performance recordings as strace has significant overhead.

Indeed. I asked for strace results just to get a quick idea if anything stood out. What stands out is that strace seems to be completely useless on WSL2, as all times are 0. Perhaps there is not a high-enough resolution clock? (See https://github.com/microsoft/WSL/issues/77, https://github.com/microsoft/WSL/issues/6029.)

If you want to try perf, start with

$ perf trace record fd >/dev/null
$ perf trace -i perf.data -s

(and same with -j1). But since the execution time is charged to neither the user nor the kernel (1.014 s [User: 0.008 s, System: 0.001 s]), I'm not sure we'll see anything.

Actually now that I think about it, timer resolution might be the issue. Perhaps short sleeps are becoming very long due to imprecision.
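
The timer-resolution hypothesis can be checked independently of fd with a short standalone measurement (my own sketch, not part of fd): request a 1 ms sleep and see how far the actual sleep overshoots. On a kernel with coarse timers, the overshoot can dwarf the requested duration, which would turn the short sleeps in a thread pool's idle loop into long stalls.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

// Measure how much a short sleep overshoots its requested duration.
// A large overshoot suggests coarse timer resolution on the host.
fn sleep_overshoot(requested: Duration) -> Duration {
    let start = Instant::now();
    sleep(requested);
    // elapsed() is guaranteed to be >= requested, so this never underflows.
    start.elapsed().saturating_sub(requested)
}

fn main() {
    let overshoot = sleep_overshoot(Duration::from_millis(1));
    println!("requested 1 ms, overshot by {:?}", overshoot);
}
```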

tavianator avatar Nov 11 '22 16:11 tavianator

I tried to find a simple way to profile sleep times. One way is offcputime from bcc-tools:

# /usr/share/bcc/tools/offcputime -f | grep '^fd' >fd.log &
# fd >/dev/null
# pkill -INT offcputime

(and repeat for -j1).

tavianator avatar Nov 11 '22 19:11 tavianator

I'm hitting this issue as well on FreeBSD 13.1:

hyperfine -w 1 "fd" "fd -j1" -N
Benchmark 1: fd
  Time (mean ± σ):     115.7 ms ±  10.4 ms    [User: 38.6 ms, System: 127.0 ms]
  Range (min … max):    99.5 ms … 131.4 ms    28 runs
 
Benchmark 2: fd -j1
  Time (mean ± σ):      31.1 ms ±   3.8 ms    [User: 15.1 ms, System: 20.9 ms]
  Range (min … max):    24.9 ms …  41.9 ms    103 runs
 
Summary
  'fd -j1' ran
    3.72 ± 0.56 times faster than 'fd'

yonas avatar Nov 26 '22 17:11 yonas

I believe the bug I was about to report is actually this bug.

I use fd in some test suites and I noticed a slowdown. I investigated and I found that simply running fd with no arguments in an empty directory on /tmp (which is in-memory!) was taking 0.2 seconds. This is ridiculously slow on my machine (Ryzen Threadripper 3990x with 64 cores/128 threads running on NVMe). Multiplied by the number of times I was calling fd in my suite, this immediately explained why multiple seconds had been added at some point.

When I add the -j1 argument, as someone else indicated, the runtime of a bare fd invocation drops to 0.01 seconds, which is 20x faster and a far more reasonable expectation.

I'm on NixOS unstable running on ZFS root.

@matu3ba & @tavianator, this is NOT a filesystem problem (IMHO). This looks like a bug in the heuristic used when the number of concurrent jobs is not specified (per the man page, not specifying it "uses heuristics").

At least I now know a workaround that works!

pmarreck avatar Dec 03 '22 16:12 pmarreck

@pmarreck I'm pretty sure that's not the same bug. I believe this one is specific to WSL. However, I do think limiting the number of threads by default makes sense. I have the 24 core/48 thread Threadripper and fd in /tmp/empty takes 0.045s, but fd -j128 takes 0.104s. There's likely not much benefit past something like 8 threads anyway.

tavianator avatar Dec 03 '22 17:12 tavianator

right, but isn't there something wrong with the "heuristics" mentioned in the manpages for not specifying the number of jobs, if simply not specifying how many threads you want it to use results in worse or even worst-case performance on a given CPU architecture? I don't even know if it's a configuration option, because I'd hate to have to specify it for literally every search request. (In my case it's not terribly inconvenient since I already use a wrapper function around it, but still.) Whatever "heuristics" are used should be torn out and just default to something like "half the number of cores with a max of 6/8" because after that you're probably bottlenecked on either disk I/O or the setup/teardown of the threads, anyway...

The reason why I think it's related is simply because the given solution (specifying -j1) also solves my problem

pmarreck avatar Dec 03 '22 21:12 pmarreck

Whatever "heuristics" are used should be torn out and just default to something like "half the number of cores with a max of 6/8" because after that you're probably bottlenecked on either disk I/O or the setup/teardown of the threads, anyway...

  1. Please make a proper issue with accurate data; "just default to something" and "you're probably" are not a good starting point for a metric.
  2. Please provide a script people can run on their PC to collect comparable data.

bottlenecked on either disk I/O or the setup/teardown of the threads

A flamegraph would be the perfect tool to analyze this: https://www.brendangregg.com/offcpuanalysis.html

In the meantime: since fd provides no shell completions anyway, creating an alias should work around the problem for the time being.

matu3ba avatar Dec 03 '22 21:12 matu3ba

Whatever "heuristics" are used should be torn out and just default to something like "half the number of cores with a max of 6/8"

The current "heuristics" is just to use the number of CPU cores, as returned by num_cpus::get. Maybe it would make more sense to use get_physical? (which would return half the number of logical cores for your threadripper with hyper-threading).

I think it would make sense to have a maximum on that for the default number of threads, although I'm not sure what the best value of that would be.
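
One possible shape for such a capped default is sketched below. This is a hypothetical illustration, not fd's actual code, and it uses the standard library's std::thread::available_parallelism in place of the num_cpus crate; the cap value of 12 is an arbitrary placeholder.

```rust
use std::thread;

// Hypothetical default-thread heuristic: the detected logical CPU count,
// clamped to a fixed ceiling. Not fd's implementation.
fn default_thread_count(cap: usize) -> usize {
    let logical = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1); // fall back to 1 if detection fails
    logical.clamp(1, cap)
}

fn main() {
    // On a 128-thread machine this would yield 12 instead of 128.
    println!("default threads: {}", default_thread_count(12));
}
```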

tmccombs avatar Dec 04 '22 05:12 tmccombs

I think it would make sense to have a maximum on that for the default number of threads, although I'm not sure what the best value of that would be.

Yes. This article is relevant for that: https://www.codeguru.com/cplusplus/why-too-many-threads-hurts-performance-and-what-to-do-about-it/. For example, mold uses a statically linked Intel TBB for that, and here is an overall functionality overview.

Does anything like this exist for Rust? A: No, see also this reddit thread https://www.reddit.com/r/rust/comments/p0a3mf/async_scheduler_optimised_for_highcompute/. I doubt that the complexity of async is worth it.

matu3ba avatar Dec 04 '22 11:12 matu3ba

@matu3ba This script works over here:

#!/usr/bin/env bash

_fdtest() {
  local - # scant bash docs on this but this apparently automatically resets shellopts when the function exits
  set -o errexit
  local _testlocname=$(echo $RANDOM | md5sum | cut -c1-8)
  local _testloc="/tmp/$_testlocname"
  local cpu_count=$(awk '/^processor/{n+=1}END{print n}' /proc/cpuinfo)
  echo "Testing $_testloc with $cpu_count CPUs"
  # the point of [ 1 == 0 ] below is to fail the line and trigger errexit IF errexit is set
  mkdir -p $_testloc >/dev/null 2>&1 || ( echo "Cannot create test directory '$_testloc' in _fdtest: ${BASH_SOURCE[0]}:${BASH_LINENO[0]}"; [ 1 == 0 ] )
  touch $_testloc/$_testlocname
  pushd $_testloc >/dev/null
  echo
  echo -n "Without -j1 argument:"
  time for ((n=0;n<10;n++)); do fd $_testlocname >/dev/null; done
  echo
  echo -n "With -j1 argument:"
  time for ((n=0;n<10;n++)); do fd -j1 $_testlocname >/dev/null; done
  popd >/dev/null
  rm $_testloc/$_testlocname
  rm -d $_testloc
}

_fdtest

Output for me (after chmod +x ~/Documents/fdtest.sh):

❯ ~/Documents/fdtest.sh
Testing /tmp/5ee7987d with 128 CPUs

Without -j1 argument:
real    0m1.665s
user    0m0.164s
sys     0m1.560s

With -j1 argument:
real    0m0.038s
user    0m0.007s
sys     0m0.033s

It's about a 43x slowdown, at least with this number of detected CPUs. (I believe it's technically 64 cores and 128 threads, but anyway.)

pmarreck avatar Dec 05 '22 18:12 pmarreck

right, but isn't there something wrong with the "heuristics" mentioned in the manpages for not specifying the number of jobs, if simply not specifying how many threads you want it to use results in worse or even worst-case performance on a given CPU architecture? I don't even know if it's a configuration option, because I'd hate to have to specify it for literally every search request. (In my case it's not terribly inconvenient since I already use a wrapper function around it, but still.) Whatever "heuristics" are used should be torn out and just default to something like "half the number of cores with a max of 6/8" because after that you're probably bottlenecked on either disk I/O or the setup/teardown of the threads, anyway...

I would appreciate if we could calm down a bit :smile:. The current default was not chosen without reason. It's based on benchmarks on my machine (8 core, see disclaimer concerning benchmarks in the README: one particular benchmark on one particular machine). You can see some past benchmark results here or here. Or I can run one right now, on a different machine (12-core):

hyperfine \
    --parameter-scan threads 1 16 \
    --warmup 3 \
    --export-json results.json \
    "fd -j {threads}"

[image: hyperfine parameter-scan results, mean runtime vs. number of threads]

As you can tell, using N_threads=N_cores=12 is not a very bad heuristic in this case. ~~I think we even used N_threads = 3 × N_cores in the past, because that resulted in even better performance for either warm-cache or cold-cache searches (I don't remember). But then we settled on the current strategy as a good tradeoff between the two scenarios.~~ (no, that was in a different - but similar - project: https://github.com/sharkdp/diskus/issues/38#issuecomment-612772867)

But I admit: startup time is a different story. In an empty directory, it looks like this:

[image: startup-time benchmark in an empty directory, mean runtime vs. number of threads]

But if I have to choose, I would definitely lean towards making long(er) searches faster, instead of optimizing startup time... which is completely negligible unless you're running hundreds of searches inside tiny directories. But then you're probably using a script (where you can easily tune the number of --threads).

Now all that being said: if the current strategy shows unfavorable benchmark results on machines with N_cores ≫ 8, I'd be happy to implement something like min(N_cores, 12) as a default.

Also, we digress. As @tavianator pointed out, this ticket is about WSL. So maybe let's get back to that topic and open a new ticket to discuss a better default --threads strategy (with actual benchmark results).

sharkdp avatar Dec 05 '22 20:12 sharkdp

agreed on all. and sorry for peppering this ticket with what probably deserves its own ticket!

does an increasing thread count increase the startup time simply due to the cost of starting up and tearing down the threads? Also, wouldn't one hit an I/O bottleneck pretty quickly past N threads (where N is some low number, certainly well below 32)? I know my CPU is probably unusual, but the "lower" Threadrippers are probably not that uncommon (perhaps most notably, Linus Torvalds')
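
The spawn/teardown cost in that question is easy to measure in isolation. A minimal sketch (my own, not fd's code) that times spawning and immediately joining N no-op threads, which separates thread setup cost from any directory-walking or I/O work:

```rust
use std::thread;
use std::time::{Duration, Instant};

// Time how long it takes to spawn and join `n` threads that do nothing.
// This isolates thread setup/teardown cost from the actual search work.
fn spawn_join_cost(n: usize) -> Duration {
    let start = Instant::now();
    let handles: Vec<_> = (0..n).map(|_| thread::spawn(|| {})).collect();
    for handle in handles {
        handle.join().unwrap();
    }
    start.elapsed()
}

fn main() {
    for n in [1, 8, 64] {
        println!("{n:>3} threads: {:?}", spawn_join_cost(n));
    }
}
```

On most systems the cost grows roughly linearly with the thread count, which is one reason a large default hurts startup-dominated invocations like fd in an empty directory.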

pmarreck avatar Dec 05 '22 20:12 pmarreck

There has been some work recently that should have improved this.

tmccombs avatar Jan 08 '24 05:01 tmccombs

True, I think this can be closed for now. Please report back if this is still an issue.

sharkdp avatar Jan 16 '24 19:01 sharkdp

I'm not sure this is fixed. This particular report is WSL-specific. Someone should at least check fd 9.0 on WSL2 before we close it.

(I only have a Windows VM, so WSL2 would be nested virt so probably not a fair test.)

tavianator avatar Jan 16 '24 19:01 tavianator

@tavianator @sharkdp I was tracking down why mold was running slower than lld and default linker in wsl2 and found this. Nonetheless:

~
❯ fd --version
fdfind 9.0.0

~
❯ hyperfine -w 50 "fd" "fd -j1" -N
Benchmark 1: fd
  Time (mean ± σ):       5.8 ms ±   0.8 ms    [User: 12.0 ms, System: 3.3 ms]
  Range (min … max):     4.1 ms …   9.2 ms    480 runs

Benchmark 2: fd -j1
  Time (mean ± σ):       8.0 ms ±   1.6 ms    [User: 6.7 ms, System: 2.7 ms]
  Range (min … max):     4.8 ms …  16.0 ms    526 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  fd ran
    1.38 ± 0.33 times faster than fd -j1

It seems improved. It's a hassle to optimize for with the weird threading behavior, but it's very much appreciated, thank you guys for all the hard work.

and just for reference, on v8.7.1
~
❯ hyperfine -w 50 "./fdfind/fd-v8.7.1-x86_64-unknown-linux-gnu/fd" "./fdfind/fd-v8.7.1-x86_64-unknown-linux-gnu/fd -j1"
-N
Benchmark 1: ./fdfind/fd-v8.7.1-x86_64-unknown-linux-gnu/fd
  Time (mean ± σ):      24.4 ms ±   3.2 ms    [User: 6.7 ms, System: 29.4 ms]
  Range (min … max):    17.3 ms …  49.7 ms    114 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: ./fdfind/fd-v8.7.1-x86_64-unknown-linux-gnu/fd -j1
  Time (mean ± σ):       8.1 ms ±   1.5 ms    [User: 4.3 ms, System: 2.8 ms]
  Range (min … max):     5.3 ms …  13.0 ms    487 runs

Summary
  ./fdfind/fd-v8.7.1-x86_64-unknown-linux-gnu/fd -j1 ran
    3.03 ± 0.68 times faster than ./fdfind/fd-v8.7.1-x86_64-unknown-linux-gnu/fd

WanderLanz avatar Jan 30 '24 05:01 WanderLanz

@WanderLanz Thanks for re-testing this! Looks like it's fixed.

tavianator avatar Mar 31 '24 14:03 tavianator