fd fd performance decreases with increasing cores

On a fresh clone on the chromium repo, if I try to list files using fd, here's the result I get

$ time fd --color=never --hidden --type f | wc -l
410215
fd --color=never --hidden --type f  1.39s user 1.03s system 496% cpu 0.487 total

When I use rg for the same purpose

$ time rg --color=never --hidden --no-messages --files | wc -l
410215
rg --color=never --hidden --no-messages --files  1.86s user 0.67s system 737% cpu 0.343 total

Notice how the %cpu used in case of rg is higher than with fd. My CPU has 8 cores and the usage reported for rg is what I would expect if it is using all my cores. I suspected that fd might not be detecting the number of cores in my machine properly so I tried using the -j flag to specify the number of cores manually. Here are some interesting results

$ for threads in `seq 1 8`; do time fd -j "$threads" --color=never --hidden --type f | wc -l; done
$ for threads in `seq 1 8`; do time rg -j "$threads" --color=never --hidden --no-messages --files | wc -l; done

threads	fd CPU Utilization	rg CPU Utilization
1	133%	99%
2	256%	239%
3	371%	352%
4	485%	457%
5	548%	560%
6	577%	663%
7	498%	750%
8	519%	747%

The user time seems to be correlated to the CPU utilization. This also ends up in fd being slightly slower than rg on my machine. Is there an explanation for this behaviour of fd?

OS: Ubuntu 14.04 CPU: 8 x Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz RAM: 64G

$ fd --version
fd 8.1.1

$ rg --version
ripgrep 12.1.0
-SIMD -AVX (compiled)
+SIMD +AVX (runtime)

Jun 10 '20 17:06 lakshayg

Ok, a couple of things:

Where can I get the chromium repo to reproduce this? I cloned something from GitHub but I get a completely different file count.
You are running fd and rg without the --no-ignore options. This means that both tools will look for .rgignore or .fdignore files on your system. For me, the two file counts differed because I have a ~/.rgignore file. Even if the file counts are the same for you, that's something worth checking out to make the benchmark fair.
I can't tell if you are properly running your benchmarks. Measuring the execution time of I/O heavy programs is not trivial. (1) Results are highly dependent on whether or not the disk cache is warm. If you run program A first, then B, you will see better times for B because the cache is warm. Did you correct for that? (2) I/O heavy programs are influenced by all kinds of processes running in the background. Are you making sure that your system is quiet and the comparison is fair?

Notice how the %cpu used in case of rg is higher than with fd. My CPU has 8 cores and the usage reported for rg is what I would expect if it is using all my cores.

That might very well be a caching effect. fd is typically I/O-bound, not CPU bound.

Here are some interesting results

What exactly is interesting about these results? I don't see any error bars on these numbers, so I can't tell how representative they are. This could be just random fluctuations around j * 100%?

I have performed the following warm-cache benchmark with hyperfine:

hyperfine \
    --parameter-scan threads 1 10 \
    --warmup 3 \
    --export-json results-fd.json \
    --export-markdown results-fd.md \
    "fd -j {threads} --color=never --hidden --type f"

hyperfine \
    --parameter-scan threads 1 10 \
    --warmup 3 \
    --export-json results-rg.json \
    --export-markdown results-rg.md \
    "rg -j {threads} --color=never --hidden --no-messages --files"

# plot results:
hyperfine/scripts/plot_parametrized.py \
  results-fd.json \
  results-rg.json \
  --titles "fd,rg" \
  --parameter-name "number of threads"

This leads to the following results on my Laptop (8 x Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz, same rg and fd versions as you):

fd-vs-rg

Command	Mean [s]	Min [s]	Max [s]	Relative
`fd -j 1 --color=never --hidden --type f`	1.378 ± 0.009	1.364	1.390	3.48 ± 0.02
`fd -j 2 --color=never --hidden --type f`	0.750 ± 0.004	0.744	0.754	1.90 ± 0.01
`fd -j 3 --color=never --hidden --type f`	0.543 ± 0.012	0.519	0.563	1.37 ± 0.03
`fd -j 4 --color=never --hidden --type f`	0.462 ± 0.023	0.436	0.518	1.17 ± 0.06
`fd -j 5 --color=never --hidden --type f`	0.427 ± 0.005	0.421	0.437	1.08 ± 0.01
`fd -j 6 --color=never --hidden --type f`	0.412 ± 0.008	0.405	0.432	1.04 ± 0.02
`fd -j 7 --color=never --hidden --type f`	0.396 ± 0.001	0.394	0.398	1.00
`fd -j 8 --color=never --hidden --type f`	0.411 ± 0.010	0.399	0.432	1.04 ± 0.02
`fd -j 9 --color=never --hidden --type f`	0.418 ± 0.012	0.395	0.431	1.06 ± 0.03
`fd -j 10 --color=never --hidden --type f`	0.427 ± 0.019	0.404	0.472	1.08 ± 0.05

Command	Mean [s]	Min [s]	Max [s]	Relative
`rg -j 1 --color=never --hidden --no-messages --files`	1.456 ± 0.012	1.439	1.477	3.17 ± 0.03
`rg -j 2 --color=never --hidden --no-messages --files`	0.947 ± 0.005	0.937	0.956	2.06 ± 0.02
`rg -j 3 --color=never --hidden --no-messages --files`	0.696 ± 0.017	0.671	0.725	1.52 ± 0.04
`rg -j 4 --color=never --hidden --no-messages --files`	0.574 ± 0.021	0.546	0.619	1.25 ± 0.05
`rg -j 5 --color=never --hidden --no-messages --files`	0.525 ± 0.018	0.507	0.549	1.14 ± 0.04
`rg -j 6 --color=never --hidden --no-messages --files`	0.485 ± 0.005	0.480	0.496	1.06 ± 0.01
`rg -j 7 --color=never --hidden --no-messages --files`	0.460 ± 0.002	0.457	0.465	1.00
`rg -j 8 --color=never --hidden --no-messages --files`	0.460 ± 0.002	0.458	0.466	1.00 ± 0.01
`rg -j 9 --color=never --hidden --no-messages --files`	0.465 ± 0.008	0.459	0.485	1.01 ± 0.02
`rg -j 10 --color=never --hidden --no-messages --files`	0.473 ± 0.010	0.460	0.486	1.03 ± 0.02

fd is consistently 5%-20% faster. Both programs have a minimum execution time for j=7 threads, which could be explained by the fact that we have one receiver thread in addition. It collects all results and prints them. Given that this benchmark is about listing all files, a lot of printing needs to be done and this thread could be quite busy, making j=N-1 the optimal amount of worker threads.

I'm happy to investigate further if you can clearly show that something is not as fast as expected in your setting.

Jun 11 '20 09:06 sharkdp

Thanks for the quick response

Where can I get the chromium repo to reproduce this? I cloned something from GitHub but I get a completely different file count

Sorry, my bad, I just realized that I was running the commands outside the chromium repo so there were other files too. But re-running on the repo yields similar results.

You are running fd and rg without the --no-ignore options. This means that both tools will look for .rgignore or .fdignore files on your system. For me, the two file counts differed because I have a ~/.rgignore file. Even if the file counts are the same for you, that's something worth checking out to make the benchmark fair.

My rgignore and fdignore files are identical, they contain .git and .cache.

(1) Results are highly dependent on whether or not the disk cache is warm (2) I/O heavy programs are influenced by all kinds of processes running in the background. Are you making sure that your system is quiet and the comparison is fair

Agreed. I didn't know what what would be a good way to eliminate the impact of these factors so I ran both the commands several times (> 20) in different orders but yes, my approach wasn't very systematic. My system did have some browser windows open but there weren't any IO intensive processes running at that time. I don't know how to tell if the comparison was fair but I would expect it to be because I didn't start any other processes while I was testing the performance. All the measurements were taken at the same time.

That might very well be a caching effect. fd is typically I/O-bound, not CPU bound.

Wouldn't that be the case for rg too?

What exactly is interesting about these results?

I was referring to the observation that the cpu utilization was not monotonically increasing for fd.

I took a look at hyperfine, very cool! I used it to run some benchmarks and found something interesting.

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`fd -j 1 --color=never --hidden --type f`	994.2 ± 11.9	980.8	1014.8	3.23 ± 0.06
`fd -j 2 --color=never --hidden --type f`	615.3 ± 39.6	561.7	677.1	2.00 ± 0.13
`fd -j 3 --color=never --hidden --type f`	517.9 ± 5.6	509.9	526.8	1.68 ± 0.03
`fd -j 4 --color=never --hidden --type f`	421.7 ± 3.2	418.7	428.8	1.37 ± 0.02
`fd -j 5 --color=never --hidden --type f`	361.4 ± 5.1	352.4	367.4	1.17 ± 0.02
`fd -j 6 --color=never --hidden --type f`	327.4 ± 3.9	322.5	333.5	1.06 ± 0.02
`fd -j 7 --color=never --hidden --type f`	307.8 ± 4.4	302.2	313.9	1.00
`fd -j 8 --color=never --hidden --type f`	308.3 ± 3.1	302.8	314.0	1.00 ± 0.02

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`fd -j 1 --color=never --hidden --type f \| wc -l`	1.135 ± 0.079	1.045	1.246	2.97 ± 0.21
`fd -j 2 --color=never --hidden --type f \| wc -l`	0.723 ± 0.007	0.712	0.733	1.89 ± 0.03
`fd -j 3 --color=never --hidden --type f \| wc -l`	0.515 ± 0.004	0.510	0.521	1.35 ± 0.02
`fd -j 4 --color=never --hidden --type f \| wc -l`	0.413 ± 0.002	0.409	0.416	1.08 ± 0.02
`fd -j 5 --color=never --hidden --type f \| wc -l`	0.383 ± 0.005	0.374	0.390	1.00 ± 0.02
`fd -j 6 --color=never --hidden --type f \| wc -l`	0.382 ± 0.005	0.373	0.393	1.00
`fd -j 7 --color=never --hidden --type f \| wc -l`	0.470 ± 0.007	0.452	0.476	1.23 ± 0.03
`fd -j 8 --color=never --hidden --type f \| wc -l`	0.472 ± 0.011	0.459	0.489	1.23 ± 0.03

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`rg -j 1 --color=never --hidden --no-messages --files`	1.018 ± 0.006	1.008	1.027	3.11 ± 0.06
`rg -j 2 --color=never --hidden --no-messages --files`	0.723 ± 0.033	0.684	0.794	2.21 ± 0.11
`rg -j 3 --color=never --hidden --no-messages --files`	0.618 ± 0.006	0.607	0.623	1.89 ± 0.04
`rg -j 4 --color=never --hidden --no-messages --files`	0.489 ± 0.004	0.484	0.496	1.50 ± 0.03
`rg -j 5 --color=never --hidden --no-messages --files`	0.410 ± 0.003	0.403	0.415	1.25 ± 0.03
`rg -j 6 --color=never --hidden --no-messages --files`	0.360 ± 0.003	0.353	0.363	1.10 ± 0.02
`rg -j 7 --color=never --hidden --no-messages --files`	0.327 ± 0.006	0.315	0.336	1.00
`rg -j 8 --color=never --hidden --no-messages --files`	0.331 ± 0.004	0.326	0.337	1.01 ± 0.02

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`rg -j 1 --color=never --hidden --no-messages --files \| wc -l`	1.028 ± 0.004	1.023	1.034	3.10 ± 0.06
`rg -j 2 --color=never --hidden --no-messages --files \| wc -l`	0.897 ± 0.016	0.871	0.928	2.71 ± 0.07
`rg -j 3 --color=never --hidden --no-messages --files \| wc -l`	0.632 ± 0.008	0.618	0.644	1.91 ± 0.04
`rg -j 4 --color=never --hidden --no-messages --files \| wc -l`	0.490 ± 0.004	0.482	0.495	1.48 ± 0.03
`rg -j 5 --color=never --hidden --no-messages --files \| wc -l`	0.411 ± 0.003	0.406	0.414	1.24 ± 0.03
`rg -j 6 --color=never --hidden --no-messages --files \| wc -l`	0.363 ± 0.005	0.351	0.369	1.10 ± 0.03
`rg -j 7 --color=never --hidden --no-messages --files \| wc -l`	0.331 ± 0.007	0.318	0.339	1.00
`rg -j 8 --color=never --hidden --no-messages --files \| wc -l`	0.332 ± 0.007	0.318	0.340	1.00 ± 0.03

fd becomes slower when piping out the results to wc but rg seems to be unaffected by that. Without wc, fd performs slightly better than rg.

I can try to do more systematic benchmarks if you can guide me on how to do that

Jun 12 '20 04:06 lakshayg

That is interesting indeed!

I can reproduce this: fd-vs-rg-with-wc

I have a feeling this might have something to do with the way we write output to the terminal (buffered vs unbuffered). We should definitely take a look at that. Thank you for reporting this and for the detailed analysis!

Wouldn't that be the case for rg too?

yes.

Jun 12 '20 05:06 sharkdp

In a vacuum, it looks like wrapping stdout in a BufWriter can make a huge difference. I wrote this test program that prints "Hello World" a bunch.

use std::io::{self, stdout, Write, BufWriter};

fn main() -> io::Result<()> {
    let args: Vec::<_> = std::env::args().collect();
    let mut output: Box<dyn Write> = match args[1].as_str() {
        "b" => Box::new(BufWriter::new(stdout())),
        "u" => Box::new(stdout()),
        _ => panic!("invalid arg"),
    };
    let count: usize = args[2].parse().unwrap();

    let message = b"Hello World\n";
    for _ in 0..count {
        output.write(message)?;
    }

    Ok(())
}

According to a couple hyperfine benchmarks, the buffered version is about 50 times faster than the unbuffered version. I'm not sure why adding extra threads would slow things down, perhaps there's a lot of lock contention happening.

It looks like fd currently prints using the object returned by io::stdout().lock() without any extra buffering added.

That could be changed to a Box<dyn Write> trait object that's set up based on whether stdout is a TTY, or fd could support a --count option which prints the number of files found to replace the fd ... | wc -l usage paradigm.

(I'd be happy to implement one or both of these changes)

Jun 16 '20 22:06 aswild

@aswild Thank you for investigating. That's what I was thinking - exactly.

I'm not sure why adding extra threads would slow things down, perhaps there's a lot of lock contention happening.

That's the unsolved question, yes. There shouldn't be any lock contention. Note that the printing only happens in the "receiver" thread, not in any of the worker/sender/searcher threads. I think we should even be able to acquire the lock only once.

Adding a BufWriter for non-interactive use cases sounds good. However, I would like to know what kind of drawbacks this could have. What if we pipe the fd output to other programs where we would be interested in immediate feedback? Do we know what find does?

or fd could support a --count option which prints the number of files found to replace the fd ... | wc -l usage paradigm.

A --count option has never been requested so far. If it's not strictly necessary, I'd like to avoid adding it.

Jun 17 '20 05:06 sharkdp

Thanks @aswild

fd could support a --count option

I agree with @sharkdp on avoiding the --count option since counting isn't really the problem.

What if we pipe the fd output to other programs where we would be interested in immediate feedback

I have seen other unix tools with a -u or --unbuffered flag which forces them to write without buffering (alternatively, some applications take a --interactive flag), perhaps fd can take the same approach.

It might be too early to jump to solutions though since we don't understand the root cause of the problem

Jun 17 '20 05:06 lakshayg

What if we pipe the fd output to other programs where we would be interested in immediate feedback?

Good point. If I ran fd foo | tee results.txt I'd still want to see output line-by-line, and fd wouldn't be able to tell the difference.

the printing only happens in the "receiver" thread, not in any of the worker/sender/searcher threads.

Ah, ok. I haven't looked at fd's internals that deeply yet

It sounds like some profiling might be in order to figure out where the hot spots are.

Jun 17 '20 22:06 aswild

How do other tools solve this? I get that there are some programs where the behavior can be chosen with a flag. But what would be a reasonable default? Or is there a way to improve the situation without the drawbacks?

Jul 05 '20 11:07 sharkdp

Although, I can't remember which off the top of my head, I think reminder there being at least one program that detects if stdout is a tty as a heuristic, but has a flag to override how output is buffered.

Jul 05 '20 20:07 tmccombs

Was this ever implemented? I'd be interested in working on this.

Oct 08 '20 23:10 Spaceface16518

@Spaceface16518 It's not clear to me (yet) what exactly should be implemented. I think we need further investigation first.

Oct 25 '20 07:10 sharkdp

@sharkdp is this issue still relevant? it seems like receiver thread is already using buffered writer (see code). It even locks for the full duration it to avoid contention and hyperfine reports expected perf results. Seems like it should be closed

Oct 03 '22 18:10 arriven

@arriven Thank you for the reminder. Indeed. I can not reproduce this anymore:

hyperfine \
    --parameter-scan threads 1 10 \
    --warmup 3 \
    --export-json results-fd.json \
    --export-markdown results-fd.md \
    "fd -j {threads} --color=never --hidden --type f | wc -l"

hyperfine \
    --parameter-scan threads 1 10 \
    --warmup 3 \
    --export-json results-rg.json \
    --export-markdown results-rg.md \
    "rg -j {threads} --color=never --hidden --no-messages --files | wc -l"

hyperfine/scripts/plot_parametrized.py \
  results-fd.json \
  results-rg.json \
  --titles "fd,rg"

Nov 19 '22 21:11 sharkdp

fd fd copied to clipboard

fd performance decreases with increasing cores

fd
fd copied to clipboard