fd
fd copied to clipboard
fd performance decreases with increasing cores
On a fresh clone on the chromium repo, if I try to list files using fd
, here's the result I get
$ time fd --color=never --hidden --type f | wc -l
410215
fd --color=never --hidden --type f 1.39s user 1.03s system 496% cpu 0.487 total
When I use rg
for the same purpose
$ time rg --color=never --hidden --no-messages --files | wc -l
410215
rg --color=never --hidden --no-messages --files 1.86s user 0.67s system 737% cpu 0.343 total
Notice how the %cpu used in case of rg is higher than with fd. My CPU has 8 cores and the usage reported for rg is what I would expect if it is using all my cores. I suspected that fd
might not be detecting the number of cores in my machine properly so I tried using the -j
flag to specify the number of cores manually. Here are some interesting results
$ for threads in `seq 1 8`; do time fd -j "$threads" --color=never --hidden --type f | wc -l; done
$ for threads in `seq 1 8`; do time rg -j "$threads" --color=never --hidden --no-messages --files | wc -l; done
threads | fd CPU Utilization | rg CPU Utilization |
---|---|---|
1 | 133% | 99% |
2 | 256% | 239% |
3 | 371% | 352% |
4 | 485% | 457% |
5 | 548% | 560% |
6 | 577% | 663% |
7 | 498% | 750% |
8 | 519% | 747% |
The user time seems to be correlated to the CPU utilization. This also ends up in fd
being slightly slower than rg
on my machine. Is there an explanation for this behaviour of fd
?
OS: Ubuntu 14.04 CPU: 8 x Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz RAM: 64G
$ fd --version
fd 8.1.1
$ rg --version
ripgrep 12.1.0
-SIMD -AVX (compiled)
+SIMD +AVX (runtime)
Ok, a couple of things:
- Where can I get the chromium repo to reproduce this? I cloned something from GitHub but I get a completely different file count.
- You are running
fd
andrg
without the--no-ignore
options. This means that both tools will look for.rgignore
or.fdignore
files on your system. For me, the two file counts differed because I have a~/.rgignore
file. Even if the file counts are the same for you, that's something worth checking out to make the benchmark fair. - I can't tell if you are properly running your benchmarks. Measuring the execution time of I/O heavy programs is not trivial. (1) Results are highly dependent on whether or not the disk cache is warm. If you run program A first, then B, you will see better times for B because the cache is warm. Did you correct for that? (2) I/O heavy programs are influenced by all kinds of processes running in the background. Are you making sure that your system is quiet and the comparison is fair?
Notice how the %cpu used in case of rg is higher than with fd. My CPU has 8 cores and the usage reported for rg is what I would expect if it is using all my cores.
That might very well be a caching effect. fd
is typically I/O-bound, not CPU bound.
Here are some interesting results
What exactly is interesting about these results? I don't see any error bars on these numbers, so I can't tell how representative they are. This could be just random fluctuations around j * 100%
?
I have performed the following warm-cache benchmark with hyperfine
:
hyperfine \
--parameter-scan threads 1 10 \
--warmup 3 \
--export-json results-fd.json \
--export-markdown results-fd.md \
"fd -j {threads} --color=never --hidden --type f"
hyperfine \
--parameter-scan threads 1 10 \
--warmup 3 \
--export-json results-rg.json \
--export-markdown results-rg.md \
"rg -j {threads} --color=never --hidden --no-messages --files"
# plot results:
hyperfine/scripts/plot_parametrized.py \
results-fd.json \
results-rg.json \
--titles "fd,rg" \
--parameter-name "number of threads"
This leads to the following results on my Laptop (8 x Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz, same rg
and fd
versions as you):
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
fd -j 1 --color=never --hidden --type f |
1.378 ± 0.009 | 1.364 | 1.390 | 3.48 ± 0.02 |
fd -j 2 --color=never --hidden --type f |
0.750 ± 0.004 | 0.744 | 0.754 | 1.90 ± 0.01 |
fd -j 3 --color=never --hidden --type f |
0.543 ± 0.012 | 0.519 | 0.563 | 1.37 ± 0.03 |
fd -j 4 --color=never --hidden --type f |
0.462 ± 0.023 | 0.436 | 0.518 | 1.17 ± 0.06 |
fd -j 5 --color=never --hidden --type f |
0.427 ± 0.005 | 0.421 | 0.437 | 1.08 ± 0.01 |
fd -j 6 --color=never --hidden --type f |
0.412 ± 0.008 | 0.405 | 0.432 | 1.04 ± 0.02 |
fd -j 7 --color=never --hidden --type f |
0.396 ± 0.001 | 0.394 | 0.398 | 1.00 |
fd -j 8 --color=never --hidden --type f |
0.411 ± 0.010 | 0.399 | 0.432 | 1.04 ± 0.02 |
fd -j 9 --color=never --hidden --type f |
0.418 ± 0.012 | 0.395 | 0.431 | 1.06 ± 0.03 |
fd -j 10 --color=never --hidden --type f |
0.427 ± 0.019 | 0.404 | 0.472 | 1.08 ± 0.05 |
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
rg -j 1 --color=never --hidden --no-messages --files |
1.456 ± 0.012 | 1.439 | 1.477 | 3.17 ± 0.03 |
rg -j 2 --color=never --hidden --no-messages --files |
0.947 ± 0.005 | 0.937 | 0.956 | 2.06 ± 0.02 |
rg -j 3 --color=never --hidden --no-messages --files |
0.696 ± 0.017 | 0.671 | 0.725 | 1.52 ± 0.04 |
rg -j 4 --color=never --hidden --no-messages --files |
0.574 ± 0.021 | 0.546 | 0.619 | 1.25 ± 0.05 |
rg -j 5 --color=never --hidden --no-messages --files |
0.525 ± 0.018 | 0.507 | 0.549 | 1.14 ± 0.04 |
rg -j 6 --color=never --hidden --no-messages --files |
0.485 ± 0.005 | 0.480 | 0.496 | 1.06 ± 0.01 |
rg -j 7 --color=never --hidden --no-messages --files |
0.460 ± 0.002 | 0.457 | 0.465 | 1.00 |
rg -j 8 --color=never --hidden --no-messages --files |
0.460 ± 0.002 | 0.458 | 0.466 | 1.00 ± 0.01 |
rg -j 9 --color=never --hidden --no-messages --files |
0.465 ± 0.008 | 0.459 | 0.485 | 1.01 ± 0.02 |
rg -j 10 --color=never --hidden --no-messages --files |
0.473 ± 0.010 | 0.460 | 0.486 | 1.03 ± 0.02 |
fd
is consistently 5%-20% faster. Both programs have a minimum execution time for j=7 threads, which could be explained by the fact that we have one receiver thread in addition. It collects all results and prints them. Given that this benchmark is about listing all files, a lot of printing needs to be done and this thread could be quite busy, making j=N-1 the optimal amount of worker threads.
I'm happy to investigate further if you can clearly show that something is not as fast as expected in your setting.
Thanks for the quick response
Where can I get the chromium repo to reproduce this? I cloned something from GitHub but I get a completely different file count
Sorry, my bad, I just realized that I was running the commands outside the chromium repo so there were other files too. But re-running on the repo yields similar results.
You are running fd and rg without the --no-ignore options. This means that both tools will look for .rgignore or .fdignore files on your system. For me, the two file counts differed because I have a ~/.rgignore file. Even if the file counts are the same for you, that's something worth checking out to make the benchmark fair.
My rgignore and fdignore files are identical, they contain .git
and .cache
.
(1) Results are highly dependent on whether or not the disk cache is warm (2) I/O heavy programs are influenced by all kinds of processes running in the background. Are you making sure that your system is quiet and the comparison is fair
Agreed. I didn't know what what would be a good way to eliminate the impact of these factors so I ran both the commands several times (> 20) in different orders but yes, my approach wasn't very systematic. My system did have some browser windows open but there weren't any IO intensive processes running at that time. I don't know how to tell if the comparison was fair but I would expect it to be because I didn't start any other processes while I was testing the performance. All the measurements were taken at the same time.
That might very well be a caching effect. fd is typically I/O-bound, not CPU bound.
Wouldn't that be the case for rg
too?
What exactly is interesting about these results?
I was referring to the observation that the cpu utilization was not monotonically increasing for fd
.
I took a look at hyperfine, very cool! I used it to run some benchmarks and found something interesting.
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
fd -j 1 --color=never --hidden --type f |
994.2 ± 11.9 | 980.8 | 1014.8 | 3.23 ± 0.06 |
fd -j 2 --color=never --hidden --type f |
615.3 ± 39.6 | 561.7 | 677.1 | 2.00 ± 0.13 |
fd -j 3 --color=never --hidden --type f |
517.9 ± 5.6 | 509.9 | 526.8 | 1.68 ± 0.03 |
fd -j 4 --color=never --hidden --type f |
421.7 ± 3.2 | 418.7 | 428.8 | 1.37 ± 0.02 |
fd -j 5 --color=never --hidden --type f |
361.4 ± 5.1 | 352.4 | 367.4 | 1.17 ± 0.02 |
fd -j 6 --color=never --hidden --type f |
327.4 ± 3.9 | 322.5 | 333.5 | 1.06 ± 0.02 |
fd -j 7 --color=never --hidden --type f |
307.8 ± 4.4 | 302.2 | 313.9 | 1.00 |
fd -j 8 --color=never --hidden --type f |
308.3 ± 3.1 | 302.8 | 314.0 | 1.00 ± 0.02 |
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
fd -j 1 --color=never --hidden --type f | wc -l |
1.135 ± 0.079 | 1.045 | 1.246 | 2.97 ± 0.21 |
fd -j 2 --color=never --hidden --type f | wc -l |
0.723 ± 0.007 | 0.712 | 0.733 | 1.89 ± 0.03 |
fd -j 3 --color=never --hidden --type f | wc -l |
0.515 ± 0.004 | 0.510 | 0.521 | 1.35 ± 0.02 |
fd -j 4 --color=never --hidden --type f | wc -l |
0.413 ± 0.002 | 0.409 | 0.416 | 1.08 ± 0.02 |
fd -j 5 --color=never --hidden --type f | wc -l |
0.383 ± 0.005 | 0.374 | 0.390 | 1.00 ± 0.02 |
fd -j 6 --color=never --hidden --type f | wc -l |
0.382 ± 0.005 | 0.373 | 0.393 | 1.00 |
fd -j 7 --color=never --hidden --type f | wc -l |
0.470 ± 0.007 | 0.452 | 0.476 | 1.23 ± 0.03 |
fd -j 8 --color=never --hidden --type f | wc -l |
0.472 ± 0.011 | 0.459 | 0.489 | 1.23 ± 0.03 |
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
rg -j 1 --color=never --hidden --no-messages --files |
1.018 ± 0.006 | 1.008 | 1.027 | 3.11 ± 0.06 |
rg -j 2 --color=never --hidden --no-messages --files |
0.723 ± 0.033 | 0.684 | 0.794 | 2.21 ± 0.11 |
rg -j 3 --color=never --hidden --no-messages --files |
0.618 ± 0.006 | 0.607 | 0.623 | 1.89 ± 0.04 |
rg -j 4 --color=never --hidden --no-messages --files |
0.489 ± 0.004 | 0.484 | 0.496 | 1.50 ± 0.03 |
rg -j 5 --color=never --hidden --no-messages --files |
0.410 ± 0.003 | 0.403 | 0.415 | 1.25 ± 0.03 |
rg -j 6 --color=never --hidden --no-messages --files |
0.360 ± 0.003 | 0.353 | 0.363 | 1.10 ± 0.02 |
rg -j 7 --color=never --hidden --no-messages --files |
0.327 ± 0.006 | 0.315 | 0.336 | 1.00 |
rg -j 8 --color=never --hidden --no-messages --files |
0.331 ± 0.004 | 0.326 | 0.337 | 1.01 ± 0.02 |
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
rg -j 1 --color=never --hidden --no-messages --files | wc -l |
1.028 ± 0.004 | 1.023 | 1.034 | 3.10 ± 0.06 |
rg -j 2 --color=never --hidden --no-messages --files | wc -l |
0.897 ± 0.016 | 0.871 | 0.928 | 2.71 ± 0.07 |
rg -j 3 --color=never --hidden --no-messages --files | wc -l |
0.632 ± 0.008 | 0.618 | 0.644 | 1.91 ± 0.04 |
rg -j 4 --color=never --hidden --no-messages --files | wc -l |
0.490 ± 0.004 | 0.482 | 0.495 | 1.48 ± 0.03 |
rg -j 5 --color=never --hidden --no-messages --files | wc -l |
0.411 ± 0.003 | 0.406 | 0.414 | 1.24 ± 0.03 |
rg -j 6 --color=never --hidden --no-messages --files | wc -l |
0.363 ± 0.005 | 0.351 | 0.369 | 1.10 ± 0.03 |
rg -j 7 --color=never --hidden --no-messages --files | wc -l |
0.331 ± 0.007 | 0.318 | 0.339 | 1.00 |
rg -j 8 --color=never --hidden --no-messages --files | wc -l |
0.332 ± 0.007 | 0.318 | 0.340 | 1.00 ± 0.03 |
fd
becomes slower when piping out the results to wc
but rg
seems to be unaffected by that. Without wc
, fd
performs slightly better than rg
.
I can try to do more systematic benchmarks if you can guide me on how to do that
That is interesting indeed!
I can reproduce this:
I have a feeling this might have something to do with the way we write output to the terminal (buffered vs unbuffered). We should definitely take a look at that. Thank you for reporting this and for the detailed analysis!
Wouldn't that be the case for
rg
too?
yes.
In a vacuum, it looks like wrapping stdout in a BufWriter can make a huge difference. I wrote this test program that prints "Hello World" a bunch.
use std::io::{self, stdout, Write, BufWriter};
fn main() -> io::Result<()> {
let args: Vec::<_> = std::env::args().collect();
let mut output: Box<dyn Write> = match args[1].as_str() {
"b" => Box::new(BufWriter::new(stdout())),
"u" => Box::new(stdout()),
_ => panic!("invalid arg"),
};
let count: usize = args[2].parse().unwrap();
let message = b"Hello World\n";
for _ in 0..count {
output.write(message)?;
}
Ok(())
}
According to a couple hyperfine benchmarks, the buffered version is about 50 times faster than the unbuffered version. I'm not sure why adding extra threads would slow things down, perhaps there's a lot of lock contention happening.
It looks like fd
currently prints using the object returned by io::stdout().lock()
without any extra buffering added.
That could be changed to a Box<dyn Write>
trait object that's set up based on whether stdout is a TTY, or fd
could support a --count
option which prints the number of files found to replace the fd ... | wc -l
usage paradigm.
(I'd be happy to implement one or both of these changes)
@aswild Thank you for investigating. That's what I was thinking - exactly.
I'm not sure why adding extra threads would slow things down, perhaps there's a lot of lock contention happening.
That's the unsolved question, yes. There shouldn't be any lock contention. Note that the printing only happens in the "receiver" thread, not in any of the worker/sender/searcher threads. I think we should even be able to acquire the lock only once.
Adding a BufWriter
for non-interactive use cases sounds good. However, I would like to know what kind of drawbacks this could have. What if we pipe the fd
output to other programs where we would be interested in immediate feedback? Do we know what find
does?
or
fd
could support a--count
option which prints the number of files found to replace thefd ... | wc -l
usage paradigm.
A --count
option has never been requested so far. If it's not strictly necessary, I'd like to avoid adding it.
Thanks @aswild
fd could support a --count option
I agree with @sharkdp on avoiding the --count
option since counting isn't really the problem.
What if we pipe the fd output to other programs where we would be interested in immediate feedback
I have seen other unix tools with a -u
or --unbuffered
flag which forces them to write without buffering (alternatively, some applications take a --interactive
flag), perhaps fd
can take the same approach.
It might be too early to jump to solutions though since we don't understand the root cause of the problem
What if we pipe the
fd
output to other programs where we would be interested in immediate feedback?
Good point. If I ran fd foo | tee results.txt
I'd still want to see output line-by-line, and fd
wouldn't be able to tell the difference.
the printing only happens in the "receiver" thread, not in any of the worker/sender/searcher threads.
Ah, ok. I haven't looked at fd's internals that deeply yet
It sounds like some profiling might be in order to figure out where the hot spots are.
How do other tools solve this? I get that there are some programs where the behavior can be chosen with a flag. But what would be a reasonable default? Or is there a way to improve the situation without the drawbacks?
Although, I can't remember which off the top of my head, I think reminder there being at least one program that detects if stdout is a tty as a heuristic, but has a flag to override how output is buffered.
Was this ever implemented? I'd be interested in working on this.
@Spaceface16518 It's not clear to me (yet) what exactly should be implemented. I think we need further investigation first.
@sharkdp is this issue still relevant? it seems like receiver thread is already using buffered writer (see code). It even locks for the full duration it to avoid contention and hyperfine reports expected perf results. Seems like it should be closed
@arriven Thank you for the reminder. Indeed. I can not reproduce this anymore:
hyperfine \
--parameter-scan threads 1 10 \
--warmup 3 \
--export-json results-fd.json \
--export-markdown results-fd.md \
"fd -j {threads} --color=never --hidden --type f | wc -l"
hyperfine \
--parameter-scan threads 1 10 \
--warmup 3 \
--export-json results-rg.json \
--export-markdown results-rg.md \
"rg -j {threads} --color=never --hidden --no-messages --files | wc -l"
hyperfine/scripts/plot_parametrized.py \
results-fd.json \
results-rg.json \
--titles "fd,rg"