Benchmark is unfair. find(1) should not be used for grepping.
Hi,
I believe the benchmarks you provide compared to find(1) are unfair. find(1) should not be used for grepping; following the Unix principles, find(1) should just find, and grep(1) should be responsible for filtering the output of find(1).
If we pipe find(1) to grep(1), the performance is significantly faster than just using find(1):
$ time find ~ -iregex '.*[0-9]\.c$' >/dev/null
real 0m1.103s
user 0m0.881s
sys 0m0.220s
$ time find ~ | grep -i '/[^/]*[0-9]\.c$' >/dev/null
real 0m0.346s
user 0m0.103s
sys 0m0.260s
`find | grep` seems to be faster than `fdfind` in my own simple test:
$ time fdfind -HI '.*[0-9]\.c$' ~ >/dev/null
real 0m0.383s
user 0m1.920s
sys 0m6.126s
Can you please provide benchmarks against this pipeline in your readme?
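For a reproducible comparison, something like the following could be used (a sketch; it assumes hyperfine is installed, and the warmup count is arbitrary):
$ hyperfine --warmup 3 \
      "find ~ -iregex '.*[0-9]\.c$'" \
      "find ~ | grep -i '/[^/]*[0-9]\.c$'" \
      "fdfind -HI '.*[0-9]\.c$' ~"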
What version of fd are you using? [paste the output of `fd --version` here]
$ dpkg -l | grep fd-find
ii fd-find 8.7.0-3+b1 amd64 Simple, fast and user-friendly alternative to find
$ fdfind --version
fdfind 8.7.0
Just for completeness, here's my CPU:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Vendor ID: GenuineIntel
Model name: 13th Gen Intel(R) Core(TM) i9-13900T
CPU family: 6
Model: 183
Thread(s) per core: 1
Core(s) per socket: 24
...
Caches (sum of all):
L1d: 896 KiB (24 instances)
L1i: 1.3 MiB (24 instances)
L2: 32 MiB (12 instances)
L3: 36 MiB (1 instance)
Thanks!
> I believe the benchmarks you provide compared to find(1) are unfair. find(1) should not be used for grepping; following the Unix principles, find(1) should just find, and grep(1) should be responsible for filtering the output of find(1).
`find` has supported tests like `-name` since antiquity, so I think it's an exaggeration to say they "should not" be used. And in the limit, `find | grep foo` has to send arbitrarily more data through a pipe than `find -name foo` outputs, so the latter has the potential to be much faster. (There are other reasons to prefer `find -name` too, such as the fact that `find | grep` is broken for filenames that contain newlines.)
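A minimal demonstration of the newline breakage, assuming a shell with `$'...'` quoting:
$ touch $'dir\nname1.c'    # filename containing a literal newline
$ find . | grep '1\.c$'    # the embedded newline splits the path across two lines
name1.c
The "match" that gets printed is only the second half of the name, so it is not a usable path.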
Unfortunately, the regex implementation in GNU `find` (which comes from Gnulib) is quite slow, so you can easily beat it by using the optimized one from `grep`. Even just `-iname` is faster than `-iregex`, though not quite as fast as `grep`:
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `find -iregex '.*[0-9]\.c$'` | 671.9 ± 12.0 | 660.0 | 686.9 | 2.14 ± 0.04 |
| `find -iname '*[0-9].c'` | 475.2 ± 13.4 | 463.2 | 492.9 | 1.51 ± 0.05 |
| `find \| grep -i '[0-9]\.c$'` | 314.0 ± 3.2 | 309.9 | 318.7 | 1.00 |
| `fd -u '[0-9]\.c$'` | 333.2 ± 3.5 | 329.3 | 341.0 | 1.06 ± 0.02 |
| `fd -u \| grep -i '[0-9]\.c$'` | 337.2 ± 3.2 | 332.9 | 342.9 | 1.07 ± 0.02 |
This is on a checkout of rust-lang/rust (see the `bfs` benchmarks). A build of `fd` that includes https://github.com/BurntSushi/ripgrep/commit/d938e955af7a39ac0245b1ce84924f7f4a2141ac does much better:
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `find -iregex '.*[0-9]\.c$'` | 663.9 ± 13.0 | 654.1 | 688.6 | 4.55 ± 0.54 |
| `find -iname '*[0-9].c'` | 471.1 ± 13.9 | 459.8 | 489.6 | 3.23 ± 0.39 |
| `find \| grep -i '[0-9]\.c$'` | 306.4 ± 11.8 | 289.5 | 320.0 | 2.10 ± 0.26 |
| `fd -u '[0-9]\.c$'` | 145.8 ± 17.2 | 123.7 | 174.2 | 1.00 |
| `fd -u \| grep -i '[0-9]\.c$'` | 210.7 ± 11.6 | 198.5 | 238.9 | 1.45 ± 0.19 |
I agree that the benchmarks should probably be updated, and we should make the comparisons as fair as possible. #893 is relevant.
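(Tables in this format look like hyperfine's markdown export; a sketch of how such a run could be reproduced, with the warmup count and output filename as assumptions:)
$ hyperfine --warmup 2 --export-markdown results.md \
      "find -iregex '.*[0-9]\.c$'" \
      "find -iname '*[0-9].c'" \
      "find | grep -i '[0-9]\.c$'" \
      "fd -u '[0-9]\.c$'" \
      "fd -u | grep -i '[0-9]\.c$'"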
> `find` has supported tests like `-name` since antiquity, so I think it's an exaggeration to say they "should not" be used.
Relevant stuff: http://doc.cat-v.org/unix/find-history
I understand that people use them out of habit, but the design of the program is rather dubious. But yeah, I could change the "should not be used" to maybe "are suboptimal, both in performance terms, and in simplicity".
Piping to grep(1), you get faster results, and you don't even need to check the find(1) manual page for all the similar but different options that it has for grepping files.
> And in the limit, `find | grep foo` has to send arbitrarily more data through a pipe than `find -name foo` outputs, so the latter has the potential to be much faster.
The pipe shouldn't be a bottleneck. The bottleneck is usually I/O.
$ time find >/dev/null
real 0m0.327s
user 0m0.091s
sys 0m0.233s
$ time find | grep -i '[0-9]\.c$' >/dev/null
real 0m0.335s
user 0m0.098s
sys 0m0.250s
You can see that piping all filenames only adds a little bit of time to a simple find(1).
Also, real time is not the only thing that matters. fdfind(1), with the appropriate patches and optimizations, may beat `find | grep` by a slight margin in real time, as you show, but it still uses heavy CPU time; it's just that the work runs in parallel.
alx@debian:~$ time find | grep -i '[0-9]\.c$' >/dev/null
real 0m0.358s
user 0m0.098s
sys 0m0.276s
alx@debian:~$ time fdfind -u | grep -i '[0-9]\.c$' >/dev/null
real 0m0.390s
user 0m1.831s
sys 0m6.067s
alx@debian:~$ time fdfind -u '[0-9]\.c$' >/dev/null
real 0m0.385s
user 0m1.846s
sys 0m6.222s
That's fine if the bottleneck is in find(1), but if I pipe this command into a more demanding pipeline where find(1) is not the bottleneck, I fear that it may actually be slower, and it will occupy CPU that could be running other tasks.
If I have other heavy tasks running at the same time, like compiling some software, they will probably also hurt fdfind(1)'s performance significantly, while `find | grep` would likely be affected much less.
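One way to test that concern, as a sketch (it assumes stress-ng is available; the worker count and duration are arbitrary):
$ stress-ng --cpu 24 --timeout 120s &    # saturate the cores with background load
$ time find | grep -i '[0-9]\.c$' >/dev/null
$ time fdfind -u '[0-9]\.c$' >/dev/null
$ kill %1                                # stop the background load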
> (There are other reasons to prefer `find -name` too, such as the fact that `find | grep` is broken for filenames that contain newlines.)
Those filenames are already broken (they are unportable, according to POSIX https://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap04.html#tag_04_06). Please don't use them. :)
If you need them, though, you can use `find -print0 | grep -z`.
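For example, a newline-safe version of the pipeline could look like this (a sketch; `grep -z` and `xargs -0` are GNU extensions that switch to NUL-terminated records):
$ find . -print0 | grep -zi '[0-9]\.c$' | xargs -0 ls -ld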
> Unfortunately, the regex implementation in GNU `find` (which comes from Gnulib) is quite slow, so you can easily beat it by using the optimized one from `grep`. [...] I agree that the benchmarks should probably be updated, and we should make the comparisons as fair as possible. #893 is relevant.
Thanks.
> Relevant stuff: http://doc.cat-v.org/unix/find-history
Thanks for this link, very cool!
> The pipe shouldn't be a bottleneck. The bottleneck is usually I/O.
This is often true, but in my personal uses the whole tree is usually in cache (dcache, or at least buffer cache). Most of the overhead then comes from kernel and syscall overhead.
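For benchmarking, the cache state can be made explicit; a sketch (the drop_caches write requires root):
$ find ~ >/dev/null                                   # warm the dentry/inode caches
$ sync; echo 3 | sudo tee /proc/sys/vm/drop_caches    # or drop them for a cold-cache run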
> You can see that piping all filenames only adds a little bit of time to a simple find(1).
That benchmark still has `find` writing all the paths to stdout. This one shows the difference:
| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| `find -false` | 2.928 ± 0.023 | 2.892 | 2.953 | 1.00 |
| `find >/dev/null` | 3.076 ± 0.010 | 3.059 | 3.095 | 1.05 ± 0.01 |
| `find \| cat >/dev/null` | 3.129 ± 0.020 | 3.099 | 3.169 | 1.07 ± 0.01 |
Anyway, more important than things like `-name` are things like `-prune`, `-xdev`, etc. that reduce the search space in a way that post-processing cannot.
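For instance, `-prune` skips whole subtrees during the walk, which no amount of post-filtering of the output can replicate; a sketch using .git directories as the pruned subtrees:
$ # never descend into .git; only names that survive the walk are tested
$ find . -name .git -prune -o -iname '*[0-9].c' -print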
> Also, real time is not the only thing that matters. fdfind(1), with the appropriate patches and optimizations, may beat `find | grep` by a slight margin in real time, as you show, but it still uses heavy CPU time; it's just that the work runs in parallel. That's fine if the bottleneck is in find(1), but if I pipe this command into a more demanding pipeline where find(1) is not the bottleneck, I fear that it may actually be slower, and it will occupy CPU that could be running other tasks.
Agreed. `fd` tends to use a bit too much parallelism by default (see https://github.com/sharkdp/fd/issues/1203). But `fd -j1` performance is also pretty good:
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `find -iregex '.*[0-9]\.c$'` | 667.6 ± 12.7 | 655.4 | 685.7 | 2.42 ± 0.11 |
| `find -iname '*[0-9].c'` | 472.0 ± 12.3 | 460.5 | 488.1 | 1.71 ± 0.08 |
| `find \| grep -i '[0-9]\.c$'` | 312.6 ± 7.8 | 291.0 | 317.2 | 1.13 ± 0.05 |
| `fd -j1 -u '[0-9]\.c$'` | 276.1 ± 10.9 | 261.8 | 287.1 | 1.00 |
| `fd -j1 -u \| grep -i '[0-9]\.c$'` | 340.7 ± 4.5 | 335.3 | 347.1 | 1.23 ± 0.05 |
> > Relevant stuff: http://doc.cat-v.org/unix/find-history
>
> Thanks for this link, very cool!
:-)
> That benchmark still has `find` writing all the paths to stdout. This one shows the difference: [...]
A 7% loss on the pipe seems quite reasonable. How much does fdfind(1) take on, say, `fdfind` vs `fdfind .`? That is, what percentage of the time does fdfind spend filtering the output by name (but not in the regex itself)? (My own tests seem to say it's negligible.)
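A sketch of how that could be measured with hyperfine; if `fdfind -u .` runs every name through the regex engine while plain `fdfind -u` does not (an assumption about fd's internals), the gap approximates the per-name matching cost:
$ hyperfine --warmup 3 'fdfind -u' 'fdfind -u .'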
> Anyway, more important than things like `-name` are things like `-prune`, `-xdev`, etc. that reduce the search space in a way that post-processing cannot.
Yup, those things must run under find(1), but they are actually fast, aren't they? The problem with find(1)'s performance, AFAIK, is just with name filtering. Do you have benchmarks comparing fd to find(1) for those?
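A sketch of such a comparison, assuming fd's `--one-file-system` flag as the counterpart of find's `-xdev`:
$ hyperfine \
      "find / -xdev -iname '*[0-9].c'" \
      "fd -u --one-file-system '[0-9]\.c$' /"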
> Agreed. `fd` tends to use a bit too much parallelism by default (see #1203). But `fd -j1` performance is also pretty good: [...]
Interesting; `-j1` is good.
`find` is running faster than `fd` just now in the hyperfine example
@jgardona `find` completing in 2.9 ms means that's probably a very small directory tree. I'd imagine that most of the time taken by `fd` is startup overhead, maybe due to #1203
Tested it on the /usr/bin directory.
Unless you have an incredibly large number of executables installed, /usr/bin won't be very large, relatively speaking. It is also pretty flat, which I think hinders the parallelizability.
Well, it's near 3,000 executables.
@jgardona
That's a small thing, actually. Compare to this:
alx@debian:~$ time find ~ | wc -l
660487
real 0m0.338s
user 0m0.080s
sys 0m0.273s
alx@debian:~$ time find ~ -type f | wc -l
603577
real 0m0.344s
user 0m0.096s
sys 0m0.262s
alx@debian:~$ time find ~ -type d | wc -l
53914
real 0m0.314s
user 0m0.091s
sys 0m0.223s
:)