
Benchmark is unfair. find(1) should not be used for grepping.

Open alejandro-colomar opened this issue 9 months ago • 10 comments

Hi,

I believe the benchmarks you provide compared to find(1) are unfair. find(1) should not be used for grepping; following the Unix principles, find(1) should just find, and grep(1) should be responsible for filtering the output of find(1).

If we pipe find(1) to grep(1), it is significantly faster than using find(1) alone:

$ time find ~ -iregex '.*[0-9]\.c$' >/dev/null

real	0m1.103s
user	0m0.881s
sys	0m0.220s
$ time find ~ | grep -i '/[^/]*[0-9]\.c$' >/dev/null

real	0m0.346s
user	0m0.103s
sys	0m0.260s

find | grep seems to be faster than fdfind in my own simple test:

$ time fdfind -HI '.*[0-9]\.c$' ~ >/dev/null

real	0m0.383s
user	0m1.920s
sys	0m6.126s

Can you please provide benchmarks against this pipeline in your readme?
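For reference, a quick sanity check that the two patterns above select the same files. This is only a sketch in a throwaway temporary directory, with made-up file names; the point is that the grep pattern anchors the digit to the basename, like find's whole-path regex does:

```shell
# Compare find's -iregex over whole paths with the basename-anchored
# grep pattern. (Throwaway directory; file names are hypothetical.)
dir=$(mktemp -d)
: > "$dir/test1.c"      # digit right before .c -> should match
: > "$dir/file.c"       # no digit              -> should not match
: > "$dir/notes2.txt"   # wrong extension       -> should not match

by_find=$(find "$dir" -iregex '.*[0-9]\.c$' | wc -l)
by_grep=$(find "$dir" | grep -ic '/[^/]*[0-9]\.c$')

echo "find: $by_find, grep: $by_grep"   # both select only test1.c
rm -rf "$dir"
```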

What version of fd are you using? [paste the output of fd --version here]

$ dpkg -l | grep fd-find
ii  fd-find                               8.7.0-3+b1                               amd64        Simple, fast and user-friendly alternative to find
$ fdfind --version
fdfind 8.7.0

Just for completeness, here's my CPU:

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  24
  On-line CPU(s) list:   0-23
Vendor ID:               GenuineIntel
  Model name:            13th Gen Intel(R) Core(TM) i9-13900T
    CPU family:          6
    Model:               183
    Thread(s) per core:  1
    Core(s) per socket:  24
...
Caches (sum of all):     
  L1d:                   896 KiB (24 instances)
  L1i:                   1.3 MiB (24 instances)
  L2:                    32 MiB (12 instances)
  L3:                    36 MiB (1 instance)

Thanks!

alejandro-colomar avatar Oct 13 '23 12:10 alejandro-colomar

I believe the benchmarks you provide compared to find(1) are unfair. find(1) should not be used for grepping; following the Unix principles, find(1) should just find, and grep(1) should be responsible for filtering the output of find(1).

find has supported tests like -name since antiquity, so I think it's an exaggeration to say they "should not" be used. And in the limit, find | grep foo has to send arbitrarily more data through a pipe than find -name foo outputs, so the latter has the potential to be much faster. (There are other reasons to prefer find -name too, such as the fact that find | grep is broken for filenames that contain newlines.)

Unfortunately, the regex implementation in GNU find (which comes from Gnulib) is quite slow, so you can easily beat it by using the optimized one from grep. Even just -iname is faster than -iregex, though not quite as fast as grep:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `find -iregex '.*[0-9]\.c$'` | 671.9 ± 12.0 | 660.0 | 686.9 | 2.14 ± 0.04 |
| `find -iname '*[0-9].c'` | 475.2 ± 13.4 | 463.2 | 492.9 | 1.51 ± 0.05 |
| `find \| grep -i '[0-9]\.c$'` | 314.0 ± 3.2 | 309.9 | 318.7 | 1.00 |
| `fd -u '[0-9]\.c$'` | 333.2 ± 3.5 | 329.3 | 341.0 | 1.06 ± 0.02 |
| `fd -u \| grep -i '[0-9]\.c$'` | 337.2 ± 3.2 | 332.9 | 342.9 | 1.07 ± 0.02 |

This is on a checkout of rust-lang/rust (see the bfs benchmarks). A build of fd that includes https://github.com/BurntSushi/ripgrep/commit/d938e955af7a39ac0245b1ce84924f7f4a2141ac does much better:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `find -iregex '.*[0-9]\.c$'` | 663.9 ± 13.0 | 654.1 | 688.6 | 4.55 ± 0.54 |
| `find -iname '*[0-9].c'` | 471.1 ± 13.9 | 459.8 | 489.6 | 3.23 ± 0.39 |
| `find \| grep -i '[0-9]\.c$'` | 306.4 ± 11.8 | 289.5 | 320.0 | 2.10 ± 0.26 |
| `fd -u '[0-9]\.c$'` | 145.8 ± 17.2 | 123.7 | 174.2 | 1.00 |
| `fd -u \| grep -i '[0-9]\.c$'` | 210.7 ± 11.6 | 198.5 | 238.9 | 1.45 ± 0.19 |

I agree that the benchmarks should probably be updated, and we should make the comparisons as fair as possible. #893 is relevant.
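As a quick sanity check that the glob and the regex in the tables above agree on what they match, here is a sketch in a throwaway directory with hypothetical file names:

```shell
# -iname uses a case-insensitive glob on the basename; the grep regex
# matches case-insensitively at the end of the path. Both should
# select the same two files here. (Throwaway directory; names made up.)
dir=$(mktemp -d)
: > "$dir/test1.c"   # digit right before .c -> matches
: > "$dir/file.c"    # no digit              -> no match
: > "$dir/NOTE2.C"   # matches case-insensitively

by_glob=$(find "$dir" -iname '*[0-9].c' | wc -l)
by_regex=$(find "$dir" | grep -ic '[0-9]\.c$')

echo "glob: $by_glob, regex: $by_regex"   # both find 2 files
rm -rf "$dir"
```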

tavianator avatar Oct 13 '23 13:10 tavianator

I believe the benchmarks you provide compared to find(1) are unfair. find(1) should not be used for grepping; following the Unix principles, find(1) should just find, and grep(1) should be responsible for filtering the output of find(1).

find has supported tests like -name since antiquity, so I think it's an exaggeration to say they "should not" be used.

Relevant stuff: http://doc.cat-v.org/unix/find-history

I understand that people use them out of habit, but the design of the program is rather dubious. But yeah, I could change "should not be used" to something like "are suboptimal, in terms of both performance and simplicity".

Piping to grep(1), you get faster results, and you don't even need to check the find(1) manual page for all the similar but different options that it has for grepping files.

And in the limit, find | grep foo has to send arbitrarily more data through a pipe than find -name foo outputs, so the latter has the potential to be much faster.

The pipe shouldn't be a bottleneck. The bottleneck is usually I/O.

$ time find >/dev/null

real	0m0.327s
user	0m0.091s
sys	0m0.233s
$ time find | grep -i '[0-9]\.c$' >/dev/null

real	0m0.335s
user	0m0.098s
sys	0m0.250s

You can see that piping all filenames only adds a little bit of time to a simple find(1).

Also, real time is not the only thing that matters. fdfind(1), with the appropriate patches and optimizations, may beat find | grep by a slight margin in real time, as you show, but it still uses heavy CPU time; it's just that it runs in parallel.

alx@debian:~$ time find | grep -i '[0-9]\.c$' >/dev/null

real	0m0.358s
user	0m0.098s
sys	0m0.276s
alx@debian:~$ time fdfind -u | grep -i '[0-9]\.c$' >/dev/null

real	0m0.390s
user	0m1.831s
sys	0m6.067s
alx@debian:~$ time fdfind -u '[0-9]\.c$' >/dev/null

real	0m0.385s
user	0m1.846s
sys	0m6.222s

That's fine if the bottleneck is in find(1), but if I pipe this command into a more demanding pipeline where find(1) is not the bottleneck, I fear that it may actually be slower, and it will occupy CPU that could be running other tasks.

If I have other heavy tasks running at the same time, like compiling some software, they will probably hurt fdfind(1)'s performance significantly, while find | grep probably wouldn't be affected as much.

(There are other reasons to prefer find -name too, such as the fact that find | grep is broken for filenames that contain newlines.)

Those filenames are already broken (they are unportable, according to POSIX https://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap04.html#tag_04_06). Please don't use them. :)

If you need them, though, you can use find -print0 | grep -z.
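A minimal sketch of both the breakage and the fix, in a throwaway directory with a deliberately contrived file name (assumes GNU grep for -z):

```shell
# One file whose name contains a newline. (Throwaway directory;
# the name is contrived for illustration.)
dir=$(mktemp -d)
: > "$dir/a.c
b.txt"

# Line-oriented grep sees the single path as two lines, and reports
# a ".c" file that does not actually exist:
lines=$(find "$dir" -type f | grep -c '\.c$')

# NUL-delimited records keep the path intact; the real name ends in
# ".txt", so there is correctly no match (grep exits 1, hence || true):
records=$(find "$dir" -type f -print0 | grep -zc '\.c$' || true)

echo "line-based matches: $lines, NUL-based matches: $records"
rm -rf "$dir"
```

The line-based pipeline reports one false match; the NUL-based one reports none.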

Unfortunately, the regex implementation in GNU find (which comes from Gnulib) is quite slow, so you can easily beat it by using the optimized one from grep. Even just -iname is faster than -iregex, though not quite as fast as grep:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `find -iregex '.*[0-9]\.c$'` | 671.9 ± 12.0 | 660.0 | 686.9 | 2.14 ± 0.04 |
| `find -iname '*[0-9].c'` | 475.2 ± 13.4 | 463.2 | 492.9 | 1.51 ± 0.05 |
| `find \| grep -i '[0-9]\.c$'` | 314.0 ± 3.2 | 309.9 | 318.7 | 1.00 |
| `fd -u '[0-9]\.c$'` | 333.2 ± 3.5 | 329.3 | 341.0 | 1.06 ± 0.02 |
| `fd -u \| grep -i '[0-9]\.c$'` | 337.2 ± 3.2 | 332.9 | 342.9 | 1.07 ± 0.02 |

This is on a checkout of rust-lang/rust (see the bfs benchmarks). A build of fd that includes BurntSushi/ripgrep@d938e95 does much better:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `find -iregex '.*[0-9]\.c$'` | 663.9 ± 13.0 | 654.1 | 688.6 | 4.55 ± 0.54 |
| `find -iname '*[0-9].c'` | 471.1 ± 13.9 | 459.8 | 489.6 | 3.23 ± 0.39 |
| `find \| grep -i '[0-9]\.c$'` | 306.4 ± 11.8 | 289.5 | 320.0 | 2.10 ± 0.26 |
| `fd -u '[0-9]\.c$'` | 145.8 ± 17.2 | 123.7 | 174.2 | 1.00 |
| `fd -u \| grep -i '[0-9]\.c$'` | 210.7 ± 11.6 | 198.5 | 238.9 | 1.45 ± 0.19 |

I agree that the benchmarks should probably be updated, and we should make the comparisons as fair as possible. #893 is relevant.

Thanks.

alejandro-colomar avatar Oct 13 '23 14:10 alejandro-colomar

Relevant stuff: http://doc.cat-v.org/unix/find-history

Thanks for this link, very cool!

The pipe shouldn't be a bottleneck. The bottleneck is usually I/O.

This is often true, but in my personal uses the whole tree is usually in cache (dcache, or at least buffer cache). Most of the overhead then comes from kernel and syscall overhead.

You can see that piping all filenames only adds a little bit of time to a simple find(1).

That benchmark still has find writing all the paths to stdout. This one shows the difference:

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `find -false` | 2.928 ± 0.023 | 2.892 | 2.953 | 1.00 |
| `find >/dev/null` | 3.076 ± 0.010 | 3.059 | 3.095 | 1.05 ± 0.01 |
| `find \| cat >/dev/null` | 3.129 ± 0.020 | 3.099 | 3.169 | 1.07 ± 0.01 |

Anyway, more important than things like -name are things like -prune, -xdev, etc. that reduce the search space in a way that post-processing cannot.
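A small sketch of that difference, in a throwaway directory with a hypothetical layout: -prune never descends into the excluded subtree, while the post-filtering pipeline walks it and then throws the results away.

```shell
# Hypothetical layout: a source tree with a bulky .git-like directory
# we want to exclude. (Throwaway directory; names made up.)
dir=$(mktemp -d)
mkdir -p "$dir/src" "$dir/.git/objects"
: > "$dir/src/main.c"
: > "$dir/.git/objects/deadbeef"

# Post-filtering: find still descends into .git; grep discards it after.
post=$(find "$dir" | grep -vc '/\.git')

# Pruning: find never enters .git at all (and doesn't print it either).
pruned=$(find "$dir" -name .git -prune -o -print | wc -l)

echo "post-filter kept $post paths, prune kept $pruned paths"
rm -rf "$dir"
```

Both report the same three surviving paths here, but only -prune saves the traversal work, which is what post-processing cannot do.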

Also, real time is not the only thing that matters. fdfind(1), with the appropriate patches and optimizations, may beat find | grep by a slight margin in real time, as you show, but it still uses heavy CPU time; it's just that it runs in parallel.

That's fine if the bottleneck is in find(1), but if I pipe this command into a more demanding pipeline where find(1) is not the bottleneck, I fear that it may actually be slower, and it will occupy CPU that could be running other tasks.

Agreed. fd tends to use a bit too much parallelism by default (see https://github.com/sharkdp/fd/issues/1203). But fd -j1 performance is also pretty good:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `find -iregex '.*[0-9]\.c$'` | 667.6 ± 12.7 | 655.4 | 685.7 | 2.42 ± 0.11 |
| `find -iname '*[0-9].c'` | 472.0 ± 12.3 | 460.5 | 488.1 | 1.71 ± 0.08 |
| `find \| grep -i '[0-9]\.c$'` | 312.6 ± 7.8 | 291.0 | 317.2 | 1.13 ± 0.05 |
| `fd -j1 -u '[0-9]\.c$'` | 276.1 ± 10.9 | 261.8 | 287.1 | 1.00 |
| `fd -j1 -u \| grep -i '[0-9]\.c$'` | 340.7 ± 4.5 | 335.3 | 347.1 | 1.23 ± 0.05 |

tavianator avatar Oct 13 '23 15:10 tavianator

Relevant stuff: http://doc.cat-v.org/unix/find-history

Thanks for this link, very cool!

:-)

The pipe shouldn't be a bottleneck. The bottleneck is usually I/O.

This is often true, but in my personal uses the whole tree is usually in cache (dcache, or at least buffer cache). Most of the overhead then comes from kernel and syscall overhead.

You can see that piping all filenames only adds a little bit of time to a simple find(1).

That benchmark still has find writing all the paths to stdout. This one shows the difference:

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `find -false` | 2.928 ± 0.023 | 2.892 | 2.953 | 1.00 |
| `find >/dev/null` | 3.076 ± 0.010 | 3.059 | 3.095 | 1.05 ± 0.01 |
| `find \| cat >/dev/null` | 3.129 ± 0.020 | 3.099 | 3.169 | 1.07 ± 0.01 |

A 7% loss on the pipe seems quite reasonable. How much does fdfind(1) lose on, say, fdfind vs fdfind .? That is, what percentage of its time does fdfind spend filtering the output by name (excluding the regex match itself)? (My own tests suggest it's negligible.)

Anyway, more important than things like -name are things like -prune, -xdev, etc. that reduce the search space in a way that post-processing cannot.

Yup, those things must run under find(1), but they are actually fast, aren't they? The problem with find(1)'s performance, AFAIK, is just name filtering. Do you have benchmarks for those options compared to find(1)?

Also, real time is not the only thing that matters. fdfind(1), with the appropriate patches and optimizations, may beat find | grep by a slight margin in real time, as you show, but it still uses heavy CPU time; it's just that it runs in parallel.

That's fine if the bottleneck is in find(1), but if I pipe this command into a more demanding pipeline where find(1) is not the bottleneck, I fear that it may actually be slower, and it will occupy CPU that could be running other tasks.

Agreed. fd tends to use a bit too much parallelism by default (see #1203). But fd -j1 performance is also pretty good:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `find -iregex '.*[0-9]\.c$'` | 667.6 ± 12.7 | 655.4 | 685.7 | 2.42 ± 0.11 |
| `find -iname '*[0-9].c'` | 472.0 ± 12.3 | 460.5 | 488.1 | 1.71 ± 0.08 |
| `find \| grep -i '[0-9]\.c$'` | 312.6 ± 7.8 | 291.0 | 317.2 | 1.13 ± 0.05 |
| `fd -j1 -u '[0-9]\.c$'` | 276.1 ± 10.9 | 261.8 | 287.1 | 1.00 |
| `fd -j1 -u \| grep -i '[0-9]\.c$'` | 340.7 ± 4.5 | 335.3 | 347.1 | 1.23 ± 0.05 |

Interesting; -j1 is good.

alejandro-colomar avatar Oct 13 '23 22:10 alejandro-colomar

find is running faster than fd just now in this hyperfine example:

[screenshot]

jgardona avatar Oct 14 '23 12:10 jgardona

@jgardona find completing in 2.9 ms means that's probably a very small directory tree. I'd imagine that most of the time taken by fd is startup overhead, maybe due to #1203

tavianator avatar Oct 14 '23 15:10 tavianator

@jgardona find completing in 2.9 ms means that's probably a very small directory tree. I'd imagine that most of the time taken by fd is startup overhead, maybe due to #1203

Tested it on the /usr/bin directory:

[screenshot]

jgardona avatar Oct 14 '23 17:10 jgardona

Unless you have an incredibly large number of executables installed, /usr/bin won't be very large, relatively speaking. It is also pretty flat, which I think hinders parallelizability.

tmccombs avatar Oct 14 '23 23:10 tmccombs

Well, it's nearly 3,000 executables.

jgardona avatar Oct 15 '23 02:10 jgardona

@jgardona

That's small, actually. Compare it to this:

alx@debian:~$ time find ~ | wc -l
660487

real	0m0.338s
user	0m0.080s
sys	0m0.273s
alx@debian:~$ time find ~ -type f | wc -l
603577

real	0m0.344s
user	0m0.096s
sys	0m0.262s
alx@debian:~$ time find ~ -type d | wc -l
53914

real	0m0.314s
user	0m0.091s
sys	0m0.223s

:)

alejandro-colomar avatar Oct 15 '23 10:10 alejandro-colomar