
Allow `--exec-batch` to be used in parallel

Architector4 opened this issue 4 months ago • 6 comments

The --exec-batch parameter runs the specified command multiple times over the biggest possible batches of file names. However, if a lot of files are to be found, and this parameter is used, it very easily becomes the bottleneck.

Batches always (appear to) run sequentially, ignoring the --threads parameter. If the command takes a long time (which tends to happen when you throw 10000 filenames at a single program lol), then every invocation just sits there until it is done, maxing out a single CPU core while 15 other cores and fd itself lie dormant.

It'd be nice to be able to get the parallelism benefits of --exec with the batching benefits of --exec-batch.


My particular use case right now is that I'm bored and I want to scan a lot of files with ClamAV. Its clamscan tool appears to run through files completely sequentially too, running on one CPU, and also always takes a few seconds on startup just to load its databases. Upon running, it seems to consume at most ~1.5GiB of RAM. I have 64GiB RAM and 16 cores, so it seems to be well within budget to run it in parallel on my machine.

In particular, I'm running this:

sudo fd --type f --exec-batch clamscan -i --no-summary

As of right now, it seems my only choices are to either have only one core used, or write some weird script that would batch stuff up and dispatch instances of clamscan myself (I'm too lazy for that though lol), or use --exec instead and have all cores fully used at the cost of ~10000x the "loading database" overhead, since there would be one invocation per file.
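For reference, the --exec variant I mean would be something like:

sudo fd --type f --exec clamscan -i --no-summary

which does use all the cores, but starts one clamscan process per file and so pays the database-loading cost every single time.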

I hope I'm not missing something lol

Architector4 commented Jul 31 '25 22:07

That seems reasonable. Although that might require buffering the output in order to avoid confusing console output.

There are also likely cases where the user would want the batches to be run in sequence, but I worry about adding yet another option.

Honestly, for your use case the best way to do it would probably be to have fd output the list to a stream, then have something else handle the batching and parallel clamav calls, so you can control the sizes of the batches, instead of depending on how many fit in a single command line invocation.
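For example (untested, and the batch size and parallelism numbers here are just illustrative), something along the lines of:

fd --type f --print0 | xargs -0 -P 16 -n 1000 clamscan -i --no-summary

where -P controls how many clamscan processes run at once and -n controls how many file names go into each invocation.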

tmccombs commented Aug 01 '25 17:08

Thanks for the response!

> Although that might require buffering the output in order to avoid confusing console output.

> There are also likely cases where the user would want the batches to be run in sequence, but I worry about adding yet another option.

How does normal --exec handle these questions? I feel like the same answers should apply consistently to both. For the latter, I imagine using the value of --threads for the number of parallel batches seems like a good idea, the same as what --exec does. (I assume it does somewhat more complex logic than I imply, but the point stands.)

> instead of depending on how many fit in a single command line invocation.

Honestly, I just want that. Scan all files, but minimize the number of clamscan invocations to reduce the total initialization overhead time lol

Architector4 commented Aug 01 '25 18:08

> How does normal --exec handle these questions?

It saves the output to a buffer, then prints that output in series. For --exec-batch that is less ideal because a single invocation could take a long time, and could produce a lot of output, and you have to wait until the process finishes before printing any output.

> Honestly, I just want that. Scan all files, but minimize the number of clamscan invocations to reduce the total initialization overhead time lol

Well, in this case, you could have a single invocation of clamscan by piping to clamscan - so it reads the list of files from stdin.

tmccombs commented Aug 03 '25 02:08

> Well, in this case, you could have a single invocation of clamscan by piping to clamscan - so it reads the list of files from stdin.

Just checked, that just treats input as file data to scan itself, and quickly finishes with "Scanned files: 1", "Data read: 456.96 MB", and no results lol

> It saves the output to a buffer, then prints that output in series. For --exec-batch that is less ideal because a single invocation could take a long time, and could produce a lot of output, and you have to wait until the process finishes before printing any output.

Personally I feel like that's a lot of "could"s. Though, I understand wanting to protect the user from failing to consider this and running out of RAM.

Might it work to make it so that, when --exec-batch is used and --threads is explicitly provided, only then is parallel execution enabled for --exec-batch? That, and explicitly documenting this behavior and its risks, of course.
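i.e. the idea is that something like

sudo fd --threads 16 --type f --exec-batch clamscan -i --no-summary

would, because --threads is given explicitly, be allowed to run up to 16 batches at once, while the current one-batch-at-a-time behavior stays the default.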

Architector4 commented Aug 03 '25 15:08

> Just checked, that just treats input as file data to scan itself, and quickly finishes

Sorry, it should be clamscan -f -

> Personally I feel like that's a lot of "could"s.

Ok, let me give a couple of more concrete examples.

Suppose you use --exec-batch with grep, or a linter, etc. You would probably like to start getting results back immediately, instead of having to wait until all the files (or a first batch) have been processed.

Or say you use --exec-batch to open all the files in an editor like vim or emacs. Or a program that requires interactive input. Then buffering won't work at all, because the program needs input before it can complete.

tmccombs commented Aug 04 '25 05:08

> Sorry, it should be clamscan -f -

I appreciate the suggestions! I guess this reduces the database-loading overhead to a single occurrence, but then it sounds like it will run into issues if a filename has a newline in it, and it still runs on a single thread. (Also, instead of - you have to use /dev/stdin.)
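So the single-invocation version, with that adjustment, would look something like:

sudo fd --type f | sudo clamscan -i --no-summary -f /dev/stdin

with the caveats above: only one database load in total, but still a single core, and filenames containing newlines would break the list.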

> Ok, let me give a couple of more concrete examples.

I mean, I agree: the single-threaded, non-buffering batching behavior is very good to have and should remain the default. I just think that having the option of buffering and running batches in parallel would also be good to have, for different workloads.

I apologize if I seemed rude; my point with "a lot of "could"s" was that the word admits that workloads unlike the ones you describe (i.e. ones that take little time, produce little output, or where the user is aware of and expecting the buffering behavior) also exist. I'm only proposing supporting them with --exec-batch (or something like it) too.

Architector4 commented Aug 04 '25 18:08