Allow `--exec-batch` to be used in parallel
The --exec-batch parameter runs the specified command multiple times, over the biggest possible batches of file names. However, if a lot of files are found and this parameter is used, the command very easily becomes the bottleneck.
Batches always (appear to) run sequentially, ignoring the --threads parameter. If the command takes a long time (which tends to happen when you throw 10000 filenames at a single program lol), then every invocation just sits there until it is done, maxing out one CPU core while the other 15 cores and fd itself lie dormant.
It'd be nice to be able to get the parallelism benefits of --exec with the batching benefits of --exec-batch.
My particular use case right now is that I'm bored and I want to scan a lot of files with ClamAV. Its clamscan tool appears to run through files completely sequentially too, running on one CPU, and also always takes a few seconds on startup just to load its databases. Upon running, it seems to consume at most ~1.5GiB of RAM. I have 64GiB RAM and 16 cores, so it seems to be well within budget to run it in parallel on my machine.
In particular, I'm running this:
sudo fd --type f --exec-batch clamscan -i --no-summary
As of right now, it seems my only choices are to either have only one core used, write some weird script that batches files up and dispatches instances of clamscan myself (I'm too lazy for that though lol), or use --exec instead and have all cores fully used at the cost of ~10000x the "loading database" overhead, since there would be one invocation per file.
I hope I'm not missing something lol
That seems reasonable. Although that might require buffering the output in order to avoid confusing console output.
There are also likely cases where the user would want the batches to be run in sequence, but I worry about adding yet another option.
Honestly, for your use case the best way to do it would probably be to have fd output the list to a stream, then have something else handle the batching and parallel clamav calls, so you can control the sizes of the batches, instead of depending on how many fit in a single command line invocation.
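Something along these lines should work (rough sketch, untested; assumes GNU xargs with -P, and uses NUL-separated paths so odd filenames don't break the batching; add sudo where needed):

fd --type f -0 | xargs -0 -P 16 -n 1000 clamscan -i --no-summary

Here -P 16 caps the number of concurrent clamscan processes and -n 1000 caps the batch size, so you can tune both independently of the command-line length limit.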
Thanks for the response!
Although that might require buffering the output in order to avoid confusing console output.
There are also likely cases where the user would want the batches to be run in sequence, but I worry about adding yet another option.
How does normal --exec handle these questions? I feel like the same answers should apply consistently to both. For the latter, using the value of --threads seems like a good idea to me, the same way --exec does. (I assume --exec does somewhat more complex logic than what I imply, but the point stands.)
instead of depending on how many fit in a single command line invocation.
Honestly, I just want that. Scan all files, but minimize the number of clamscan invocations to reduce the total initialization overhead lol
How does normal --exec handle these questions?
It saves the output to a buffer, then prints that output in series. For --exec-batch that is less ideal because a single invocation could take a long time, and could produce a lot of output, and you have to wait until the process finishes before printing any output.
Honestly, I just want that. Scan all files, but minimize the number of clamscan invocations to reduce the total initialization overhead lol
Well, in this case, you could have a single invocation of clamscan by piping to clamscan - so it reads the list of files from stdin.
Well, in this case, you could have a single invocation of clamscan by piping to clamscan - so it reads the list of files from stdin.
Just checked, that just treats input as file data to scan itself, and quickly finishes with "Scanned files: 1", "Data read: 456.96 MB", and no results lol
It saves the output to a buffer, then prints that output in series. For --exec-batch that is less ideal because a single invocation could take a long time, and could produce a lot of output, and you have to wait until the process finishes before printing any output.
Personally I feel like that's a lot of "could"s. Though, I understand wanting to protect the user from failing to consider this and running out of RAM.
Might it work to make it so that threading is enabled for --exec-batch only when --threads is explicitly provided? That, plus explicitly documenting this behavior and its risks, of course.
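Just to illustrate what I mean, under the proposed behavior (hypothetical semantics; the flags themselves already exist) an invocation like this would split the batches across up to 16 parallel clamscan processes:

sudo fd --type f --threads 16 --exec-batch clamscan -i --no-summary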
Just checked, that just treats input as file data to scan itself, and quickly finishes
Sorry, it should be clamscan -f -
Personally I feel like that's a lot of "could"s.
Ok, let me give a couple of more concrete examples.
Suppose you use exec-batch with grep, a linter, etc. You would probably like to start getting results back immediately, instead of having to wait until all the files (or at least the first batch) have been processed.
Or say you use --exec-batch to open all the files in an editor like vim or emacs. Or a program that requires interactive input. Then buffering won't work at all, because the program needs input before it can complete.
Sorry, it should be clamscan -f -
I appreciate the suggestions! I guess this reduces the overhead of loading the database to just one occurrence, but it sounds like it will run into issues if a filename has a newline in it, and it runs on a single thread. (Also, instead of - you have to use /dev/stdin.)
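For reference, the single-invocation pipeline I tested ends up looking roughly like this (one clamscan process, newline-delimited file list, so a filename containing a newline would break it):

sudo fd --type f | sudo clamscan -i --no-summary -f /dev/stdin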
Ok, let me give a couple of more concrete examples.
I mean, I agree, the single-threaded, non-buffering batching behavior is very good to have and should stay the default. I just think that having the option to buffer and run batches in parallel would also be good to have, for different workloads.
I apologize if I seemed rude; my point with "a lot of "could"s" was that the word admits that workloads other than the ones in your suggestions (i.e. ones that take little time, produce little output, or where the user is aware of and expects the buffering behavior) also exist. I just propose that --exec-batch (or something like it) support them too.