Control output order
I was looking for an alternative to gfind to walk a large tree, and I love the performance of fd.
For repeated searches, there is value in having a consistent output order. I can pipe the output of fd to gsort to achieve this, but then I need to wait for the entire traversal to finish before the first item is printed or can be processed by subsequent steps in the pipeline.
I'd like fd -O to emit the search results ordered according to the current locale, while still emitting output as it is produced. A simple implementation could look like this:
- All jobs write their output to a local priority queue, sorted by locale
- Repeat until all jobs have terminated:
  - Wait until every job either has at least one item in its PQ or has terminated
  - Find the PQ with the globally smallest item
  - Pop and print the item from that PQ
This assumes that, once an item is in the PQ, no smaller item is going to be added there. This might require finer control at the job level.
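The steps above could be sketched roughly like this. Byte-wise string ordering stands in for locale collation, and the per-job queues are pre-filled rather than concurrently fed, so the "no smaller item arrives later" assumption holds trivially here; a real implementation would need the job-level coordination mentioned above.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Each job owns a min-heap of pending results. The printer repeatedly
// pops from whichever heap currently holds the globally smallest item.
fn drain_in_order(mut queues: Vec<BinaryHeap<Reverse<String>>>) -> Vec<String> {
    let mut out = Vec::new();
    loop {
        // Find the queue whose head is the globally smallest item.
        let best = queues
            .iter()
            .enumerate()
            .filter_map(|(i, q)| q.peek().map(|r| (i, r.0.clone())))
            .min_by(|a, b| a.1.cmp(&b.1))
            .map(|(i, _)| i);
        match best {
            Some(i) => out.push(queues[i].pop().unwrap().0),
            None => break, // every queue is empty: all jobs have terminated
        }
    }
    out
}

fn main() {
    // Two "jobs" whose pending output is locally ordered.
    let q1: BinaryHeap<Reverse<String>> =
        ["a/1", "c/3"].iter().map(|s| Reverse(s.to_string())).collect();
    let q2: BinaryHeap<Reverse<String>> =
        ["b/2", "d/4"].iter().map(|s| Reverse(s.to_string())).collect();
    let merged = drain_in_order(vec![q1, q2]);
    assert_eq!(merged, vec!["a/1", "b/2", "c/3", "d/4"]);
    println!("{merged:?}");
}
```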
GNU parallel has this on by default. How do they do it?
Unlike #1305, this is not asking to control the order of the traversal, just the output order.
Question: do you need the whole output to be sorted? Or do you just need some deterministic order that stays the same between multiple runs? Because the latter can be done way more efficiently (just sort the entries within each directory)
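For illustration, the cheaper "deterministic but not globally sorted" option could look like this sketch: sort the entries within each directory before recursing, so the output order is stable across runs without buffering the whole traversal. The demo paths are made up.

```rust
use std::fs;
use std::path::Path;

// Depth-first walk that sorts each directory's entries by name before
// recursing. Not locale-aware, but deterministic between runs.
fn walk_sorted(dir: &Path, out: &mut Vec<String>) -> std::io::Result<()> {
    let mut entries: Vec<fs::DirEntry> = fs::read_dir(dir)?.collect::<Result<_, _>>()?;
    entries.sort_by_key(|e| e.file_name()); // per-directory sort only
    for entry in entries {
        out.push(entry.path().display().to_string());
        if entry.file_type()?.is_dir() {
            walk_sorted(&entry.path(), out)?;
        }
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let root = std::env::temp_dir().join("fd_walk_sorted_demo");
    let _ = fs::remove_dir_all(&root);
    fs::create_dir_all(root.join("b"))?;
    fs::write(root.join("a.txt"), "")?;
    fs::write(root.join("b").join("c.txt"), "")?;

    let mut out = Vec::new();
    walk_sorted(&root, &mut out)?;
    assert!(out[0].ends_with("a.txt")); // "a.txt" sorts before "b"
    assert!(out[2].ends_with("c.txt")); // children follow their parent
    fs::remove_dir_all(&root)?;
    Ok(())
}
```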
Any deterministic order is better than randomness. Sorted output seems easier to grasp and feels more consistent with ls, though.
From your question, I'm guessing that fd implements a producer-consumer pattern where directories are produced (starting with the search root(s)) and each job produces more directories and also emits output. If this is not too far from the truth, we might achieve true sorting by adding dependencies between jobs that must be satisfied before output is generated.
I think it would be possible to do something like that, but it would add a lot of complexity, and doing it unconditionally could hurt performance in cases where you don't care about order. I'm also not sure how to implement it in a way that doesn't impact the unordered case without having completely different flows for ordered vs unordered output (unless we buffered all results before printing).
This is also made more difficult by the design of the ignore crate we use. It doesn't currently have any way to know when all children of a directory have been processed[1], and indeed since the children can in turn be processed on multiple threads, implementing that would be somewhat difficult (but not impossible).
Some other notes:
Currently, we use a channel to send items from the worker threads to the output thread. So switching to a priority queue would probably introduce more lock contention. Possibly we could use some solution that involves a separate channel for each directory, but I'm not sure quite how that would work.
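A rough model of that design, as described above: worker threads send results over a single channel to one output thread, which processes them in arrival order, which is why the order varies between runs. The worker messages here are invented.

```rust
use std::sync::mpsc;
use std::thread;

// Workers send over one shared channel; the receiver drains in arrival
// order until every sender has been dropped.
fn collect_results(workers: usize) -> Vec<String> {
    let (tx, rx) = mpsc::channel::<String>();
    let handles: Vec<_> = (0..workers)
        .map(|w| {
            let tx = tx.clone();
            thread::spawn(move || tx.send(format!("result from worker {w}")).unwrap())
        })
        .collect();
    drop(tx); // close the original sender so the receiver loop terminates

    // The "output thread" role: drain everything.
    let received: Vec<String> = rx.into_iter().collect();
    for h in handles {
        h.join().unwrap();
    }
    received
}

fn main() {
    let received = collect_results(4);
    assert_eq!(received.len(), 4); // arrival order varies run to run
    for r in &received {
        println!("{r}");
    }
}
```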
As currently implemented, a "job" is just a single directory entry. If that entry is a directory, additional jobs are created, and work stealing is used to distribute the jobs across threads. This means that "wait for all jobs" is the same as waiting for everything.
[1]: I have considered possibly forking ignore, but don't currently have the time and motivation to put into that.
Thank you for the thoughtful response.
My PQ idea was naive. If there is already a single output thread, things look simpler to me. I'm sure I'm overlooking details.
Suppose all workers send items to the output thread and each item is a directory entry to be printed. What if each item had dependencies -- jobs that must be done before the item can be printed?
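A toy version of this dependency idea might look like the sketch below. Items and job names ("scan:dir/a") are invented for illustration; an item becomes printable only once its dependency list is empty.

```rust
use std::collections::HashMap;

// Return the items whose dependency lists are empty, i.e. printable now.
// Sorted so the ready items come out in a deterministic order.
fn printable<'a>(pending: &HashMap<&'a str, Vec<&'a str>>) -> Vec<&'a str> {
    let mut ready: Vec<&str> = pending
        .iter()
        .filter(|(_, deps)| deps.is_empty())
        .map(|(item, _)| *item)
        .collect();
    ready.sort();
    ready
}

fn main() {
    let mut pending: HashMap<&str, Vec<&str>> = HashMap::new();
    pending.insert("dir/a", vec![]);             // no dependencies: printable now
    pending.insert("dir/b", vec!["scan:dir/a"]); // gated on a job finishing

    assert_eq!(printable(&pending), vec!["dir/a"]);
    pending.remove("dir/a"); // pretend it was printed

    // Simulate the job finishing: drop it from every dependency list.
    for deps in pending.values_mut() {
        deps.retain(|d| *d != "scan:dir/a");
    }
    assert_eq!(printable(&pending), vec!["dir/b"]);
    println!("ok");
}
```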
Would it help to discuss based on a simplified implementation? We could assess the overhead of sorting in terms of both complexity and time cost.
I would love an optional "complete and sort by parameter" mode, where fd buffers the found results, sorts them by name/size/extension/..., and prints only afterwards. I love fd's performance and wouldn't want to jeopardize it, but I suspect doing this as a last step before printing (i.e. sending to a buffer rather than to output) and sorting once complete might bypass any performance constraints. Note that printing and buffering are fairly similar performance-wise, because both contend for a shared resource (stdout vs the buffer). Plus, it should be possible to buffer the output of each thread independently, since it is no longer necessary to print things as soon as they are found.
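A hedged sketch of that mode, sorting by size: buffer (size, path) pairs instead of printing, then sort once the traversal completes. The demo directory and file names are invented.

```rust
use std::fs;

// Sort buffered results; tuples compare by size first, then by path.
fn sort_by_size(mut results: Vec<(u64, String)>) -> Vec<(u64, String)> {
    results.sort();
    results
}

fn main() -> std::io::Result<()> {
    let root = std::env::temp_dir().join("fd_sort_by_size_demo");
    let _ = fs::remove_dir_all(&root);
    fs::create_dir_all(&root)?;
    fs::write(root.join("big.txt"), vec![0u8; 100])?;
    fs::write(root.join("small.txt"), vec![0u8; 10])?;

    // Buffer results instead of printing them as they are found.
    let buffered: Vec<(u64, String)> = fs::read_dir(&root)?
        .filter_map(Result::ok)
        .filter_map(|e| Some((e.metadata().ok()?.len(), e.path().display().to_string())))
        .collect();

    // Sort once, after the traversal is complete, then print.
    let sorted = sort_by_size(buffered);
    assert!(sorted[0].1.ends_with("small.txt"));
    for (size, path) in &sorted {
        println!("{size}\t{path}");
    }
    fs::remove_dir_all(&root)?;
    Ok(())
}
```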
As a workaround, for now I have to do the following to sort by file size. Note that I cannot even use fd's --list-details, because it shows sizes in human-readable form (e.g. 15K), which sort doesn't understand.
fd ...search...params... -x ls -al | sort -k5,5n