coreutils
Concurrent disk I/O
Commands that perform a lot of disk I/O (`cp`, `mv`, `rm`, etc.), especially when operating on SSDs[^1] but potentially also on slower disks[^2], are likely to complete much faster if they issue concurrent I/O requests (ideally asynchronously and in bulk[^3], but synchronously via a thread pool could also work).
Examples of cases where this could be very beneficial include:

- `cp -r <source> <dest>` where `<source>` contains a large number of files/subdirectories
- `cp <sources> <dest>` where `<sources>` is a large number of files
- `rm -r <source>` where `<source>` contains a large number of files/subdirectories
- `rm <sources>` where `<sources>` is a large number of files
- `mv <sources> <dest>` where `<sources>` is a large number of files
- `ln <sources> <dest>` where `<sources>` is a large number of files
- `ls <path>` and `ls -r <path>` where `<path>` contains a large number of files/subdirectories
- `du <path>` where `<path>` contains a large number of files/subdirectories
(and so on, the list above is not exhaustive: further examples include `stat`, `readlink`, `touch`, `chown`, `chgrp`, `chmod`, etc.)
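As a rough illustration of the thread-pool variant described above, here is a minimal shell sketch (the directory, file count, and worker count are made up for the example, and it assumes GNU `split` for the `-n l/` chunking): the file list is partitioned into chunks and each chunk is deleted by a concurrent background job.

```shell
# Sketch only: emulate a fixed-size "worker pool" for rm in plain shell by
# splitting the file list into chunks and deleting each chunk in a background
# job. Paths and the worker count are illustrative; assumes GNU split (-n l/).
dir=$(mktemp -d)
for i in $(seq 1 200); do : > "$dir/f$i"; done

workers=4
list=$(mktemp)
find "$dir" -type f > "$list"
split -n "l/$workers" "$list" "$list.chunk."
for chunk in "$list.chunk."*; do
    xargs rm -- < "$chunk" &    # one concurrent deleter per chunk
done
wait                            # join all workers
rm -f "$list" "$list.chunk."*
remaining=$(find "$dir" -type f | wc -l)
echo "files left: $remaining"
```

A real implementation inside the tools themselves would of course batch work far more efficiently than spawning processes, but the shape of the approach is the same.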
As for whether this behavior should be opt-out/default-on or opt-in/default-off, I lean towards opt-out, but I must admit I am not sure. With ever-increasing core counts and ever-higher-capacity disks, I would argue there is a pretty strong case for the behavior to be opt-out rather than opt-in. At the same time, there may be concerns about these tools suddenly generating large I/O spikes, in which case it should be opt-in. In all cases, the proposed behavior should never affect the correctness of the command, or its semantics (e.g. ordering of results, behavior on error, etc.).
Currently, the closest that can be achieved with standard tools is running processes in parallel by using something like `xargs -P<n>`, but this is somewhat easy to get wrong (e.g. forgetting to use `-0`, dynamically choosing between `-I{}` and normal mode based on how many files we need to operate on and how long their names are, etc.) and it does not, in general, maintain the standard semantics of the tools (e.g. ordering needs to be done separately, handling errors is left to the user, etc.)
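For reference, the `xargs` pattern that avoids the pitfalls above looks something like this (the directory, file names, and parallelism level are made up for the example): NUL-delimited names make paths containing whitespace safe, and `--` guards against names that look like options.

```shell
# Sketch: parallel delete that is safe for arbitrary file names.
# -print0 / -0 pass NUL-separated names, so spaces and newlines are fine;
# -P 8 runs up to 8 rm processes at once; -- ends option parsing.
dir=$(mktemp -d)
touch "$dir/plain" "$dir/has space" "$dir/-looks-like-an-option"
find "$dir" -type f -print0 | xargs -0 -P 8 rm --
left=$(find "$dir" -type f | wc -l)
echo "files left: $left"
```

Even done correctly, this only parallelizes at process granularity and still leaves ordering and error handling to the user, which is the point being made here.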
[^1]: as they tend to deliver higher throughput when operating with high queue depths

[^2]: as the OS can potentially coalesce a greater number of logical writes into a smaller number of disk write operations

[^3]: especially if via io_uring or similar mechanisms that can guarantee that in practice file operations never block
Interesting ideas! This should be something that Rust excels at over C too, with more thread safety and convenience. There are two things I think we should keep in mind here.
- I think this should not complicate or bloat the code too much.
- Part of compatibility with GNU is doing operations in the same order. For instance, if you do 2 operations in order and you exit if the first one fails, the second operation never gets executed. If you make those two operations concurrent, they will both have executed. In cases where this is an issue, we might need some flag or something that specifies that the order is random.
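The ordering concern is easy to demonstrate with `xargs` itself (the numbers are arbitrary): a sequential run emits results in input order, whereas with `-P` only the set of results is guaranteed, not their order.

```shell
# Sequential: one echo at a time, so output order matches input order.
seq_out=$(seq 1 50 | xargs -n 1 echo)
# Parallel: up to 8 echos at once; completion order is up to the scheduler,
# so lines may arrive out of order even though all of them appear.
par_out=$(seq 1 50 | xargs -n 1 -P 8 echo)
expected=$(seq 1 50)
[ "$seq_out" = "$expected" ] && echo "sequential order preserved"
# Sorting the parallel output recovers the full set, order aside.
[ "$(printf '%s\n' "$par_out" | sort -n)" = "$expected" ] && echo "parallel set complete"
```

Any built-in concurrency would need to buffer and reorder results internally (or be gated behind a flag) to preserve the GNU-compatible behavior.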
@thecotne tried with findutils: https://github.com/uutils/findutils/pull/142 and it didn't provide much ROI
@sylvestre I can acknowledge ROI will depend on the OS, the filesystem in use[^1], the underlying hardware, the implementation[^2], the operation(s) being performed, the configuration in use, whether the data is already in the page cache or not, and so on... but I have definitely seen it improve performance multiple times in the past.
Case in point, I just ran this very synthetic benchmark on my laptop (a recent-ish macbook) and even my utterly unrefined bash "implementation" does improve throughput by 1.5x~2x:
$ N=1000000
$ P=100
$ time (seq 1 $N | xargs touch)
real 2m17.307s
user 0m2.254s
sys 1m31.966s
$ ls -1 | wc -l
1000000
$ time (find . -type f | xargs rm)
real 4m28.594s
user 0m2.096s
sys 2m3.315s
$ ls -1 | wc -l
0
$ time (seq 1 $N | xargs -P$P touch)
real 2m2.976s
user 0m4.890s
sys 9m55.269s
$ ls -1 | wc -l
1000000
$ time (find . -type f | xargs -P$P rm)
real 3m9.421s
user 0m3.746s
sys 5m55.275s
$ ls -1 | wc -l
0
$ N=100000
$ P=100
$ time (seq 1 $N | xargs touch)
real 0m9.580s
user 0m0.200s
sys 0m8.054s
$ ls -1 | wc -l
100000
$ time (find . -type f | xargs rm)
real 0m18.979s
user 0m0.192s
sys 0m9.386s
$ ls -1 | wc -l
0
$ time (seq 1 $N | xargs -P$P touch)
real 0m6.558s
user 0m0.340s
sys 0m31.672s
$ ls -1 | wc -l
100000
$ time (find . -type f | xargs -P$P rm)
real 0m15.399s
user 0m0.274s
sys 0m12.768s
$ ls -1 | wc -l
0
$ N=10000
$ P=100
$ time (seq 1 $N | xargs -I{} touch {})
real 0m31.932s
user 0m4.369s
sys 0m14.138s
$ ls -1 | wc -l
10000
$ time (find . -type f | xargs -I{} rm {})
real 0m37.942s
user 0m5.437s
sys 0m16.991s
$ ls -1 | wc -l
0
$ time (seq 1 $N | xargs -P$P -I{} touch {})
real 0m13.282s
user 0m4.663s
sys 0m21.291s
$ ls -1 | wc -l
10000
$ time (find . -type f | xargs -P$P -I{} rm {})
real 0m14.835s
user 0m6.580s
sys 0m28.038s
$ ls -1 | wc -l
0
I don't have access to it anymore, so I can't run experiments, but in the past being able to run large deletes concurrently on a gluster volume cut down wall time from many hours to a few tens of minutes.
[^1]: Some filesystems are known to have not-so-great performance behavior under high concurrency, although one may argue it is also because existing tools are not great at generating concurrent I/O (case in point) and also because in the past there was definitely lower headroom available to run I/O concurrently.
[^2]: Probably worth pointing out that IIUC in https://github.com/uutils/findutils/pull/142 the loop in `process_dir` seems to not actually be executed in parallel: could this at least partially explain why that PR did not yield significant improvements? Furthermore, `find` is a read-only operation, whereas most of the ones listed above are read-write.
Are there any notes from your past trials that include hardware specifications? Concurrent I/O will perform differently on different hardware. Yes, concurrent processing will provide some performance improvements, but on some hardware it won't. It is better to consider concurrency on the processing side and on the hardware (controller, physical media type) side together.
The latest benchmark was run on the hardware described above, and can be trivially replicated on any hardware you happen to have available. Doing so should likely result in some speedup, but its magnitude may differ according to all factors listed in my previous message.
The old one (deletion of many millions of files) was IIRC on a 15-node GlusterFS cluster running on top of a large VMware farm (hundreds of servers) attached to a large SAN. The exact hardware details are lost to time (this was ~10 years ago) but, I would argue, are ultimately irrelevant. My point wasn't that it was beneficial in that one specific case, but rather that I've seen it reduce wall time in multiple scenarios, over multiple years, ranging from large networked file systems, to Linux servers, to modern laptops (as evidenced by the latest benchmark).
I agree though that there may be cases in which doing I/O concurrently will not provide much benefit. Crucially though, this is (at least on my part) just a hunch without evidence, contrary to the opposite case, while also being the bulk of the argument for making this functionality opt-in rather than opt-out. Just FTR, being a hunch without proof is not for lack of trying: I tried running that newest benchmark on a couple of laptops I have (in addition to the Mac above, a Dell 9360 with W11/WSL2) and a throwaway GCP VM with a persistent SSD volume: in all cases wall time improved. At this point I would be extremely curious about any concrete counterexample where wall time becomes worse (apart from that findutils experiment, which looks at least to my eye somewhat flawed, as outlined in my previous message) to drive the conversation forward.
I started collecting/summarizing the available experimental evidence, either in favor or against this proposal, in the first post of this issue. If additional evidence comes up please ping me and I'll add it as well.
Yeah, I would be happy to use more parallelism in this project :)