Concurrent disk I/O

CAFxX opened this issue Feb 14 '23

Commands that perform a lot of disk I/O (cp, mv, rm, etc.), especially when operating on SSDs[^1] but potentially also on slower disks[^2], are likely to complete much faster if they issue concurrent I/O requests (ideally asynchronously and in bulk[^3], but issuing them synchronously from a thread pool could also work).
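
To make the thread-pool variant concrete, below is a minimal std-only Rust sketch (the function name, worker count, and shared-queue shape are purely illustrative, not a proposed design; a real implementation would presumably reuse an existing pool, or io_uring where available):

```rust
use std::fs;
use std::io;
use std::path::PathBuf;
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Hypothetical helper: delete `paths` using `workers` threads so that many
// unlink() calls are in flight at once, keeping the device queue depth high.
fn remove_concurrently(paths: Vec<PathBuf>, workers: usize) -> Vec<io::Result<()>> {
    let (tx, rx) = mpsc::channel();
    // Shared work queue; each item carries its original index so that
    // results can be reported in the same order as a sequential rm.
    let queue = Arc::new(Mutex::new(paths.into_iter().enumerate()));
    let mut handles = Vec::with_capacity(workers);
    for _ in 0..workers {
        let (tx, queue) = (tx.clone(), Arc::clone(&queue));
        handles.push(thread::spawn(move || loop {
            // Take the next path; the lock is released before the I/O call.
            let next = queue.lock().unwrap().next();
            match next {
                Some((idx, path)) => {
                    let _ = tx.send((idx, fs::remove_file(&path)));
                }
                None => break,
            }
        }));
    }
    drop(tx); // close the channel once only the workers hold senders
    let mut results: Vec<(usize, io::Result<()>)> = rx.into_iter().collect();
    for h in handles {
        let _ = h.join();
    }
    results.sort_by_key(|(idx, _)| *idx); // restore input order for reporting
    results.into_iter().map(|(_, res)| res).collect()
}
```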

Examples of cases where this could be very beneficial include:

  • cp -r <source> <dest> where <source> contains a large number of files/subdirectories
  • cp <sources> <dest> where <sources> is a large number of files
  • rm -r <source> where <source> contains a large number of files/subdirectories
  • rm <sources> where <sources> is a large number of files
  • mv <sources> <dest> where <sources> is a large number of files
  • ln <sources> <dest> where <sources> is a large number of files
  • ls <path> and ls -r <path> where <path> contains a large number of files/subdirectories
  • du <path> where <path> contains a large number of files/subdirectories

(And so on: the list above is not exhaustive; stat, readlink, touch, chown, chgrp, chmod, etc. would benefit as well.)

As for whether this behavior should be opt-out/default-on or opt-in/default-off, I lean towards opt-out, but I must admit I am not sure. With ever-increasing core counts and ever-higher-capacity disks, I would argue there is a pretty strong case for the behavior to be opt-out rather than opt-in. At the same time, there may be concerns about these tools suddenly generating large I/O spikes, in which case it should be opt-in. In all cases, the proposed behavior should never affect the correctness of the command or its semantics (e.g. ordering of results, behavior on error, etc.).

Currently, the closest that can be achieved with standard tools is running processes in parallel with something like xargs -P<n>, but this is somewhat easy to get wrong (e.g. forgetting to use -0, having to choose between -I{} and normal mode based on how many files are being operated on and how long their names are, etc.) and it does not, in general, maintain the standard semantics of the tools (e.g. ordering needs to be handled separately, error handling is left to the user, etc.).

[^1]: as they tend to deliver higher throughput when operating with high queue depths
[^2]: as the OS can potentially coalesce a greater number of logical writes into a smaller number of disk write operations
[^3]: especially if via io_uring or similar mechanisms that can guarantee that in practice file operations never block

CAFxX commented Feb 14 '23 01:02

Interesting ideas! This should be something that Rust excels at over C too, with more thread safety and convenience. There are two things I think we should keep in mind here.

  1. I think this should not complicate or bloat the code too much.
  2. Part of compatibility with GNU is performing operations in the same order. For instance, if you do 2 operations in order and you exit if the first one fails, the second operation never gets executed. If you make those two operations concurrent, they will both have executed. In cases where this is an issue, we might need some flag that specifies that the order is unspecified, although results could still be reported in the original order, as in the sketch below.
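
For illustration, a minimal std-only sketch of that idea (the function name and the one-thread-per-path shape are hypothetical; a real implementation would use a bounded pool):

```rust
use std::fs;
use std::io;
use std::path::Path;

// Execute removals concurrently, but *report* results in input order and
// stop at the first error, like the sequential version. Note: later
// operations may already have run by the time the first error is reported;
// that residual semantic difference is what a flag would have to cover.
fn remove_all_report_in_order(paths: &[&Path]) -> io::Result<()> {
    let results: Vec<io::Result<()>> = std::thread::scope(|s| {
        let handles: Vec<_> = paths
            .iter()
            .map(|p| s.spawn(move || fs::remove_file(p)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    for r in results {
        r?; // first failure wins, in the order GNU would have reported it
    }
    Ok(())
}
```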

tertsdiepraam commented Feb 14 '23 23:02

@thecotne tried this with findutils (https://github.com/uutils/findutils/pull/142) and it didn't provide much ROI.

sylvestre commented Feb 15 '23 13:02

@sylvestre I can acknowledge that ROI will depend on the OS, the filesystem in use[^1], the underlying hardware, the implementation[^2], the operation(s) being performed, the configuration in use, whether the data is already in the page cache, and so on... but I have definitely seen it improve performance multiple times in the past.

Case in point: I just ran this very synthetic benchmark on my laptop (a recent-ish MacBook), and even my utterly unrefined bash "implementation" improves throughput by 1.5x~2x:

$ N=1000000
$ P=100

$ time (seq 1 $N | xargs touch)
real	2m17.307s
user	0m2.254s
sys	1m31.966s
$ ls -1 | wc -l
 1000000

$ time (find . -type f | xargs rm)
real	4m28.594s
user	0m2.096s
sys	2m3.315s
$ ls -1 | wc -l
       0

$ time (seq 1 $N | xargs -P$P touch)
real	2m2.976s
user	0m4.890s
sys	9m55.269s
$ ls -1 | wc -l
 1000000

$ time (find . -type f | xargs -P$P rm)
real	3m9.421s
user	0m3.746s
sys	5m55.275s
$ ls -1 | wc -l
       0

$ N=100000
$ P=100

$ time (seq 1 $N | xargs touch)
real	0m9.580s
user	0m0.200s
sys	0m8.054s
$ ls -1 | wc -l
  100000

$ time (find . -type f | xargs rm)
real	0m18.979s
user	0m0.192s
sys	0m9.386s
$ ls -1 | wc -l
       0

$ time (seq 1 $N | xargs -P$P touch)
real	0m6.558s
user	0m0.340s
sys	0m31.672s
$ ls -1 | wc -l
  100000

$ time (find . -type f | xargs -P$P rm)
real	0m15.399s
user	0m0.274s
sys	0m12.768s
$ ls -1 | wc -l
       0

$ N=10000
$ P=100

$ time (seq 1 $N | xargs -I{} touch {})
real	0m31.932s
user	0m4.369s
sys	0m14.138s
$ ls -1 | wc -l
   10000

$ time (find . -type f | xargs -I{} rm {})
real	0m37.942s
user	0m5.437s
sys	0m16.991s
$ ls -1 | wc -l
       0

$ time (seq 1 $N | xargs -P$P -I{} touch {})
real	0m13.282s
user	0m4.663s
sys	0m21.291s
$ ls -1 | wc -l
   10000

$ time (find . -type f | xargs -P$P -I{} rm {})
real	0m14.835s
user	0m6.580s
sys	0m28.038s
$ ls -1 | wc -l
       0

I don't have access to it anymore, so I can't run experiments, but in the past being able to run large deletes concurrently on a gluster volume cut down wall time from many hours to a few tens of minutes.

[^1]: Some filesystems are known to have not-so-great performance behavior under high concurrency, although one may argue this is partly because existing tools are not great at generating concurrent I/O (case in point) and partly because in the past there was definitely less headroom available to run I/O concurrently.
[^2]: Probably worth pointing out that, IIUC, in https://github.com/uutils/findutils/pull/142 the loop in process_dir does not actually seem to be executed in parallel: could this at least partially explain why that PR did not yield significant improvements? Furthermore, find is a read-only operation, whereas most of the ones listed above are read-write.

CAFxX commented Feb 15 '23 22:02

Are there any notes from your past trials that include hardware specifications? Concurrent I/O will perform differently on different hardware: on some products concurrent processing will provide performance improvements, but on others it won't. It is better to think about processing-side concurrency and hardware-side concurrency (controller, physical medium type) together.

raxetul commented Mar 17 '23 08:03

The latest benchmark was run on the hardware described above, and can be trivially replicated on any hardware you happen to have available. Doing so will likely show some speedup, though its magnitude may differ according to all the factors listed in my previous message.

The old one (deletion of many millions of files) was IIRC on a 15-node glusterfs cluster running on top of a large VMware farm (hundreds of servers) attached to a large SAN. The exact hardware details are lost to time (this was ~10 years ago) but, I would argue, are ultimately irrelevant. My point wasn't that it was beneficial in that specific case, but rather that I've seen it reduce wall time in multiple scenarios, over multiple years, on everything from large networked file systems to Linux servers to modern laptops (as evidenced by the latest benchmark).

I agree, though, that there may be cases in which doing I/O concurrently will not provide much benefit. Crucially, this is, at least on my part, just a hunch without evidence (contrary to the opposite case), while also being the bulk of the argument for why this functionality ought to be at least opt-in or opt-out. Just FTR, being a hunch without proof is not for lack of trying: I ran the newest benchmark on a couple of laptops I have (in addition to the Mac above, a Dell 9360 with W11/WSL2) and on a throwaway GCP VM with a persistent SSD volume, and in all cases wall time improved. At this point I would be extremely curious about any concrete counterexample where wall time becomes worse (apart from that findutils experiment, which looks, at least to my eye, somewhat flawed, as outlined in my previous message) to drive the conversation forward.

CAFxX commented Mar 17 '23 15:03

I started collecting/summarizing the available experimental evidence, either in favor or against this proposal, in the first post of this issue. If additional evidence comes up please ping me and I'll add it as well.

CAFxX commented Mar 26 '23 00:03

Yeah, I would be happy to use more parallelism in this project :)

sylvestre commented Mar 26 '23 06:03