Parallelize the rsync run using multiple threads and/or connections (old Bugzilla bug 5124)
I am not sure it is appropriate to manually refile bugs from the old Bugzilla to the new issue tracker at GitHub, but there is an issue that is largely a show-stopper for me and for many other users, so if possible I'd like to "modernise" the issue by moving it here.
https://bugzilla.samba.org/show_bug.cgi?id=5124
The idea is to use multiple TCP connections to download a single piece of data by splitting it into N pieces and downloading them in parallel.
This used to be less useful when the Internet was young, when multiple connections would only help save on TCP connection-establishment time with many tiny files. Nowadays, however, several incredibly huge ISPs are clamping non-DPI-able connections to 10k/s or so. Rsync, being run through SSH, obviously falls into the "non-DPI-able" category and therefore becomes almost useless unless it can work in parallel and reach Nthreads * 10k/s.
I would, therefore, ask for this feature to be implemented. The algorithm already used by Aria2 (and its --split option) is likely to be a good candidate.
In fact, aria2 can even do downloading through ssh, as documented here: https://github.com/aria2/aria2/issues/453
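For reference, this is roughly the aria2 behaviour being referred to. The URL and segment count below are just placeholders, and the sketch uses an HTTPS source, which is the case --split is best documented for:

```bash
# Ask aria2 to fetch one file in 8 segments over up to 8 parallel
# connections to the same server (hypothetical URL, for illustration only).
aria2c --split=8 --max-connection-per-server=8 \
    "https://example.com/pub/large-file.tar"
```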
I suggest you remove "threads and/or" from the title. It's clear that the original submitter is asking for parallelizing transfers over multiple connections. Whether that happens in multiple threads or not is irrelevant.
This would be extremely useful, as for high-speed links (10/40/100Gbit) it is absolutely necessary in order to get useful transfer speeds. A single-connection transfer quickly becomes CPU-bound. This is especially true if SSH is used as the transfer protocol; in that case this could easily increase transfer speed by an order of magnitude with enough cores.
I've had to work around the lack of this functionality in the past by splitting very large transfers up into batches and running multiple rsync processes, but it's inconvenient manual effort.
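One way to do that batching by hand (a rough sketch; the paths, host and batch count are placeholders, and `split -n` is GNU coreutils syntax) is to build a file list, cut it into pieces, and run one rsync per piece:

```bash
# Build a list of source files, split it into 8 batches, and run one
# rsync per batch in parallel. --files-from makes each rsync transfer
# only its share of the files.
cd /srv/data
find . -type f > /tmp/filelist
split -n l/8 /tmp/filelist /tmp/batch.
for f in /tmp/batch.*; do
  rsync -a --files-from="$f" . backup@example.com:/srv/data/ &
done
wait    # block until all parallel rsyncs have finished
```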
> I am not sure it is appropriate to manually refile bugs from the old Bugzilla to the new issue tracker at GitHub, but there is an issue that is largely a show-stopper for me and for many other users, so if possible I'd like to "modernise" the issue by moving it here.
If we want only one issue I suggest closing the old one and keeping this one. The bugzilla ticket is practically a case study in how not to write a summary for a ticket. Thanks for doing this, it's an important feature and deserves a clear ticket.
> This would be extremely useful, as for high-speed links (10/40/100Gbit)
@dsg22 That will be extremely useful for slow links as well, if you upload files through a lot of hops and some ISP shapes the traffic.
This seems like a huge oversight, and despite searching, I haven't seen any reasoning for why this feature hasn't been implemented. What would be the rationale for a "single stream by design"?
It requires a complete rewrite of the utility using a totally new protocol stream. That is something I've been considering, but it is not simple, nor is it going to happen any time soon.
I would suggest that this could be done without protocol changes:
- an initial connection with a single local->remote instance, which scans the destination filesystem (I believe rsync already does this)
- once complete, the parent rsync could spawn $x sub-rsync processes, which all simply run, but take a filename (or file list) from the parent (eg, socket, or stdin, or some such)
The above could be done in two stages.
- The first would only allow non-destructive operations, that is, simply "move data".
- After the first run, rsync could drop all children and re-scan the filesystem according to any destructive operations. This ensures that --delete and other such operations run in single-threaded mode, once the remote state has been 100% verified.
This is the manual process many of us are following.
It may seem hacky, but really, I think any parallel operation via rsync will require spawning sub-rsync + ssh sessions regardless. So having rsync simply spawn itself with alternate command-line args would work well here.
Even disabling some aspects of rsync for this mode would be well understood, I think. That is, "you cannot use destructive operations when doing parallel transfers", or some such.
So users could run in parallel mode, shove the data across, then re-run in 'destructive operations' mode, eg --delete and other flags of this nature.
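A minimal sketch of that two-pass flow as a wrapper around today's rsync (the host, paths and parallelism level are placeholders):

```bash
# Pass 1: non-destructive parallel copy, one rsync per top-level entry,
# at most 8 at a time, and deliberately without --delete.
find /srv/data -mindepth 1 -maxdepth 1 -print0 |
  xargs -0 -P8 -I{} rsync -a {} backup@example.com:/srv/data/

# Pass 2: a single rsync re-scans everything and applies the destructive
# options (--delete etc.) in one consistent, single-threaded run.
rsync -a --delete /srv/data/ backup@example.com:/srv/data/
```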
I don't know if the above is helpful, but I do know that it would be helpful for my use case.
Thanks!
That is something that someone could write as a wrapping script (and indeed, is something that I have implemented before where my script copied directory trees in order of largest to smallest, keeping a certain number of parallel rsyncs active at the same time). This is not something that rsync itself should do, though, as the extra scan of the source files would be a bad idea -- the user's script should use its own cached knowledge of what to copy in what order and could even update the sizes based on the output of the rsync copies (which is a reasonable starting point for the next copy). An actual parallel rsync would farm out individual files to a parallel set of rsyncs, as that doesn't require any extra scanning or knowledge but does require a new protocol.
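For illustration, a minimal sketch of that kind of wrapper (largest directories first, a fixed number of rsyncs in flight; the host, paths and -P value are assumptions for the example):

```bash
# Size each top-level directory, sort largest first, and keep up to
# 4 rsync processes running at once. du prints "SIZE<TAB>PATH", so
# cut -f2- recovers the path.
find /srv/data -mindepth 1 -maxdepth 1 -type d -exec du -s {} + |
  sort -rn | cut -f2- |
  xargs -P4 -I{} rsync -a {} backup@example.com:/srv/data/
```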