Increase copy speed by re-enabling buffering for multithreaded copies from local filesystems
What is the purpose of this change?
Make multithreaded copies from local to S3 fast again (by re-enabling buffering)!
Between v1.64.2 and v1.65+, the performance of multithreaded copies from local to S3 decreased. After some investigation and debugging, it seems to be a side effect of this commit / issue https://github.com/rclone/rclone/issues/7350 - more specifically, of disabling buffering when copying from local filesystems.
I ran some tests on an m6a.2xlarge AWS EC2 instance (network up to 12.5 Gbps and EBS up to 10 Gbps) and here are the results:
Copy of a 64 GB file from the local filesystem to S3
time ./rclone copyto --s3-no-check-bucket --ignore-checksum --s3-disable-checksum --progress --s3-upload-cutoff=0 --multi-thread-cutoff=256M --multi-thread-streams 20 --disable=copy --no-check-dest <Local File> <S3 Bucket>
| Buffering | Time | Avg Speed |
|---|---|---|
| disabled | 18m 46s | ~58 MB/s |
| enabled | 9m 10s | ~121 MB/s |
I'm not completely sure about the memory consumption implications, but if buffering cannot be enabled by default, could we consider making it configurable?
What do you think? Looking forward to some input and feedback!
Was the change discussed in an issue or in the forum before?
No
Checklist
- [x] I have read the contribution guidelines.
- [x] I have added tests for all changes in this PR if appropriate.
- [x] I have added documentation for the changes if appropriate.
- [x] All commit messages are in house style.
- [x] I'm done, this Pull Request is ready for review :-)
Hmm, interesting. Fundamentally the disk should read at the same speed into the s3 multipart buffer (how it is at the moment) or into a memory buffer (like it used to be). Given that disk read speeds > network speeds why is this making a difference?
My guess is that it is because the s3 backend reads each block 3 times (once to MD5 it, once to sign it and once to send it). Before it read once off disk and twice out of RAM. Now it is reading 3 times off disk.
The OS should have cached the 2nd and 3rd reads though, but we may well have disabled that with fadvise.
Try this patch and see if it makes a difference.
diff --git a/backend/local/local.go b/backend/local/local.go
index 14effd2a9..adc4568c3 100644
--- a/backend/local/local.go
+++ b/backend/local/local.go
@@ -1350,7 +1350,7 @@ func (o *Object) Open(ctx context.Context, options ...fs.OpenOption) (in io.Read
if err != nil {
return
}
- wrappedFd := readers.NewLimitedReadCloser(newFadviseReadCloser(o, fd, offset, limit), limit)
+ wrappedFd := readers.NewLimitedReadCloser(fd, limit)
if offset != 0 {
// seek the object
_, err = fd.Seek(offset, io.SeekStart)
Disabling fadvise was discussed in https://github.com/rclone/rclone/issues/7886 - maybe we should.
Second guessing the OS is probably a bad idea since I'm sure the linux kernel developers are better at memory management than me :-)
Hi @ncw, thanks for your reply. Your explanation actually makes a lot of sense (way better than the theory I came up with 😅). I tried your patch and got basically the same improvement (~2x speed increase) 🎉
Same test (v1.70.3 with fadvise disabled):
Transferred: 64.297 GiB / 64.297 GiB, 100%, 110.439 MiB/s, ETA 0s
Transferred: 1 / 1, 100%
Elapsed time: 8m50.0s
real 8m50.098s
So my impression based on the issue you linked is that the "proper" way forward here would be to make fadvise configurable - if that's the case I would close this PR.
It may be beyond my Linux / Golang skills, but I could give https://github.com/rclone/rclone/issues/7886 a try and make it configurable.
Closing this PR as there is a bigger discussion about this topic here: https://github.com/rclone/rclone/pull/8723
hi, @ncw , I'm re-opening this issue because after running more tests in a different setup, it seems fadvise is not the only thing impacting speeds 😅
I'm running a setup in Kubernetes where some SMB volumes are mounted in the nodes. I'm using rclone to transfer files from this storage to S3. Since the volumes are mounted in the nodes, rclone uses the "local" backend (not the SMB one).
Running the same test (64 GB file upload to S3 - g4dn.4xlarge EC2 instances in AWS Outpost):
| Setting | Avg Speed |
|---|---|
| fadvise disabled | ~165 MB/s |
| buffering re-enabled | ~310 MB/s |
| buffering re-enabled + fadvise disabled | ~347 MB/s |
I think since we are reading data from a network disk (SMB), disabling buffering has a huge impact when doing the multithreaded uploads to S3. What do you think?
This is the conclusion that the restic project came to - disable fadvise and use RAM buffering for max performance.
I don't want to balloon the memory usage of rclone though - there was a reason we did this (issue #7350). That issue has a lot of good stuff in it and is worth a read.
Adapting the table from #7350, this is what is actually implemented now:

- **RAM**: Buffer in RAM.
- **None**: Don't buffer, but re-read from the source if necessary.

The logic is RAM unless:

- the destination supports `OpenWriterAt`, eg local, azurefiles, smb, pcloud => None
- the source is local => None
- the destination supports `OpenChunkWriter` and promises not to seek its chunks except for retries, eg b2 => None
| Source backend | Destination backend | Buffering |
|---|---|---|
| local | any | None¹ |
| any | local/azurefiles/smb/pcloud | None² |
| any | b2 | None |
| any | s3/azureblob/oos | RAM |
And the notes from before
¹ Needs performance testing to see if it slows stuff down a lot! Might need to be RAM. (Update: yes, it does slow stuff down.)
² It works like this at the moment as the local backend never needs retries. (OpenWriterAt doesn't read the data twice)
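The rules above can be expressed as a small decision function. This is an illustrative sketch of the logic as described, not rclone's actual API - the function and parameter names are hypothetical:

```go
package main

import "fmt"

// chooseBuffering sketches the multi-thread copy buffering rules described
// above. All identifiers here are illustrative, not rclone's real ones.
func chooseBuffering(dstHasOpenWriterAt, srcIsLocal, dstChunkWriterNoSeek bool) string {
	switch {
	case dstHasOpenWriterAt: // eg local, azurefiles, smb, pcloud destinations
		return "None"
	case srcIsLocal: // local source: re-read from disk instead of buffering
		return "None"
	case dstChunkWriterNoSeek: // eg b2: chunks only re-read on retries
		return "None"
	default: // eg s3, azureblob, oos destinations
		return "RAM"
	}
}

func main() {
	fmt.Println(chooseBuffering(false, true, false))  // local -> s3: None
	fmt.Println(chooseBuffering(false, false, false)) // non-local -> s3: RAM
}
```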
Does that mean we should make it configurable? I hesitate to add yet another configuration flag for the poor users though. #7350 suggests a --low-memory flag which would make sure we used disk buffering for local reads.
Perhaps a more targeted flag like `--multi-thread-low-memory` which, if true, uses the None strategy, making the rules:

The logic is RAM unless:

- the destination supports `OpenWriterAt`, eg local, azurefiles, smb, pcloud => None
- the source is local and `--multi-thread-low-memory` is true => None
- the destination supports `OpenChunkWriter` and promises not to seek its chunks except for retries, eg b2 => None
We do now have `--max-buffer-memory`, which can be used to control how much memory rclone uses. That works pretty well and could be used instead of `--multi-thread-low-memory`.
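The idea behind capping buffer memory can be sketched as a fixed pool of chunk buffers behind a semaphore. This is an illustrative sketch, not how rclone actually implements `--max-buffer-memory`:

```go
package main

import "fmt"

// memPool caps total buffer memory: a buffered channel acts as a semaphore
// over a fixed number of pre-allocated chunk-sized slots.
type memPool struct{ slots chan []byte }

func newMemPool(maxBytes, chunkSize int) *memPool {
	n := maxBytes / chunkSize
	p := &memPool{slots: make(chan []byte, n)}
	for i := 0; i < n; i++ {
		p.slots <- make([]byte, chunkSize)
	}
	return p
}

// Get blocks until a chunk buffer is free, bounding memory in use;
// Put returns a buffer to the pool for reuse.
func (p *memPool) Get() []byte  { return <-p.slots }
func (p *memPool) Put(b []byte) { p.slots <- b }

func main() {
	pool := newMemPool(64<<20, 16<<20) // 64 MiB cap, 16 MiB chunks => 4 slots
	b := pool.Get()
	fmt.Println("got chunk of", len(b), "bytes; free slots:", len(pool.slots))
	pool.Put(b)
}
```

An upload worker would `Get` before reading a chunk and `Put` after the part is sent, so at most `maxBytes` of chunk memory is ever in flight regardless of `--multi-thread-streams`.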
What do you think @vitorog ?
Strictly for SMB (https://linux.die.net/man/8/mount.cifs), there also seem to be 2 options available that users can tune themselves:

- enabling FS-Cache via the `fsc` option
- setting `cache=loose` rather than the default value of `cache=strict`
I wonder if either of these would get you the same performance as rclone's RAM buffering without needing any code changes.
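For reference, a mount invocation using those options might look like this (illustrative config only - the server, share, mountpoint and credentials file are placeholders, `fsc` also needs the cachefilesd daemon running, and behaviour varies by kernel/cifs version):

```shell
# Mount an SMB share with FS-Cache enabled and loose caching semantics.
sudo mount -t cifs //server/share /mnt/share \
    -o fsc,cache=loose,credentials=/etc/smb-credentials
```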