's5cmd cat' subcommand is not using concurrent connections
The 's5cmd cat' subcommand is not using concurrent connections the way 's5cmd cp' does.
My use-case is downloading a 427 GB tarball from S3 and extracting it on the fly:
time s5cmd cat s3://bucket/file.tar.zst - | pzstd -d | tar -xv -C /
Example EC2 instance type: c5d.9xlarge with 36 CPU cores, 72 GB RAM, 900 GB local SSD
Comparing just the download part with the AWS CLI:
# time aws s3 cp s3://bucket/file.tar.zst - | cat >/dev/null
real 37m56.415s
user 22m50.195s
sys 19m8.677s
(around 192 MB/s)
With 's5cmd cat':
# time s5cmd cat s3://bucket/file.tar.zst >/dev/null
Still running. Only around 85 MB/s on a single S3 connection, according to netstat.
With 's5cmd cp' and writing to disk (without decompression):
time s5cmd cp s3://bucket/file.tar.zst /file.tar.zst
real 23m58.230s
user 7m56.734s
sys 22m40.482s
(around 304 MB/s)
With higher concurrency and larger parts:
# time s5cmd cp -c 36 -p 600 s3://bucket/file.tar.zst /file.tar.zst
real 10m3.064s
user 6m53.378s
sys 41m30.392s
(around 729 MB/s)
The cat command uses stdout as the output "file". stdout is not a seekable writer, which means we can use multiple connections for the download but can't use multiple threads for writes because of the ordering guarantee.
I'm surprised that awscli can achieve better throughput than s5cmd on a similar execution.
We are also seeing this exact same behavior: aws s3 cp ... - provides better throughput than s5cmd cat. Does s5cmd support copying to stdout?
> I'm surprised that awscli can achieve better throughput than s5cmd on a similar execution.
AWSCLI achieves better-than-single-stream (but worse-than-fully-parallel) throughput to stdout, at the cost of slightly higher initial latency and significant RAM usage, by filling a decently sized ring buffer in parallel and only cycling in new chunks when the earliest chunk completes. My understanding from the last time I looked was that s5cmd's cat didn't do this. Anecdotally, it's definitely possible to get better throughput than their Python implementation for the same RAM cost, but the RAM cost is not exactly small.
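For illustration, here is a rough Go sketch (not s5cmd code) of that kind of windowed download: a fixed ring of window slots is filled by parallel ranged GETs, and a slot is only recycled to a new chunk once the earliest chunk in it has been written to stdout. It assumes the object is readable with plain HTTP range requests (for example through a presigned URL passed as the only argument); chunkSize and window are made-up numbers.

// Illustrative sketch only, not s5cmd code: windowed parallel range downloads
// written to stdout in order. Assumes the object can be read with plain HTTP
// range requests (e.g. via a presigned URL passed as the first argument).
package main

import (
    "fmt"
    "io"
    "net/http"
    "os"
)

const (
    chunkSize = 32 << 20 // 32 MiB per ranged GET (made-up number)
    window    = 16       // chunks resident at once, ~512 MiB of RAM (made-up number)
)

// fetchRange downloads bytes [start, end] of url with a single range request.
func fetchRange(url string, start, end int64) ([]byte, error) {
    req, err := http.NewRequest(http.MethodGet, url, nil)
    if err != nil {
        return nil, err
    }
    req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", start, end))
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusPartialContent {
        return nil, fmt.Errorf("unexpected status %s", resp.Status)
    }
    return io.ReadAll(resp.Body)
}

func main() {
    url := os.Args[1]

    // Object size from a HEAD request.
    head, err := http.Head(url)
    if err != nil || head.ContentLength < 0 {
        panic("cannot determine object size")
    }
    head.Body.Close()
    size := head.ContentLength
    numChunks := (size + chunkSize - 1) / chunkSize

    // Ring of slots: slot j%window carries chunk j. A new chunk is started
    // only after the earliest chunk in that slot has been written out.
    slots := make([]chan []byte, window)
    launch := func(j int64) {
        slot := slots[j%window]
        go func() {
            lo, hi := j*chunkSize, (j+1)*chunkSize-1
            if hi >= size {
                hi = size - 1
            }
            buf, err := fetchRange(url, lo, hi)
            if err != nil {
                panic(err) // a real implementation would retry and propagate errors
            }
            slot <- buf
        }()
    }
    for j := int64(0); j < window && j < numChunks; j++ {
        slots[j] = make(chan []byte, 1)
        launch(j)
    }
    for j := int64(0); j < numChunks; j++ {
        buf := <-slots[j%window]
        if _, err := os.Stdout.Write(buf); err != nil {
            panic(err)
        }
        if next := j + window; next < numChunks {
            launch(next) // cycle the freed slot to the next chunk
        }
    }
}

With a window in the spirit of -c and a chunk size in the spirit of -p from the cp example above, the RAM cost is roughly window × chunkSize, which matches the trade-off described in the comment.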
Thanks for the pointers @fiendish. I thought about the same thing but haven't had the time to read the source code.
We can use the same approach. If anyone wants to contribute, we'd be very happy to review.
I've implemented something like this in Python before as an experiment, but unfortunately I don't know golang, so I can't easily help here. If anyone wants to do this without contemplating the method too hard, my dead-simple approach in Python was a slightly modified concurrent.futures.Executor.map that only allowed at most N results resident in RAM at a time (instead of the standard executor, which limits in-flight threads but doesn't bound result storage). Then it was just a matter of setting a desired N and a desired read size per thread, and the threads themselves were bog-standard range-read requests.
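For anyone comfortable with Go, a rough sketch of the same bounded idea (names and parameters are made up, and it is not based on s5cmd internals): a buffered channel of per-chunk result channels plays the role of the modified Executor.map, so only about maxResident chunk results can be resident at once while writes stay in order. The fetch parameter stands for any ranged-GET helper, such as the fetchRange function in the earlier sketch.

package sketch // hypothetical package; fetch is any ranged-GET helper

import "io"

// streamInOrder downloads [0, size) in chunkSize pieces with parallel ranged
// GETs, but writes them to w strictly in order. The buffered futures channel
// bounds both in-flight requests and resident results to roughly maxResident.
func streamInOrder(w io.Writer, size, chunkSize int64, maxResident int,
    fetch func(start, end int64) ([]byte, error)) error {

    type result struct {
        buf []byte
        err error
    }

    futures := make(chan chan result, maxResident)
    go func() {
        defer close(futures)
        for lo := int64(0); lo < size; lo += chunkSize {
            hi := lo + chunkSize - 1
            if hi >= size {
                hi = size - 1
            }
            fut := make(chan result, 1) // one-shot "future" for this chunk
            futures <- fut              // blocks while maxResident chunks are outstanding
            go func(lo, hi int64) {
                buf, err := fetch(lo, hi)
                fut <- result{buf, err}
            }(lo, hi)
        }
    }()

    // Consume futures in submission order, so output stays ordered even
    // though individual downloads complete out of order.
    for fut := range futures {
        r := <-fut
        if r.err != nil {
            return r.err // a real implementation would also cancel outstanding work
        }
        if _, err := w.Write(r.buf); err != nil {
            return err
        }
    }
    return nil
}

Calling it with os.Stdout, a chunk size in the spirit of -p and a maxResident in the spirit of -c keeps RAM at roughly maxResident × chunkSize.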
I download a load of compressed files and pipe them directly into the decoder. The lack of parallel downloads when outputting to stdout seems to hurt speed very noticeably, to the point where it is faster to download the 3-times-larger uncompressed version of the data.
I did a quick prototype in Rust (https://github.com/VeaaC/s3get) that just uses X threads, keeps results in a sorted binary tree to be written out by another thread, and limits the amount of pending data to 2*X blocks. This works very well, and I can mostly saturate the 2.5 Gbit/s connection.
I am not very experienced in Go, so I cannot port such an approach myself, but I imagine that it should not be much longer / more difficult.
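For reference, a rough and untested Go rendering of the s3get scheme described above (all names are made up): numWorkers goroutines perform the ranged GETs, completed blocks land in a map keyed by block index (standing in for the sorted tree), a single writer drains them in order, and a semaphore of 2*numWorkers tokens caps the pending data. As before, fetch stands for any ranged-GET helper.

package sketch // hypothetical package; fetch is any ranged-GET helper

import (
    "io"
    "sync"
)

// streamReordered: workers complete blocks out of order, a reorder buffer
// (map keyed by block index) lets a single writer emit them in order, and a
// semaphore limits pending blocks to 2*numWorkers.
func streamReordered(w io.Writer, size, blockSize int64, numWorkers int,
    fetch func(start, end int64) ([]byte, error)) error {

    type block struct {
        idx int64
        buf []byte
    }

    numBlocks := (size + blockSize - 1) / blockSize
    jobs := make(chan int64)
    done := make(chan block)
    errs := make(chan error, numWorkers)
    sem := make(chan struct{}, 2*numWorkers) // at most 2*numWorkers blocks pending

    var wg sync.WaitGroup
    for i := 0; i < numWorkers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for idx := range jobs {
                lo := idx * blockSize
                hi := lo + blockSize - 1
                if hi >= size {
                    hi = size - 1
                }
                buf, err := fetch(lo, hi)
                if err != nil {
                    errs <- err // a real implementation would cancel everything via context
                    return
                }
                done <- block{idx, buf}
            }
        }()
    }

    // Feed block indices in order; each claims a semaphore token that the
    // writer releases once the block has been flushed.
    go func() {
        for idx := int64(0); idx < numBlocks; idx++ {
            sem <- struct{}{}
            jobs <- idx
        }
        close(jobs)
    }()
    go func() { wg.Wait(); close(done) }()

    pending := make(map[int64][]byte) // reorder buffer
    for next := int64(0); next < numBlocks; {
        select {
        case err := <-errs:
            return err
        case b, ok := <-done:
            if !ok {
                return io.ErrUnexpectedEOF // workers exited early
            }
            pending[b.idx] = b.buf
            for buf, ok := pending[next]; ok; buf, ok = pending[next] {
                if _, err := w.Write(buf); err != nil {
                    return err
                }
                delete(pending, next)
                <-sem
                next++
            }
        }
    }
    return nil
}

Error handling here is deliberately minimal; a real implementation would cancel outstanding work with a context and retry failed ranges.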