s3cmd
Use threads for Glacier restore for a 100x speedup.
Each call to the restore API has significant latency but transfers very little data. For large sets of files, a serialized request loop is therefore very slow; issuing the requests in parallel yields a significant speedup.
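For illustration, here is a minimal sketch of the parallel approach using boto3 and a thread pool. This is not the actual patch (s3cmd has its own S3 client, not boto3), and the restore tier, retention days, and pool size are assumptions:

```python
import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client("s3")

def restore_key(bucket, key, days=7):
    # One POST Object restore call: high latency, almost no data.
    s3.restore_object(
        Bucket=bucket,
        Key=key,
        RestoreRequest={"Days": days, "GlacierJobParameters": {"Tier": "Bulk"}},
    )

def restore_all(bucket, keys, workers=64):
    # Overlap the per-request latency by issuing calls from a thread pool
    # instead of a serial loop.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(restore_key, bucket, k) for k in keys]
        for f in futures:
            f.result()  # re-raise any failure
```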
When hammering S3 with a large set of requests, some of them will time out (even when run serially). Adding retry logic avoids having to re-run the command from scratch.
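A retry wrapper along these lines is enough to ride out occasional timeouts. This is a generic sketch; the function name, attempt count, and backoff parameters are made up for illustration:

```python
import random
import time

def with_retries(fn, attempts=5, base_delay=1.0):
    # Retry a flaky call with exponential backoff plus jitter, so a single
    # timed-out request doesn't kill the whole run.
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random()))
```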
I tried running s3cmd restore on a directory tree with tens of thousands of files. From the command output, it looked like s3cmd was issuing 3-4 requests per second, so the operation would have taken hours to complete. It never got that far: after about 20 minutes, one request timed out and the whole operation failed. Running the command again got me a bit further, but it eventually failed a second time. I would probably have needed to re-run it in a shell loop many times to eventually get through all my files.
Adding retries and using a thread pool got me to the point where I could successfully restore the whole directory tree in about 10 minutes, running the command once.
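Putting the two sketches above together (they share the `s3` client, `restore_key`, and `with_retries` defined earlier), a driver could look roughly like this; `my-bucket` and `archive/` are placeholder names:

```python
def keys_under(bucket, prefix):
    # Enumerate every key below the prefix; the paginator transparently
    # handles listings of more than 1000 objects.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]

with ThreadPoolExecutor(max_workers=64) as pool:
    futures = [
        pool.submit(with_retries, lambda k=key: restore_key("my-bucket", k))
        for key in keys_under("my-bucket", "archive/")
    ]
    for f in futures:
        f.result()
```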
Thank you for your contribution, and sorry for the delay in reviewing. The idea is pretty good. I don't think we will merge it for the moment, because your PR scratches the surface of something more global that has to be done for s3cmd. A long-requested feature is to have more operations done in parallel: not only restore, but also upload, download, etc., and for that to be user-configurable. If that work is done, your work would probably be merged as part of it.
The issue is a little more complex than it looks, because there are two different kinds of parallelism: the same task split up (like the restore here), and different tasks run in parallel (like having the local list and the remote list built at the same time).
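To make the distinction concrete, the second kind could be sketched like this. The helpers are hypothetical stand-ins and do not reflect how s3cmd is structured internally (`keys_under` is the listing helper from the earlier sketch):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def local_list(root):
    # Stand-in for building the local file list.
    return [os.path.join(d, f) for d, _, files in os.walk(root) for f in files]

def remote_list(bucket, prefix):
    # Stand-in for building the remote object list.
    return list(keys_under(bucket, prefix))

# Different tasks in parallel: both lists are built at the same time.
with ThreadPoolExecutor(max_workers=2) as pool:
    local_future = pool.submit(local_list, "./data")
    remote_future = pool.submit(remote_list, "my-bucket", "data/")
    local_files, remote_keys = local_future.result(), remote_future.result()
```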
Any news?