s4cmd
syncing subset of a directory using leaf file prefix is needlessly super slow
I have a bucket with a million files whose keys are well spread (e.g., JF2SJAEC8HH466546.xml, 4S4BSANC7H3257612.xml, etc.). Often I want to fetch a subset to my local disk that I would describe in s3cmd as "s3://mybucket/dir1/4S4B", using a prefix match. My understanding is that with s4cmd I instead pass "s4://mybucket/dir1/4S4B*" to the sync command, which is fine; the wildcard syntax works.
The problem is that it takes forever for s4cmd to figure out the set of files to copy. Running with --debug, it looks like it's enumerating the whole directory and matching my pattern on the client. Since I only want 100 out of a million files, this approach performs poorly for my case (that is, if my assumptions are right; I haven't read the code).
I would suggest an optimization: the portion of the source path up to the first wildcard should be passed as a prefix to the underlying AWS list API, which would cut the number of key names fetched enormously. Although s3cmd was a pig copying each file one by one, I believe it was smarter about this first enumeration step.
Also, congrats on a good looking tool, I have been wanting something with threading on the cmdline for a long time.
I'm also seeing this. I love the threading, but the setup time on operations can be really expensive:
time s4cmd -c 8 get s3://mybucket/with/lots/of/files/somepattern*
real 7m5.128s
user 2m50.316s
sys 0m12.220s
time s3cmd get s3://mybucket/with/lots/of/files/somepattern*
real 0m9.812s
user 0m0.476s
sys 0m0.124s
Is it implemented this way because s4cmd supports some wildcard patterns that the AWS APIs do not?
Wildcard support: Wildcards, including multiple levels of wildcards, like in Unix shells, are handled. For example: s3://my-bucket/my-folder/20120512/*/*chunk00?1?
If I can work out a nice way to use AWS API wildcards when possible, and do the client-side wildcards when needed, I'll submit a PR.
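As a starting point, here is a rough sketch of the idea (not s4cmd's actual code; the function names are mine). The literal part of the pattern before the first wildcard character becomes the server-side list prefix, and the full pattern is still matched on the client to handle multi-level wildcards like `*/*chunk00?1?`. S3's ListObjects APIs accept a `Prefix` parameter, so only keys that could possibly match are returned:

```python
import fnmatch

WILDCARD_CHARS = set("*?[")

def split_wildcard(key_pattern):
    """Split an S3 key pattern into (literal_prefix, pattern_rest).

    The literal prefix up to the first wildcard character can be passed
    as the Prefix= argument of an S3 list call (e.g. boto3's
    list_objects_v2), so the server filters most keys before they are
    ever sent to the client.
    """
    for i, ch in enumerate(key_pattern):
        if ch in WILDCARD_CHARS:
            return key_pattern[:i], key_pattern[i:]
    return key_pattern, ""

def matching_keys(keys, key_pattern):
    """Client-side match over whatever the prefixed listing returns.

    Note: fnmatch's '*' also crosses '/' boundaries, which may be
    looser than s4cmd's per-level wildcard semantics; a real PR would
    want to match each path component separately.
    """
    return [k for k in keys if fnmatch.fnmatchcase(k, key_pattern)]
```

With the example pattern from this issue, `split_wildcard("dir1/4S4B*")` yields the prefix `"dir1/4S4B"`, so the list call would return roughly 100 keys instead of a million, and the client-side filter only has to re-check that short list.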