s4cmd
syncing subset of a directory using leaf file prefix is needlessly super slow
I have a bucket with a million files whose keys are well spread (e.g., JF2SJAEC8HH466546.xml, 4S4BSANC7H3257612.xml, etc.). Often I want to fetch a subset to my local disk that I would describe in s3cmd as "s3://mybucket/dir1/4S4B", using a prefix match. My understanding is that with s4cmd I instead pass "s4://mybucket/dir1/4S4B*" to the sync command, which is fine; the wildcard syntax works.
The problem is that it takes forever for s4cmd to figure out the set of files to copy. Running with --debug, it looks like it's enumerating the whole directory and matching my pattern on the client. Since I only want 100 out of a million files, this approach performs poorly for my case (that is, if my assumptions are right; I haven't read the code).
I would suggest an optimization: the portion of the source path up to the first wildcard should be passed as a prefix to the underlying AWS list API, which would cut the number of key names fetched enormously. Although s3cmd was a pig copying each file one by one, I believe it was smarter about this first enumeration step.
Also, congrats on a good looking tool, I have been wanting something with threading on the cmdline for a long time.
I'm also seeing this. I love the threading, but the setup time on operations can be really expensive:
time s4cmd -c 8 get s3://mybucket/with/lots/of/files/somepattern*
real 7m5.128s
user 2m50.316s
sys 0m12.220s
time s3cmd get s3://mybucket/with/lots/of/files/somepattern*
real 0m9.812s
user 0m0.476s
sys 0m0.124s
Is it implemented this way because s4cmd supports some wildcard patterns that the AWS APIs do not?
Wildcard support: Wildcards, including multiple levels of wildcards, like in Unix shells, are handled. For example: s3://my-bucket/my-folder/20120512/*/*chunk00?1?
If I can work out a nice way to use AWS API wildcards when possible, and do the client-side wildcards when needed, I'll submit a PR.
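As a starting point, here is a rough sketch of the idea (not s4cmd's actual code; the function names are mine). The literal part of the pattern before the first wildcard character becomes the server-side list prefix, and the full pattern is still matched on the client to handle multi-level wildcards like `*/*chunk00?1?`. S3's ListObjects APIs accept a `Prefix` parameter, so only keys that could possibly match are returned:

```python
import fnmatch

WILDCARD_CHARS = set("*?[")

def split_wildcard(key_pattern):
    """Split an S3 key pattern into (literal_prefix, pattern_rest).

    The literal prefix up to the first wildcard character can be passed
    as the Prefix= argument of an S3 list call (e.g. boto3's
    list_objects_v2), so the server filters most keys before they are
    ever sent to the client.
    """
    for i, ch in enumerate(key_pattern):
        if ch in WILDCARD_CHARS:
            return key_pattern[:i], key_pattern[i:]
    return key_pattern, ""

def matching_keys(keys, key_pattern):
    """Client-side match over whatever the prefixed listing returns.

    Note: fnmatch's '*' also crosses '/' boundaries, which may be
    looser than s4cmd's per-level wildcard semantics; a real PR would
    want to match each path component separately.
    """
    return [k for k in keys if fnmatch.fnmatchcase(k, key_pattern)]
```

With the example pattern from this issue, `split_wildcard("dir1/4S4B*")` yields the prefix `"dir1/4S4B"`, so the list call would return roughly 100 keys instead of a million, and the client-side filter only has to re-check that short list.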