s3fs icon indicating copy to clipboard operation
s3fs copied to clipboard

Glob ** support could be more efficient

Open danielnaab opened this issue 6 years ago • 5 comments

When globbing paths with **, such as:

s3://my-bucket/**

The underlying fsspec implementation (referenced in #54) will iterate over each "directory" in the bucket, leading to multiple api calls. However, boto3's list_objects_v2 function will by default return all sub-keys with the given prefix, if the delimiter is omitted:

response = s3.list_objects_v2(Bucket="my-bucket", Prefix="")
response["Contents"]  == all keys in the bucket

The above possibility implies there is an optimization available to avoid repeated calls to S3Filesystem._lsdir, which should significantly speed up this operation, by removing the delimiter argument as specified here: https://github.com/dask/s3fs/blob/master/s3fs/core.py#L325 (Understanding, of course, that globbing ** in the middle of the path adds extra complexity)

danielnaab avatar Aug 13 '19 20:08 danielnaab

Although you are totally right, getting all files below a fixture like this is the minority case, and so we choose to optimise for what I think it more typical: one- or two-level globs. It might be acceptable to add shortcuts that use this call instead within glob and find, so long as the default is maintained.

martindurant avatar Aug 13 '19 20:08 martindurant

Although you are totally right, getting all files below a fixture like this is the minority case, and so we choose to optimise for what I think it more typical: one- or two-level globs. It might be acceptable to add shortcuts that use this call instead within glob and find, so long as the default is maintained.

Hi @martindurant thanks for the quick response. Yes makes sense, this optimization can only safely be done if the pattern ends in **, or if knowledge of the bucket structure exists... Also, in addition to find, walk would benefit with the default maxdepth=None.

danielnaab avatar Aug 13 '19 20:08 danielnaab

(except that walk explicitly doesn't work like that, but iterates directory-wise. You would have to have two very separate code paths)

Are you interested in implementing this?

martindurant avatar Aug 13 '19 20:08 martindurant

Sorry I'm using boto3 directly for my use case, but wanted to share the speed up opportunity. If I can get some time scheduled I'd be happy to revisit and try my hand but I don't have the time ATM.

Also, thinking through this a bit, I don't think there is any circumstances where not using this approach makes sense, as a ** glob in the middle of a path still needs to query for every subdirectory to the right of the **. (Unless I misunderstand the semantics of **.)

danielnaab avatar Aug 13 '19 21:08 danielnaab

Correct, but glob rarely includes **. I would love to see a PR sometime, I can't promise when I might work on it. Perhaps someone else will. I still maintain that this is a relatively rare usage.

martindurant avatar Aug 13 '19 21:08 martindurant