
keep prefix after glob?

Open raybellwaves opened this issue 4 years ago • 3 comments

I often use glob to get a list of files and then loop over that list to apply the same operation to each file. It would be nice if I could keep the prefix (e.g. s3://) instead of adding it back in myself:

import s3fs
fs = s3fs.S3FileSystem()

files = fs.glob("s3://bucket/YYYYMM*")
# files = [bucket/path/YYYYMM01, bucket/path/YYYYMM02] ...

What I would like:

# files = [s3://bucket/path/YYYYMM01, s3://bucket/path/YYYYMM02] ...

It is easy to add as a user, but this is common enough that it would be nice to avoid the step, e.g. with a keep_prefix=True kwarg?

Current workarounds:

import pandas as pd

for file in files:
    pd.read_parquet("s3://" + file)

or

files = ["s3://" + file for file in files]
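The same workaround can be wrapped in a small helper so that paths which already carry the prefix are not prefixed twice. This is only a sketch: `add_protocol` is a hypothetical name, not part of s3fs or fsspec, and the file list stands in for the output of `fs.glob`.

```python
from typing import Iterable, List


def add_protocol(paths: Iterable[str], protocol: str = "s3") -> List[str]:
    """Return paths with '<protocol>://' prepended where it is missing.

    Hypothetical helper for illustration; not an s3fs or fsspec function.
    """
    prefix = f"{protocol}://"
    # Leave already-prefixed paths untouched so the call is safe to repeat.
    return [p if p.startswith(prefix) else prefix + p for p in paths]


# Stand-in for the list returned by fs.glob("s3://bucket/YYYYMM*").
files = ["bucket/path/YYYYMM01", "bucket/path/YYYYMM02"]
urls = add_protocol(files)
```

Calling `add_protocol(urls)` a second time returns the list unchanged.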

raybellwaves, Aug 16 '21 18:08

I agree that this would be a nice feature to have, although preferably at the fsspec level for consistency with other providers. I am not sure how well it would work with the chained schemes that fsspec supports.

You could maybe achieve a similar approach using UPath, a nifty package that builds on pathlib using fsspec for the underlying filesystem interactions. Using that, you could do something like:

import pandas as pd
from upath import UPath

upath = UPath("s3://bucket/YYYYMM")

for file in upath.glob("*"):
    pd.read_parquet(file)

brl0, Aug 16 '21 20:08

Hurray for UPath! I didn't know it had gone so far. Is it ready to be generally recommended?

Getting back to the specific question: it's not totally unreasonable, but I think it would break other things, and it is especially tricky for implementations that may have multiple protocols.

I was thinking that it might be a good idea (e.g., https://github.com/intake/filesystem_spec/pull/723 ) to have a general-purpose API, so you would do fsspec.glob("s3://...") and expect "s3://..." paths back.
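A rough sketch of that idea, for illustration only: the wrapper reads the scheme from the pattern, delegates to the filesystem's `glob`, and re-attaches the scheme to each result. `glob_keep_protocol` and `FakeS3` are hypothetical names, `FakeS3` is a stand-in so the snippet runs without S3 access, and chained URLs (for example `simplecache::s3://...`) are deliberately not handled.

```python
from typing import List


def glob_keep_protocol(fs, url: str) -> List[str]:
    """Glob `url` on `fs` and return results with the URL's scheme re-attached.

    `fs` is any object with an fsspec-style glob(path) method.
    Hypothetical helper; not the API proposed in the linked PR.
    """
    scheme, sep, _ = url.partition("://")
    if not sep:
        # No scheme in the pattern, so there is nothing to re-attach.
        return list(fs.glob(url))
    prefix = scheme + "://"
    return [p if p.startswith(prefix) else prefix + p for p in fs.glob(url)]


class FakeS3:
    """Stand-in that mimics s3fs returning paths without the scheme."""

    def glob(self, path: str) -> List[str]:
        return ["bucket/path/YYYYMM01", "bucket/path/YYYYMM02"]


paths = glob_keep_protocol(FakeS3(), "s3://bucket/YYYYMM*")
```

With a real `s3fs.S3FileSystem()` passed as `fs`, the call would look the same; a filesystem that answers to several protocols would get back whichever scheme the caller used in the pattern.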

martindurant, Aug 16 '21 20:08

Hurray indeed! I think UPath is awesome and well designed, thanks, of course, to s3fs and fsspec, so hurray for those too! :)

I'm not sure if it is ready for general recommendation yet; I will defer to the author @andrewfulton9 on that.

There is an s3fs-specific implementation in UPath that has been around for some time and, to my knowledge, is stable, which is why I thought it might fit this particular scenario.

brl0, Aug 16 '21 21:08