keep prefix after glob?
I often use glob to get a list of files, then loop over the list to apply similar operations to each file. It would be nice if I could keep the prefix (e.g. s3://) instead of adding it back in myself:
import s3fs
fs = s3fs.S3FileSystem()
files = fs.glob("s3://bucket/YYYYMM*")
# files = [bucket/path/YYYYMM01, bucket/path/YYYYMM02] ...
I would like:
# files = [s3://bucket/path/YYYYMM01, s3://bucket/path/YYYYMM02] ...
This is easy to add as a user, but it is common enough that it would be nice to avoid the step, e.g. with a keep_prefix=True kwarg?
Current workarounds:
import pandas as pd
for file in files:
    pd.read_parquet("s3://" + file)  # re-add the prefix that glob stripped
or
files = ["s3://" + file for file in files]
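If the same step recurs across scripts, the workaround can be wrapped in a small helper. This is only a sketch: `with_protocol` is a hypothetical name, not part of s3fs or fsspec, and it assumes every path belongs to a single, unchained protocol.

```python
def with_protocol(paths, protocol="s3"):
    """Prepend '<protocol>://' to each path, leaving already-prefixed paths alone.

    Hypothetical helper for illustration; not an s3fs or fsspec function.
    """
    prefix = f"{protocol}://"
    return [p if p.startswith(prefix) else prefix + p for p in paths]


# Paths in the shape that fs.glob returns (bucket first, no scheme).
files = ["bucket/path/YYYYMM01", "bucket/path/YYYYMM02"]
full_urls = with_protocol(files)
```

Because already-prefixed paths are left untouched, calling the helper twice on the same list does not produce `s3://s3://...`.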
I agree that this would be a nice feature to have, although preferably at the fsspec level for consistency with other providers. I am not sure how well it would work with the chained schemes that fsspec supports.
You could maybe achieve a similar result using UPath, a nifty package that builds on pathlib and uses fsspec for the underlying filesystem interactions. Using that, you could do something like:
import pandas as pd
from upath import UPath
bucket = UPath("s3://bucket")
# glob is relative to the path, so the YYYYMM part goes in the pattern
for file in bucket.glob("YYYYMM*"):
    pd.read_parquet(file)
Hurray for UPath! I didn't know it had gone so far. Is it ready to be generally recommended?
Getting back to the specific question: it's not totally unreasonable, but I think it would break other things, and it is especially tricky for implementations that may have multiple protocols.
I was thinking that it might be a good idea (e.g., https://github.com/intake/filesystem_spec/pull/723 ) to have a general-purpose API, so you would do fsspec.glob("s3://...") and expect "s3://..." paths back.
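For illustration only, a wrapper in that spirit could look like the sketch below. It is not the API from the linked PR: `glob_with_protocol` is a hypothetical name, and it assumes the filesystem object offers `unstrip_protocol`, the method fsspec filesystems use to put the protocol back on a path. Chained URLs are not handled, and `FakeS3` is a stand-in so the example runs without network access.

```python
def glob_with_protocol(fs, pattern):
    """Glob on an fsspec-style filesystem and return protocol-qualified paths.

    Hypothetical helper. Assumes `fs.glob(pattern)` returns a list of
    paths and that `fs.unstrip_protocol(path)` re-adds the protocol.
    """
    return [fs.unstrip_protocol(path) for path in fs.glob(pattern)]


class FakeS3:
    """Stand-in filesystem, used only to show the behaviour offline."""

    protocol = "s3"

    def glob(self, pattern):
        # A real filesystem would match `pattern`; this returns fixed paths.
        return ["bucket/path/YYYYMM01", "bucket/path/YYYYMM02"]

    def unstrip_protocol(self, path):
        return f"{self.protocol}://{path}"


urls = glob_with_protocol(FakeS3(), "s3://bucket/path/YYYYMM*")
```

With a real filesystem, you would pass something like `s3fs.S3FileSystem()` in place of the stand-in.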
Hurray indeed! I think UPath is awesome and well designed, thanks, of course, to s3fs and fsspec, so hurray for those too! :)
Not sure if it is ready for general recommendation yet; I will defer to the author @andrewfulton9 on that.
There is an s3fs-specific implementation that seems to have been around for some time and, to my knowledge, seems stable, which is why I thought it might be fitting for this particular scenario.