fsspec URL chaining
This is not urgent, but based on my very limited and quick testing, it seems that fsspec URL chaining may not work properly. It is possible that I was not using it correctly, or that there may be some clever way to work around the limitation, but I thought it was worth raising an issue for now.
Note: a url-chaining implementation should also support other chaining styles, see: https://github.com/zarr-developers/zeps/pull/48
Hi, I'm interested in getting this functionality. Is there a lot to do for this? If not, I could maybe have a look?
Thank you for offering to contribute!
To implement URL chaining support, two items have to be completed:
- We need to first parse the chained url into (protocol, path, and storage options) for each protocol in the chain. All of this functionality is already available in filesystem_spec, and the function that does this in upath should rely as much as possible on the filesystem_spec implementation.
- The UPath class should get an attribute .chain (name up for debate) that acts much like UPath.parents but provides access to the individual links/filesystems of the chained URL. To make this work correctly, a UPath instance would have to keep track of the (protocol, path, storage options) tuples before and after the current filesystem.
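The first item can be sketched roughly as follows. This is a simplified, standard-library-only illustration and not the filesystem_spec implementation (which handles more cases and is what upath should actually rely on); the names `ChainLink` and `split_chain` are hypothetical and exist only for this example.

```python
from __future__ import annotations

from typing import Any, NamedTuple


class ChainLink(NamedTuple):
    """One link of a chained URL (hypothetical helper type)."""

    protocol: str
    path: str
    storage_options: dict[str, Any]


def split_chain(url: str, **storage_options: dict[str, Any]) -> list[ChainLink]:
    """Split a chained URL into one ChainLink per "::"-separated segment.

    Storage options are passed as keyword arguments named after the
    protocol they belong to, e.g. ``s3={"anon": True}``.
    """
    links = []
    for segment in url.split("::"):
        protocol, sep, path = segment.partition("://")
        if not sep:
            if segment.isalnum():
                # A bare word such as "simplecache": a protocol without a path.
                protocol, path = segment, ""
            else:
                # Assumption: anything else without "://" is a local path.
                protocol, path = "file", segment
        options = dict(storage_options.get(protocol, {}))
        links.append(ChainLink(protocol, path, options))
    return links


links = split_chain(
    "simplecache::zip://path/in/archive/spreadsheet.csv::s3://mybucket/data.zip",
    s3={"anon": True},
)
for link in links:
    print(link)
```

A `.chain` attribute could then be built on top of such a list, with each element remembering its index so that the links before and after it stay reachable.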
Here is an example of what this could look like:
# interface does not exist yet
# this is just a mockup
>>> from upath import UPath
>>> pth = UPath("simplecache::zip://path/in/archive/spreadsheet.csv::s3://mybucket/data.zip")
>>> pth
S3Path("simplecache::zip://path/in/archive/spreadsheet.csv::s3://mybucket/data.zip")
>>> len(pth.chain)
3
>>> pth.chain[-2]
ZipPath("simplecache::zip://path/in/archive/spreadsheet.csv::s3://mybucket/data.zip")
>>> pth2 = pth.chain[-2]
>>> pth2.with_name("other_spreadsheet.csv")
ZipPath("simplecache::zip://path/in/archive/other_spreadsheet.csv::s3://mybucket/data.zip")
When this gets implemented, there are very likely a few complications that will have to be solved. For example, the various caches implemented in filesystem_spec don't take a path of their own; they just use the path of the next filesystem in the chain. So this should probably be special-cased in UPath somehow.
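One possible way to special-case the caches is sketched below, assuming the chain has already been split into (protocol, path) pairs. The function name `resolve_cache_paths` and the `CACHE_PROTOCOLS` set are assumptions made for this illustration; the set lists the caching protocols I am aware of in filesystem_spec and may be incomplete.

```python
from __future__ import annotations

# Assumption: protocols that wrap the next filesystem and carry no path of their own.
CACHE_PROTOCOLS = frozenset({"simplecache", "filecache", "blockcache"})


def resolve_cache_paths(links: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Give every path-less cache link the path of the next link in the chain.

    ``links`` is ordered as in the URL, outermost filesystem first.
    """
    resolved = []
    next_path = ""
    # Walk from the innermost link outwards so stacked caches resolve too.
    for protocol, path in reversed(links):
        if protocol in CACHE_PROTOCOLS and not path:
            path = next_path
        resolved.append((protocol, path))
        next_path = path
    resolved.reverse()
    return resolved


chain = [
    ("simplecache", ""),
    ("zip", "path/in/archive/spreadsheet.csv"),
    ("s3", "mybucket/data.zip"),
]
resolved = resolve_cache_paths(chain)
print(resolved)
```

With this approach the cache link reports the same path as the zip link it wraps, while the other links are left unchanged.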
Also, when instantiating UPath with a chained URL, storage options should now be provided as dicts passed as keyword arguments named after the protocol they apply to, i.e. UPath("zip://pth1::s3://bucket/pth2", zip={...}, s3={...})
I'll try to provide more information once I am back from traveling!
Cheers, Andreas