universal_pathlib

fsspec URL chaining

Open brl0 opened this issue 4 years ago • 3 comments

This is not urgent, but based on my very limited and quick testing, it seems that fsspec URL chaining may not work properly. It is possible I was not doing it properly, or there may be some clever way to work around the limitation, but I thought it was worth raising an issue for now.

brl0 avatar Aug 04 '21 21:08 brl0

Note: a url-chaining implementation should also support other chaining styles, see: https://github.com/zarr-developers/zeps/pull/48

ap-- avatar Sep 30 '23 19:09 ap--

Hi, I'm interested in getting this functionality. Is there a lot to do for this? If not, I could maybe have a look?

mraspaud avatar May 28 '24 14:05 mraspaud

Thank you for offering to contribute!

To implement URL chaining support, two items have to be completed:

  1. We need to first parse the chained url into (protocol, path, and storage options) for each protocol in the chain. All of this functionality is already available in filesystem_spec, and the function that does this in upath should rely as much as possible on the filesystem_spec implementation.
  2. The UPath class should get an attribute (name up for debate) .chain that acts much like UPath.parents but provides access to the individual links/filesystems of the chained url. To make this work correctly, a upath instance would have to keep track of the (protocol, path, storage options) tuples before and after the current filesystem.
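As a rough illustration of item 1, the split into per-protocol tuples could be sketched as follows. This is a simplified stand-in, not the filesystem_spec implementation: `ChainLink` and `split_chain` are hypothetical names, and the sketch assumes every segment is either `protocol://path` or a bare protocol name such as `simplecache`. A real implementation should defer to filesystem_spec for the parsing.

```python
from typing import Any, NamedTuple


class ChainLink(NamedTuple):
    """One link of a chained URL (hypothetical helper type)."""
    protocol: str
    path: str
    storage_options: dict[str, Any]


def split_chain(url: str, **options: dict[str, Any]) -> list[ChainLink]:
    """Split a chained URL on '::' into (protocol, path, storage options).

    Assumption: storage options are passed as keyword arguments named
    after the protocol they belong to, e.g. s3={...}.
    """
    links = []
    for segment in url.split("::"):
        protocol, sep, path = segment.partition("://")
        if not sep:
            # Bare protocol (e.g. a cache): it carries no path of its own.
            path = ""
        links.append(ChainLink(protocol, path, dict(options.get(protocol, {}))))
    return links


links = split_chain(
    "simplecache::zip://path/in/archive/spreadsheet.csv::s3://mybucket/data.zip",
    s3={"anon": True},
)
for link in links:
    print(link)
```

Keeping the links in an ordered list is what would let a future `.chain` attribute index into them the way `UPath.parents` indexes into ancestors.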

Here is an example of what this could look like:

# interface does not exist yet
# this is just a mockup

>>> from upath import UPath
>>> pth = UPath("simplecache::zip://path/in/archive/spreadsheet.csv::s3://mybucket/data.zip")
>>> pth
S3Path("simplecache::zip://path/in/archive/spreadsheet.csv::s3://mybucket/data.zip")
>>> len(pth.chain)
3
>>> pth.chain[-2]
ZipPath("simplecache::zip://path/in/archive/spreadsheet.csv::s3://mybucket/data.zip")
>>> pth2 = pth.chain[-2]
>>> pth2.with_name("other_spreadsheet.csv")
ZipPath("simplecache::zip://path/in/archive/other_spreadsheet.csv::s3://mybucket/data.zip")

When this gets implemented, there are very likely a few complications that would have to be solved. For example, the various caches implemented in filesystem_spec don't take a path; they just use the path of the next filesystem in the chain. So this should probably be special-cased in UPath somehow.
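One possible shape for that special case, purely as a sketch: when a link belongs to a caching protocol, look further down the chain for the path it actually refers to. The `CACHE_PROTOCOLS` set and the `effective_path` helper are hypothetical and only illustrate the idea; the protocol names listed are an assumption about which protocols behave as caches.

```python
# Assumed set of protocols that carry no path of their own.
CACHE_PROTOCOLS = {"simplecache", "filecache", "blockcache", "cached"}


def effective_path(links: list[tuple[str, str]], index: int) -> str:
    """Return the path that links[index] refers to.

    Each link is a (protocol, path) pair. A cache link borrows the path
    of the next non-cache link after it in the chain.
    """
    for protocol, path in links[index:]:
        if protocol not in CACHE_PROTOCOLS:
            return path
    raise ValueError("chain ends in a cache protocol with no target path")


chain = [
    ("simplecache", ""),
    ("zip", "path/in/archive/spreadsheet.csv"),
    ("s3", "mybucket/data.zip"),
]
print(effective_path(chain, 0))  # path borrowed from the zip link
print(effective_path(chain, 2))  # the s3 link's own path
```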

Also, when instantiating UPath with a chained url, storage options should now be provided as one dict per protocol, passed as a keyword argument named after that protocol, i.e. UPath("zip://pth1::s3://bucket/pth2", zip={...}, s3={...})

I'll try to provide more information once I am back from traveling!

Cheers, Andreas

ap-- avatar Jun 01 '24 08:06 ap--