universal_pathlib

Serializing s3-compatible endpoint url within path strings

Open Koncopd opened this issue 1 year ago • 3 comments

I want to discuss this. We manage a lot of different paths and also need to support paths with s3-compatible endpoint urls alongside normal s3 paths. I understand that an endpoint url can easily be passed via UPath("s3://bucket/key", endpoint_url=endpoint_url). However, it really feels like the endpoint url is part of the path itself, and it would be very useful to have something like UPath("s3://bucket?endpoint_url/key") or UPath("s3://bucket/key?endpoint_url"). Such path strings are also easier to store and to differentiate from normal s3 paths. What do you think?

Koncopd avatar Jan 30 '25 16:01 Koncopd

Hi @Koncopd,

Thanks for opening the issue ❤

It's possible to do what you want by subclassing S3FileSystem and S3Path and registering both in your code. I'll provide a code example later today.

I also understand that sometimes it's just more convenient to have a single string representation for a path. It makes passing around everything just a bit easier.

In general though, I would discourage it. If you want to be as flexible as possible in supporting filesystem_spec filesystems, you need to accept that a path is clearly defined via three items:

the filesystem:

  1. protocol (indirectly defines the filesystem_spec AbstractFileSystem subclass)
  2. storage_options (defines how the filesystem is instantiated, and potentially contains chained filesystems)

and the path on the filesystem:

  3. path (i.e. some string that points to a resource)
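As a minimal stdlib sketch of what storing those three pieces explicitly could look like (the `StoredPath` record and its field names are hypothetical, not part of universal_pathlib):

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class StoredPath:
    """Hypothetical record keeping the three pieces of a path separate."""

    protocol: str  # selects the AbstractFileSystem subclass
    path: str  # location of the resource on that filesystem
    storage_options: dict[str, Any] = field(default_factory=dict)

    def to_upath_args(self):
        # the pieces you would pass to UPath(url, **storage_options)
        return f"{self.protocol}://{self.path}", self.protocol, self.storage_options


record = StoredPath(
    protocol="s3",
    path="bucket/key",
    storage_options={"endpoint_url": "https://minio.example.com"},
)
url, protocol, options = record.to_upath_args()
```

Storing this record (rather than a single flattened string) sidesteps the serialization questions discussed below entirely.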

So it's better and more flexible to design your application to store all three pieces of information that identify a unique path. The fact that, for some filesystems, some of the storage_options can be serialized into the path is convenient, but it's easy to construct cases where specific storage_options either can't easily be serialized to a string, or where, say, a query-parameter serialization would introduce ambiguity because the options collide.
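To make that ambiguity concrete: an object key that legitimately contains a literal `?endpoint_url=` in its name is, after serialization, indistinguishable from a path carrying a serialized storage option. A small stdlib demonstration (the bucket/key names are made up):

```python
from urllib.parse import parse_qs, urlsplit

# A key whose name really ends in "?endpoint_url=..." ...
serialized = "s3://bucket/report?endpoint_url=https%3A%2F%2Fevil.example"

sr = urlsplit(serialized)
params = parse_qs(sr.query)

# ...parses exactly like a path that serialized an endpoint_url option:
# params == {"endpoint_url": ["https://evil.example"]}
# sr.path == "/report"  -- the '?' and everything after it is gone
```

There is no way to tell, from the string alone, which of the two meanings was intended.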

If we could collect a few examples that require a better way of serializing protocol, storage_options, and path, then we could potentially work out a better solution.

Some time ago, when I had a similar issue, I experimented with serializing to JSON and just storing the JSON object in my application code: https://github.com/Bayer-Group/pado/blob/635d7b8b57e527254d6302730100a6dab5a2095f/pado/io/files.py#L351-L386 but that is not universal either, because there are (edge) cases where your only option is to pickle the storage_options...
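A minimal stdlib sketch of that JSON approach (the function names here are hypothetical, not the pado API):

```python
import json


def dump_fspath(protocol, storage_options, path):
    # Serialize the three identifying pieces as one JSON document.
    # Raises TypeError if storage_options contains values that JSON
    # can't represent -- in those (edge) cases pickling remains the
    # only option.
    return json.dumps(
        {"protocol": protocol, "storage_options": storage_options, "path": path}
    )


def load_fspath(serialized):
    # Recover the three pieces; no string parsing, no ambiguity.
    obj = json.loads(serialized)
    return obj["protocol"], obj["storage_options"], obj["path"]


s = dump_fspath("s3", {"endpoint_url": "https://minio.example.com"}, "bucket/key")
protocol, options, path = load_fspath(s)
```

Because the three pieces stay separate in the JSON object, round-tripping is lossless for any JSON-serializable storage_options.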

P.S.: there are storage_args too, but so far I think I haven't seen an example where you couldn't map them to keyword arguments.

ap-- avatar Jan 31 '25 13:01 ap--

Here's an example of how this could be implemented. Serializing the endpoint_url as a query parameter didn't require subclassing S3FileSystem; that might be needed if you want to add the endpoint_url to the bucket somehow.

For real use this would of course need a bunch more tests, since there could be a few strange edge cases depending on the different path operations.

And as I said, I would recommend just designing the application so that it can store the storage_options together with the url.

from urllib.parse import parse_qs
from urllib.parse import urlencode
from urllib.parse import urlsplit

from upath.implementations.cloud import S3Path
from upath.registry import register_implementation


class S3EndpointPath(S3Path):

    @classmethod
    def _transform_init_args(cls, args, protocol, storage_options):
        args, protocol, storage_options = super()._transform_init_args(
            args, protocol, storage_options
        )

        # pop a serialized endpoint_url out of the path's query string
        sr = urlsplit(str(args[0]))
        query_params = parse_qs(sr.query)
        if ep := query_params.pop("endpoint_url", []):
            if len(ep) == 1:
                if "endpoint_url" in storage_options and ep[0] != storage_options["endpoint_url"]:
                    raise ValueError("incompatible endpoint_urls")
                storage_options.setdefault("endpoint_url", ep[0])
            else:
                raise ValueError("multiple endpoint_url query parameters")
        # rebuild the path string without the consumed query parameter
        args0 = sr._replace(query=urlencode(query_params, doseq=True)).geturl()

        return (args0, *args[1:]), protocol, storage_options

    def __str__(self):
        # append the endpoint_url back onto the string representation
        urlpath = super().__str__()
        if endpoint_url := self._storage_options.get("endpoint_url"):
            qs = urlencode({"endpoint_url": endpoint_url})
            urlpath = f"{urlpath}?{qs}"
        return urlpath


# replace the default "s3" implementation with the subclass
register_implementation("s3", S3EndpointPath, clobber=True)


if __name__ == "__main__":
    from upath import UPath

    p0 = UPath("s3://bucket/abc")
    assert isinstance(p0, S3EndpointPath)
    assert str(p0) == "s3://bucket/abc"
    assert p0.storage_options == {}
    assert p0.parts == ("bucket/", "abc")

    p1 = UPath("s3://bucket/abc?endpoint_url=https%3A%2F%2Fabc.com")
    assert str(p1) == "s3://bucket/abc?endpoint_url=https%3A%2F%2Fabc.com"
    assert p1.storage_options == {"endpoint_url": "https://abc.com"}
    assert p1.parts == ("bucket/", "abc")

    p2 = p1.joinpath("efg")
    assert isinstance(p2, S3EndpointPath)
    assert str(p2) == "s3://bucket/abc/efg?endpoint_url=https%3A%2F%2Fabc.com", str(p2)
    assert p2.storage_options == {"endpoint_url": "https://abc.com"}
    assert p2.parts == ("bucket/", "abc", "efg")

ap-- avatar Feb 02 '25 22:02 ap--

Hi @ap--, thank you very much for your answer. Yes, I agree with your points about generality. And thank you also for the subclass example, very useful.

Koncopd avatar Feb 03 '25 12:02 Koncopd