Serializing an s3-compatible endpoint URL within path strings
I want to discuss this.
We manage a lot of different paths and also need to support paths with s3-compatible endpoint URLs alongside normal s3 paths. I understand that an endpoint URL can easily be passed via UPath("s3://bucket/key", endpoint_url=endpoint_url).
However, it really feels like the endpoint URL is part of the path itself, and it would be very useful to have something like UPath("s3://bucket?endpoint_url/key") or UPath("s3://bucket/key?endpoint_url"). Storing such path strings is also easier, as is differentiating them from normal s3 paths. What do you think?
Hi @Koncopd,
Thanks for opening the issue ❤
It's possible to do what you want by subclassing S3FileSystem and S3Path and registering both in your code. I'll provide a code example later today.
I also understand that sometimes it's just more convenient to have a single string representation for a path. It makes passing around everything just a bit easier.
In general though, I would discourage it. If you want to be as flexible as possible in supporting filesystem_spec filesystems, you need to accept that a path is clearly defined via three items:

the filesystem
- protocol (indirectly defines the filesystem_spec AbstractFileSystem subclass)
- storage_options (defines how the filesystem is instantiated, and potentially contains chained filesystems)

and the path on the filesystem
- path (i.e. some string that points to a resource)
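To make this concrete, the three pieces could be bundled in a small container in your application code. This is just a sketch; PathSpec and its fields are hypothetical names I'm introducing here, not part of universal_pathlib:

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class PathSpec:
    """Hypothetical container bundling the three pieces of a unique path."""

    protocol: str  # indirectly selects the AbstractFileSystem subclass
    path: str  # some string pointing to a resource on that filesystem
    storage_options: dict[str, Any] = field(default_factory=dict)

    def to_upath(self):
        # Reconstruct a UPath on demand (requires universal_pathlib).
        from upath import UPath

        return UPath(f"{self.protocol}://{self.path}", **self.storage_options)


spec = PathSpec("s3", "bucket/key", {"endpoint_url": "https://minio.example.com"})
```

Storing specs like this keeps the storage_options structured instead of forcing them through a string representation.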
So it's better and more flexible if your application is designed to store these three pieces of information for a unique path. The fact that, for some filesystems, some of the storage_options can be serialized into the path is convenient, but it's easy to construct cases where specific storage_options either can't easily be serialized to a string, or where, say, a query-parameter serialization would introduce ambiguity because the options collide.
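As a small illustration of why a flat query-string serialization is lossy: nested storage_options (client_kwargs is a real s3fs option that takes a dict) don't survive the round trip, because urlencode stringifies values:

```python
from urllib.parse import parse_qs, urlencode

# Nested storage_options, as s3fs accepts via client_kwargs.
opts = {
    "endpoint_url": "https://abc.com",
    "client_kwargs": {"region_name": "us-east-1"},
}

encoded = urlencode(opts)  # str()-ifies the nested dict before quoting
decoded = {k: v[0] for k, v in parse_qs(encoded).items()}

# The flat value survives, but the nested dict comes back as its repr string:
assert decoded["endpoint_url"] == "https://abc.com"
assert isinstance(decoded["client_kwargs"], str)
assert decoded["client_kwargs"] != opts["client_kwargs"]
```

So any scheme that flattens storage_options into the path needs extra conventions to recover the original types.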
If we could collect a few examples that require a better way of serializing protocol, storage_options, and path, then we could potentially work out a better solution.
Some time ago, when I had a similar issue, I experimented with serializing to JSON and just storing the JSON object in my application code: https://github.com/Bayer-Group/pado/blob/635d7b8b57e527254d6302730100a6dab5a2095f/pado/io/files.py#L351-L386 But that is not universal either, because there are (edge) cases where your only option is to pickle the storage_options...
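In that spirit, a minimal version of the JSON approach with a pickle fallback could look like the sketch below. The function names are hypothetical, and this is not the pado implementation linked above:

```python
import base64
import json
import pickle


def dump_storage_options(storage_options: dict) -> str:
    """Serialize storage_options to a string, preferring JSON."""
    try:
        return json.dumps({"fmt": "json", "data": storage_options})
    except TypeError:
        # Fall back to pickle for values JSON can't represent.
        blob = base64.b64encode(pickle.dumps(storage_options)).decode("ascii")
        return json.dumps({"fmt": "pickle", "data": blob})


def load_storage_options(serialized: str) -> dict:
    """Inverse of dump_storage_options."""
    obj = json.loads(serialized)
    if obj["fmt"] == "json":
        return obj["data"]
    return pickle.loads(base64.b64decode(obj["data"]))


# Round trip with JSON-friendly options:
s = dump_storage_options({"endpoint_url": "https://abc.com"})
assert load_storage_options(s) == {"endpoint_url": "https://abc.com"}
```

The usual pickle caveats apply: only load data you trust, and pickles are not portable across library versions.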
P.S.: there are storage_args too, but so far I don't think I've seen an example where you couldn't map them to keyword arguments.
Here's an example of how this could be implemented. Serializing the endpoint_url as a query parameter didn't require subclassing S3FileSystem; that might be needed if you want to add the endpoint_url to the bucket somehow.
For production use this would of course need a bunch more tests, since there could be a few strange edge cases depending on the different path operations.
And as I said, I would recommend just designing the application to be able to store storage_options with the url.
from urllib.parse import parse_qs
from urllib.parse import urlencode
from urllib.parse import urlsplit

from upath.implementations.cloud import S3Path
from upath.registry import register_implementation


class S3EndpointPath(S3Path):
    @classmethod
    def _transform_init_args(cls, args, protocol, storage_options):
        args, protocol, storage_options = super()._transform_init_args(
            args, protocol, storage_options
        )
        # Pull a possible endpoint_url out of the path's query string.
        sr = urlsplit(str(args[0]))
        query_params = parse_qs(sr.query)
        if ep := query_params.pop("endpoint_url", []):
            if len(ep) == 1:
                if "endpoint_url" in storage_options and ep[0] != storage_options["endpoint_url"]:
                    raise ValueError("incompatible endpoint_urls")
                storage_options.setdefault("endpoint_url", ep[0])
            elif len(ep) > 1:
                raise ValueError("multiple endpoint_url query parameters")
        # Rebuild the path without the consumed query parameter.
        args0 = sr._replace(query=urlencode(query_params, doseq=True)).geturl()
        return (args0, *args[1:]), protocol, storage_options

    def __str__(self):
        # Re-append the endpoint_url as a query parameter when serializing.
        urlpath = super().__str__()
        if endpoint_url := self._storage_options.get("endpoint_url"):
            qs = urlencode({"endpoint_url": endpoint_url})
            urlpath = f"{urlpath}?{qs}"
        return urlpath


register_implementation("s3", S3EndpointPath, clobber=True)


if __name__ == "__main__":
    from upath import UPath

    p0 = UPath("s3://bucket/abc")
    assert isinstance(p0, S3EndpointPath)
    assert str(p0) == "s3://bucket/abc"
    assert p0.storage_options == {}
    assert p0.parts == ("bucket/", "abc")

    p1 = UPath("s3://bucket/abc?endpoint_url=https%3A%2F%2Fabc.com")
    assert str(p1) == "s3://bucket/abc?endpoint_url=https%3A%2F%2Fabc.com"
    assert p1.storage_options == {"endpoint_url": "https://abc.com"}
    assert p1.parts == ("bucket/", "abc")

    p2 = p1.joinpath("efg")
    assert isinstance(p2, S3EndpointPath)
    assert str(p2) == "s3://bucket/abc/efg?endpoint_url=https%3A%2F%2Fabc.com", str(p2)
    assert p2.storage_options == {"endpoint_url": "https://abc.com"}
    assert p2.parts == ("bucket/", "abc", "efg")
Hi @ap--, thank you very much for your answer. Yes, I agree with your points about generality. And thank you also for the subclass example, it's very useful.