VirtualiZarr icon indicating copy to clipboard operation
VirtualiZarr copied to clipboard

Fragility of url auto-parsing logic

Open TomNicholas opened this issue 7 months ago • 8 comments

We have code to automatically interpret a URI as being a local file, S3-compatible store, or HTTPStore using Obstore.

https://github.com/zarr-developers/VirtualiZarr/blob/0bf830e63bc5bdedd87cd824d064071600d5b44a/virtualizarr/manifests/store.py#L142

But it feels a bit fragile and not really core to this package.

Can we just delegate the entire logic to another package? E.g. to obstore via https://github.com/developmentseed/obstore/blob/95e586fc6c8c9425fb682fdb9aedb16e0b0a2935/obstore/python/obstore/store.py#L656-L765

Related also to the API choice in #553, as we could choose to just make this the user's problem and stop guessing what they gave us.

cc @maxrjones

TomNicholas avatar Apr 23 '25 14:04 TomNicholas

Can we just delegate the entire logic to another package? E.g. to obstore via developmentseed/obstore@95e586f/obstore/python/obstore/store.py#L656-L765

I think this is the right call. That functionality wasn't in obstore when I started ManifestStore IIRC but now that it's released we should definitely use it.

Related also to the API choice in https://github.com/zarr-developers/VirtualiZarr/issues/553, as we could choose to just make this the user's problem and stop guessing what they gave us.

I was originally thinking we should make it the user's problem, but have come to recognize that people really rely on and expect auto-determination for storage interfaces and think we should design the API with that reality in mind.

maxrjones avatar Apr 23 '25 15:04 maxrjones

I'd tend to agree with @maxrjones here that the wide usage of fsspec has built an expectation of auto-determination in this space so if at all possible we should provide this type of interface with obvious escape hatches for user defined configuration. Though I'm not thrilled with how from_url uses passed config dicts to override the automatic behavior, I'd rather have all of this logic live in obstore than have us maintain some form of it ourselves.

While looking at from_url I did realize there is likely another deeper problem associated with our transition from fsspec to obstore. There are likely a large number of users relying on auth restricted HTTP endpoints and we won't have support for those via obstore currently https://github.com/developmentseed/obstore/issues/82.

sharkinsspatial avatar Apr 23 '25 15:04 sharkinsspatial

That functionality wasn't in obstore when I started ManifestStore IIRC but now that it's released we should definitely use it.

Ohh cool.

There are likely a large number of users relying on auth restricted HTTP endpoints and we won't have support for those via obstore currently

Uhh that does seem like a problem. I wonder if we can have some advanced API for opting in to using fsspec instead of obstore, but default to obstore

TomNicholas avatar Apr 23 '25 15:04 TomNicholas

While looking at from_url I did realize there is likely another deeper problem associated with our transition from fsspec to obstore. There are likely a large number of users relying on auth restricted HTTP endpoints and we won't have support for those via obstore currently developmentseed/obstore#82.

This is due to the underlying object_store trying to focus on emulating object stores and not being and arbitrary HTTP client. See the maintainer responses in https://github.com/apache/arrow-rs/issues/6989#issuecomment-2599641525 and https://github.com/apache/arrow-rs/issues/6989#issuecomment-2599842842, which I think make a lot of sense.

This is where obspec comes in. It defines core protocols (which currently match obstore APIs) to do object-storage like interactions. So if you needed get, get range, and list, then you'd make a new type hint for just those protocols

from obspec import GetAsync, GetRangeAsync, List

class RequiredMethodsByVirtualizarr(GetAsync, GetRangeAsync, List, Protocol):
    ....

def create_virtualizarr_reader(store: RequiredMethodsByVirtualizarr): ...

This means that obstore would work out of the box here, because it implements these protocols, but aiohttp would also work here, because it wouldn't be too hard to make an aiohttp-backed client that implements those three protocols. And then you could use the same core request logic but totally abstracted from whether it's obstore or aiohttp (or fsspec) making those requests under the hood. And then whenever you need custom HTTP authentication you could use the aiohttp reader.

The main problem I hit with obspec is that it endeavors to use structural instead of nominal subtyping. (e.g. structural means duck-typing & protocols; nominal means subclassing, where you can find an instance of a specific parent base class). But if we needed to create common base exceptions in obspec, we'd need to figure out a way to do structural exception comparison instead of (the usual) nominal exception comparison.

Perhaps we should punt the question of exceptions to later and publish an 0.1 release of obspec?

As a general matter, I think defining a protocol for the data access methods you need makes a lot of sense, whether or not that protocol is one you define yourself or one provided by obspec.

kylebarron avatar Apr 23 '25 21:04 kylebarron

Omg are you making fsspec but with an actual specification!?! (well a protocol)

(See https://github.com/fsspec/filesystem_spec/issues/1446)

TomNicholas avatar Apr 23 '25 21:04 TomNicholas

From https://github.com/fsspec/filesystem_spec/issues/1446#issuecomment-1843698132

The trouble that we are facing, is that not all implementations need to implement all functions. For instance, some filesystem might be read-only or (like HTTP) not support general listing. Further, many things are dynamic and almost all implementations come with many many options.

This IMO is the big reason for using flexible protocols instead of one massive base class, because then you find out at runtime that oh, no one of the implementations leaves something I need as NotImplemented. But with flexible protocols you can mix and match the methods provided by a class and the methods you require and let the type checker enforce that the consumer and provider are compatible.

And then the idea is that you can add middlewares of your choice, e.g. caching layers, on top of generic obspec protocols, so that the same caching will work whether it's obstore or aiohttp underneath.

Also, at least for now, obspec is a specification that just formalizes obstore's API, which is just a translation of the Rust object_store API (which I think is a very nice API). If obspec were to evolve independently of object_store's API, that might become a bit trickier to keep in sync over time (at least until object_store hits 1.0).

kylebarron avatar Apr 23 '25 22:04 kylebarron

(If obspec ever got traction and stabilized, we could swap it in to the zarr-obstore integration in https://github.com/zarr-developers/zarr-python/pull/1661 without any backwards incompatibilities)

kylebarron avatar Apr 23 '25 22:04 kylebarron

I think this is the right call. That functionality wasn't in obstore when I started ManifestStore IIRC but now that it's released we should definitely use it.

FWIW we could talk about exposing _parse_scheme publicly https://github.com/developmentseed/obstore/blob/95e586fc6c8c9425fb682fdb9aedb16e0b0a2935/obstore/python/obstore/store.py#L709. It uses the upstream ObjectStoreScheme::parse, though the source code isn't that big either: https://docs.rs/object_store/latest/src/object_store/parse.rs.html#105-139

kylebarron avatar Apr 23 '25 22:04 kylebarron

auto-parsing was removed by #601

maxrjones avatar Jun 16 '25 20:06 maxrjones