Should all stores have a `path` attribute that functions like a URL's path?
We are not consistent about the path attribute of stores right now. `LocalStore` has a `root` attribute that's basically a path (with type `pathlib.Path`), `RemoteStore` has a v2-style explicit `path` attribute (with type `str`), `ZipStore` has a `path` attribute but I think its semantics are not like `RemoteStore.path`, since `ZipStore.path` points to a location outside the zip file system, etc.
I think we should normalize the store.path attribute along the following lines:
- `store.path` is a string that names a location inside the filesystem modeled by the store class. All stores have such a `path` attribute.
- new store class instances can be created via methods like `store_instance.with_path('new_path')` (completely new path) or `store_instance.join_path('relative_path')` (old path + relative_path), or `store_instance / 'relative_path'` as an abbreviation of the last one.
- The `StorePath` class goes away entirely, because we are giving stores their own path.
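To make the three bullets concrete, here is a minimal sketch of what the path-manipulation methods could look like. `SketchStore` is a hypothetical class written for illustration only, not an existing zarr-python store, and the slash-stripping behavior is an assumption rather than part of the proposal.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SketchStore:
    """Hypothetical store; not an existing zarr-python class."""

    path: str = ""

    def with_path(self, new_path: str) -> SketchStore:
        # Replace the path entirely, keeping all other store state.
        return replace(self, path=new_path.strip("/"))

    def join_path(self, relative_path: str) -> SketchStore:
        # Append a relative segment to the current path.
        parts = [p for p in (self.path, relative_path.strip("/")) if p]
        return replace(self, path="/".join(parts))

    def __truediv__(self, relative_path: str) -> SketchStore:
        # `store / "a"` is shorthand for `store.join_path("a")`.
        return self.join_path(relative_path)


root = SketchStore()
group = root / "foo" / "bar"
print(group.path)                     # foo/bar
print(group.with_path("other").path)  # other
print(repr(root.path))                # '' (the original instance is unchanged)
```

Every method returns a new instance, which mirrors how `yarl.URL` objects are immutable and derived from one another.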
(This API is largely influenced by the yarl.URL class, which I have been using quite a bit recently)
Thoughts?
After looking into this a bit more and starting a draft implementation of this idea, I noticed that the semantics of some of the store operations vary too much across the different store implementations. If we compare `MemoryStore` and `RemoteStore`:
- `MemoryStore` has no `path` attribute; `MemoryStore.list` returns all keys in the store. This is not scalable for stores with many keys.
- `RemoteStore` has a `path` attribute; `RemoteStore.list` only returns the keys that have `RemoteStore.path` as a leading prefix. This makes sense, because attempting to list all the keys in s3 would not be very useful.
I'm increasingly convinced that all stores should have an interface that works for RemoteStore, i.e. a path attribute that serves to narrow the scope of list / create / get actions in the key space of the store.
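As a rough illustration of that scoping behavior, the sketch below applies a `path` prefix to an in-memory key space. `PrefixedMemoryStore` is a hypothetical class, not the real `MemoryStore`, and returning keys relative to the prefix from `list` is an assumption made here for readability.

```python
from __future__ import annotations


class PrefixedMemoryStore:
    """Hypothetical in-memory store whose operations are scoped by `path`."""

    def __init__(self, data: dict[str, bytes] | None = None, path: str = "") -> None:
        # The dict is shared between all views created via `with_path`.
        self._data = {} if data is None else data
        self.path = path.strip("/")

    def _full_key(self, key: str) -> str:
        return f"{self.path}/{key}" if self.path else key

    def set(self, key: str, value: bytes) -> None:
        self._data[self._full_key(key)] = value

    def get(self, key: str) -> bytes | None:
        return self._data.get(self._full_key(key))

    def list(self) -> list[str]:
        # Only keys under `path` are visible, reported relative to it.
        prefix = f"{self.path}/" if self.path else ""
        return sorted(k[len(prefix):] for k in self._data if k.startswith(prefix))

    def with_path(self, new_path: str) -> PrefixedMemoryStore:
        return PrefixedMemoryStore(self._data, new_path)


root = PrefixedMemoryStore()
root.set("a/zarr.json", b"{}")
root.set("a/c/0", b"\x00")
root.set("b/zarr.json", b"{}")

scoped = root.with_path("a")
print(root.list())    # ['a/c/0', 'a/zarr.json', 'b/zarr.json']
print(scoped.list())  # ['c/0', 'zarr.json']
```

With an empty `path` the store behaves like today's `MemoryStore` and lists everything; a non-empty `path` gives the narrower, `RemoteStore`-like view.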
Just pushed some code related to this effect:
https://github.com/scverse/anndata/pull/2121/files#diff-f5b4358f8870324bb315dafb1f17a3283572741750b2a0c1326811342222fb82R183-R191
I looked into it a bit too, and the zarr-python `ObjectStore` has no relevant attribute, even though it would make sense for it to have one, but I think the underlying store has something: https://developmentseed.org/obstore/latest/api/store/gcs/#obstore.store.GCSStore.prefix
I'm going to hopefully enumerate the special cases, but the use-case here is generating unique ids for zarr arrays so dask doesn't have to do its (relatively slow) tokenization procedure.
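For context, one way such an id could be derived if every store exposed a string `path` is sketched below. `array_token` and its arguments are hypothetical and are not taken from the linked PR; the sketch assumes that store type plus location is enough to identify an array, which ignores changes to the data at that location.

```python
import hashlib


def array_token(store_type: str, store_path: str, array_path: str) -> str:
    """Hypothetical helper: derive a deterministic id from location alone."""
    # Normalize slashes so equivalent locations produce the same string.
    parts = [p.strip("/") for p in (store_path, array_path)]
    location = "/".join(p for p in parts if p)
    digest = hashlib.sha256(f"{store_type}:{location}".encode()).hexdigest()
    return f"zarr-{digest[:16]}"


a = array_token("RemoteStore", "bucket/data.zarr", "group/x")
b = array_token("RemoteStore", "bucket/data.zarr/", "/group/x")
print(a == b)  # True: equivalent locations give the same id
```

A string like this could then be passed as the `name` argument of `dask.array.from_array`, so dask does not need to tokenize the array itself.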
i imagine ObjectStore doesn't have a path because we don't require it as part of the store ABC. Because Zarr IO is fundamentally hierarchical, I think it would be reasonable to add path / prefix semantics to the base store ABC (and then we can totally remove the StorePath class)
I'd like to think about how this will impact the Icechunk store. Like the `ZipStore`, Icechunk stores can only be opened from their root. It's not clear to me if the proposal is for the `path` attribute to point to the root of the store or a sub path within the store.
i'm imagining that the path / prefix attribute doesn't impose any restrictions on the API used for opening the store. it just scopes the operations that the store will perform to only those relative to the prefix. This is basically what StorePath does today, and StorePath is literally just a product type with a store and a path.