Redefine Virtual Readers as `func(filepath) -> ManifestStore`
Once we've completed https://github.com/zarr-developers/VirtualiZarr/issues/473, the responsibility of each reader will essentially be reduced to just creating a fully virtual ManifestStore instance from the given filepath.
This means that we could redefine what a "virtual reader" is, as simply a filetype-specific function of the form
def virtual_reader(filepath: str) -> ManifestStore:
    ...
where there is one of these readers for TIFF files, one for HDF files, etc. etc.
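For illustration, here is a minimal sketch of how that might hang together - the reader function names, the import path, and the registry are assumptions for the sake of the example, not existing VirtualiZarr API:

```python
# Hypothetical sketch: one reader per filetype, all with the same signature.
# The import path, reader names and registry are illustrative assumptions.
from virtualizarr.manifests import ManifestStore


def hdf_virtual_reader(filepath: str) -> ManifestStore:
    """Build a fully-virtual ManifestStore for an HDF5/netCDF4 file."""
    ...


def tiff_virtual_reader(filepath: str) -> ManifestStore:
    """Build a fully-virtual ManifestStore for a TIFF file."""
    ...


# open_virtual_dataset would then only need to look up the reader for the detected filetype
VIRTUAL_READERS = {
    "hdf5": hdf_virtual_reader,
    "netcdf4": hdf_virtual_reader,
    "tiff": tiff_virtual_reader,
}
```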
As described in #473, loadable_variables is then never passed to the reader function at all - instead it's always handled by virtualizarr in some subsequent code like this:
manifeststore = virtual_reader(filepath)
# exclude the loadable_variables from the virtual dataset; they get loaded via open_zarr instead
ds_virtual = manifeststore.to_virtual_ds().drop_vars(loadable_variables)
ds_loadable = xr.open_zarr(manifeststore, drop_variables=list(ds_virtual.variables))
vds = xr.merge([ds_virtual, ds_loadable])
This is mostly great, BUT it does cause some issues for the special set of readers that read from other virtual reference formats (i.e. DMR++ and Kerchunk) - see https://github.com/zarr-developers/VirtualiZarr/pull/477#issuecomment-2744448626.
Another issue it causes is that it becomes tricky to solve https://github.com/zarr-developers/VirtualiZarr/issues/489 - when reading inlined data the kerchunk reader would prefer to return a loadable variable, but this manifeststore-first architecture doesn't really allow that. Unless we used the ability of obstore to wrap a MemoryStore too...?
cc @maxrjones
This would also mean we had no need for a more complex entrypoint system like https://github.com/zarr-developers/VirtualiZarr/issues/245
How do you think we should handle storage configuration in this new virtual reader structure? We could have users separately configure an ObjectStore instance and pass that to the reader:
def virtual_reader(filepath: str, store: ObjectStore) -> ManifestStore:
    ...
Another option would be to accept arbitrary kwargs for storage creation and try to infer the storage type from the url:
def virtual_reader(url: str, storage_options: dict) -> ManifestStore:
    ...
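For example, a call using this second option might look something like the following (the keys shown are fsspec-style and purely illustrative):

```python
# Hypothetical usage of the storage_options approach; which keys are accepted
# would depend on whichever IO library ends up backing the reader.
manifeststore = virtual_reader(
    "s3://my-bucket/path/to/data.nc",
    storage_options={"anon": True, "client_kwargs": {"region_name": "us-west-2"}},
)
```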
Or perhaps we could define some sort of configuration container similar to Icechunk's approach.
The first option (separate ObjectStore configuration) would certainly be the simplest to implement and maintain. The main downsides are that users would need to understand a bit about obstore (though it has good documentation) and that it would probably lock VirtualiZarr into using obstore for accessing files, whereas the second or third option could be implemented to accept either fsspec or obstore configuration options. To me, supporting multiple object/file abstractions sounds unnecessarily difficult to maintain, so I'd currently lean towards the first approach of explicit store creation by the user.
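Concretely, from the user's side the first option could look something like this (a sketch assuming obstore's S3Store constructor; the bucket name, region and reader call are placeholders):

```python
# Explicit store creation by the user (option 1), assuming obstore's S3Store.
from obstore.store import S3Store

store = S3Store(
    "my-bucket",          # placeholder bucket
    region="us-west-2",   # placeholder region
    skip_signature=True,  # e.g. for public-access data
)
manifeststore = virtual_reader("s3://my-bucket/path/to/data.nc", store=store)
```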
I think explicitly creating a store seems fine
Also could we abstract this away for the user? So that developers of new backends have to understand obstore but people calling open_virtual_dataset('known_filetype.nc') don't?
> Also could we abstract this away for the user? So that developers of new backends have to understand obstore but people calling open_virtual_dataset('known_filetype.nc') don't?
I've been thinking about this a bit and I think an appropriate balance could be abstracting it away for local stores or object stores that do not need extra configuration. Inspired by how Zarr handles the prototype=None default for get methods, we could have a function signature of def virtual_reader(filepath: str, store: ObjectStore = None) -> ManifestStore:. If store=None we would call default_object_store() to infer the protocol and prefix from the filepath, then create the relevant store instance.
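Here is a minimal sketch of what such a default_object_store() helper could look like, assuming obstore's LocalStore / MemoryStore / S3Store / GCSStore / AzureStore classes and inferring only from the URL scheme (handling of the prefix/path within the store is elided):

```python
from urllib.parse import urlparse

from obstore.store import AzureStore, GCSStore, LocalStore, MemoryStore, S3Store


def default_object_store(filepath: str):
    """Infer a default, unconfigured ObjectStore instance from the filepath's protocol."""
    parsed = urlparse(filepath)
    bucket = parsed.netloc
    if parsed.scheme in ("", "file"):
        return LocalStore()  # local filesystem
    if parsed.scheme == "memory":
        return MemoryStore()
    if parsed.scheme == "s3":
        return S3Store(bucket, skip_signature=True)  # no credentials: public buckets only
    if parsed.scheme in ("gs", "gcs"):
        return GCSStore(bucket)
    if parsed.scheme in ("az", "abfs"):
        return AzureStore(bucket)
    raise NotImplementedError(f"Unsupported protocol: {parsed.scheme!r}")
```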
The caveat is that this would only work for local, memory, or public-access S3/GCP/Azure files. In order to support cases where storage configuration is needed, we'd have to keep something like storage_options: dict = {} in the function signature. I personally don't like pass through kwarg dicts due to them being harder to document and type, but I also understand why they exist.
A totally different approach would be to use config settings similar to Icechunk.
> Or perhaps we could define some sort of configuration container similar to Icechunk's approach.
Obstore already defines "configuration containers" in the sense of S3Config. You can just reuse that instead of defining your own configuration.
> If store=None we would call default_object_store() to infer the protocol and prefix from the filepath, then create the relevant store instance.
I realised I now need something like default_object_store() to continue with the refactor in #547 (comment).
There's another subtlety exposed there, which is that it's perfectly reasonable for a user to create a ManifestStore that they never plan to read data from, and hence doesn't technically need to be backed by any obstore store. An example would be using VirtualiZarr to translate pre-existing Kerchunk/DMR++ references into Icechunk. But perhaps we should just make a default object store in this situation anyway.
> Obstore already defines "configuration containers" in the sense of S3Config. You can just reuse that instead of defining your own configuration.
These seem like basically the same things as Icechunk's VirtualChunkContainers. We've kind of got to a point here where both VirtualiZarr and Icechunk have independent implementations of in-memory zarr stores containing virtual references, backed by rust-powered IO libraries that can read from remote object storage. I'm not sure how I feel about this.
> > If store=None we would call default_object_store() to infer the protocol and prefix from the filepath, then create the relevant store instance.

> I realised I now need something like default_object_store() to continue with the refactor in #547 (comment).
I can take a go at building this today.
> There's another subtlety exposed there, which is that it's perfectly reasonable for a user to create a ManifestStore that they never plan to read data from, and hence doesn't technically need to be backed by any obstore store. An example would be using VirtualiZarr to translate pre-existing Kerchunk/DMR++ references into Icechunk. But perhaps we should just make a default object store in this situation anyway.
I can understand why you'd want to use ManifestStore in the case in which additional variables should be loaded, but I don't understand why you'd use it as a container for ManifestArrays rather than going directly to a dataset. Can you explain the motivation?
> > Obstore already defines "configuration containers" in the sense of S3Config. You can just reuse that instead of defining your own configuration.

> These seem like basically the same things as Icechunk's VirtualChunkContainers. We've kind of got to a point here where both VirtualiZarr and Icechunk have independent implementations of in-memory zarr stores containing virtual references, backed by rust-powered IO libraries that can read from remote object storage. I'm not sure how I feel about this.
I view ManifestStore as a really lightweight, read-only store implementation which seems sufficiently distinct from all the functionality in Icechunk's store implementation. The potential redundancy or divergence to more thoroughly consider is between obstore and icechunk, but this isn't really the place for those discussions.
> I can take a go at building this today.
No pressure to do it today! Would be useful to have during the latter half of next week though.
> going directly to a dataset
You make a good point that the ManifestStore is less useful when you're not planning to load any variables, but if I went straight to a dataset then I would have one special codepath only for kerchunk / DMR++ files, while every other reader went via a ManifestStore. It would break the definition of reader == func(filepath) -> ManifestStore.
> I view ManifestStore as a really lightweight, read-only store implementation which seems sufficiently distinct from all the functionality in Icechunk's store implementation. The potential redundancy or divergence to more thoroughly consider is between obstore and icechunk, but this isn't really the place for those discussions.
👍 I'm just trying to lay out thoughts about the commonalities as I realise them.
> > Or perhaps we could define some sort of configuration container similar to Icechunk's approach.

> Obstore already defines "configuration containers" in the sense of S3Config. You can just reuse that instead of defining your own configuration.
We're looking to use these as a runtime container rather than just for type-checking. Would you consider making S3Config importable at runtime?
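For context, a tiny sketch of the kind of runtime usage being asked about (at the time of this comment S3Config was a typing-only construct, which is exactly the point of the question; the specific keys and the config kwarg are illustrative assumptions):

```python
# Sketch of what "runtime container" usage would look like if S3Config were
# importable at runtime (assumes S3Store accepts a config mapping; keys illustrative).
from obstore.store import S3Config, S3Store  # the runtime import being requested

config: S3Config = {"region": "us-west-2", "skip_signature": True}
store = S3Store("my-bucket", config=config)
```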
Completed by #601