Function definition for `open_virtual_dataset` in next major release
We discussed a pre-release of VirtualiZarr 2.0 within the next week. I wanted to raise some discussion points for the public open_virtual_dataset function definition.
While not included in the docstring, the backend parameter has been added to the function def since the last release. This means there are currently two ways to set the backend, filetype and backend. In the zen of Python, I think there should just be one way to specify a backend. Could we consolidate these parameters?
The internal _create_manifest_store(...) methods include a store parameter that can be used to virtualize objects requiring credentials. There's no equivalent option in open_virtual_dataset. Could both reader_options be consolidated with virtual_backend_kwargs to avoid the use of different function signatures depending on whether a backend uses Fsspec or Obstore.
This is a nit, but filepath is really filesystem oriented language and I think VirtualiZarr is about providing cloud native access patterns, such that URL would be more appropriate.
I think any of these changes could be made with backwards compatibility + a warning about the new syntax
I'm taking a "speak now or forever hold your peace" approach to this issue and hope people don't mind 🤞. I would prefer a keep_variables argument over a drop_variables argument (similar to this request in xarray). I understand that the original design was to mimic open_dataset. The tradeoffs here involve users needing to know which variables they want to drop and many readers first virtualizing all variables, then dropping some rather than the more efficient path of only virtualizing variables that will be kept.
I think there should just be one way to specify a backend. Could we consolidate these parameters?
I agree that this has become messy. Could consolidating everything just look something like this?
from typing import Callable, TypeAlias
from virtualizarr.manifests import ManifestStore
T_VirtualReader: TypeAlias = Callable[[str], ManifestStore]
def open_virtual_dataset(
uri: str,
reader: str | T_VirtualReader = None,
reader_options: dict = {},
) -> xr.Dataset:
whose use would look like
vds = open_virtual_dataset('file.tif', reader='tiff')
from virtualizarr.readers import tiff_reader
vds = open_virtual_dataset('file.tif', reader=tiff_reader)
from other.package import third_party_reader
vds = open_virtual_dataset('file.third', reader=third_party_reader)
where reader as a string maps to some readers that are shipped with VirtualiZarr.
I would prefer a keep_variables argument over a drop_variables argument (similar to this https://github.com/pydata/xarray/issues/1754).
I don't really have a strong opinion about this, but I do think there will be pros and cons however we handle it.
@maxrjones We had a bit of conversation around this while at CNG. @TomNicholas made the great point that (please correct me if I'm paraphrasing incorrectly here Tom), that most Virtualizarr users will likely be more accustomed to lower level details than a typical Xarray user. With that in mind I think we can extend the suggestion from Tom's comment to pass more explicit arguments a bit further.
@TomNicholas example shows using a reader and a uri string, but as we've encountered over the past few weeks, managing the logic to configure obstore store instances with internal logic can be a lot to manage. This complication would likely increase once users also need local caching and custom obstore credential providers. With this in mind I think it might be a good approach to have open_virtual_dataset use an explicit obstore ReadableFile as an argument and defer all of the store instantiation and file uri determination to the user leaving us with
from typing import Callable, TypeAlias
from obstore import ReadableFile
from virtualizarr.manifests import ManifestStore
T_VirtualReader: TypeAlias = Callable[[str], ManifestStore]
def open_virtual_dataset(
file: ReadableFile,
reader: str | T_VirtualReader = None,
) -> xr.Dataset:
We can then expand the documentation (or where possible point to improved obstore docs) to show some common usage patterns for
- Instantiating an
obstoreand usingfrom_url. - Reading from remote object storage into an in-memory or local disk cache prior to using the VirtualReader by using
obstore.LocalStoreorobstore.MemoryStore.
Though this will require a bit more up front work from users I think it provides us a better separation of concerns and we can defer any issues obstore construction upstream to @kylebarron for review 😆 .
If no one has any objections I'll try to kick off a P.R this week to close https://github.com/zarr-developers/VirtualiZarr/issues/498 using this approach.
@sharkinsspatial thank you for the summary of the CNG discussions! I generally approve of moving towards more user-defined behavior. There are components of your example that do not make sense to me yet.
An example of Callable[str] in the T_VirtualReader type alias would be the HDF reader, right? Why is this interchangeable with a ManifestStore? a file-specific backend and a ManifestStore serve two different purposes.
ManifestStore requires the get_range_async method for loading variables, which is not available via a ReadableFile. How would we go from a ReadableFile to a ManifestStore? Is @kylebarron planning to add a public store property to the ReadableFile class in obstore?
@sharkinsspatial @TomNicholas I think I understand your proposal now, but think there's a couple minor changes required for it to work with ManifestStore. I updated https://github.com/zarr-developers/VirtualiZarr/pull/568 with a new proposal if you'd be willing to take a look.
open_virtual_datasetuse an explicitobstoreReadableFile
Why do you want to work with ReadableFile? It would be better to work with the raw obstore calls. ReadableFile will only be buffered as long as you don't seek. Whenever you seek the internal buffer is discarded I think.
Why do you want to work with ReadableFile? It would be better to work with the raw obstore calls.
I'm struggling to imagine what that would look like in this context - could you elaborate? Conceptually a virtualizarr reader accepts a file, obstore/spec is just needed to allow us to access the file remotely / cache it.