VirtualiZarr icon indicating copy to clipboard operation
VirtualiZarr copied to clipboard

Function definition for `open_virtual_dataset` in next major release

Open maxrjones opened this issue 8 months ago • 9 comments

We discussed a pre-release of VirtualiZarr 2.0 within the next week. I wanted to raise some discussion points for the public open_virtual_dataset function definition.

While not included in the docstring, the backend parameter has been added to the function def since the last release. This means there are currently two ways to set the backend, filetype and backend. In the zen of Python, I think there should just be one way to specify a backend. Could we consolidate these parameters?

The internal _create_manifest_store(...) methods include a store parameter that can be used to virtualize objects requiring credentials. There's no equivalent option in open_virtual_dataset. Could both reader_options be consolidated with virtual_backend_kwargs to avoid the use of different function signatures depending on whether a backend uses Fsspec or Obstore.

This is a nit, but filepath is really filesystem oriented language and I think VirtualiZarr is about providing cloud native access patterns, such that URL would be more appropriate.

maxrjones avatar Apr 18 '25 19:04 maxrjones

I think any of these changes could be made with backwards compatibility + a warning about the new syntax

maxrjones avatar Apr 18 '25 19:04 maxrjones

I'm taking a "speak now or forever hold your peace" approach to this issue and hope people don't mind 🤞. I would prefer a keep_variables argument over a drop_variables argument (similar to this request in xarray). I understand that the original design was to mimic open_dataset. The tradeoffs here involve users needing to know which variables they want to drop and many readers first virtualizing all variables, then dropping some rather than the more efficient path of only virtualizing variables that will be kept.

maxrjones avatar Apr 18 '25 21:04 maxrjones

I think there should just be one way to specify a backend. Could we consolidate these parameters?

I agree that this has become messy. Could consolidating everything just look something like this?

from typing import Callable, TypeAlias

from virtualizarr.manifests import ManifestStore


T_VirtualReader: TypeAlias = Callable[[str], ManifestStore]


def open_virtual_dataset(
    uri: str,
    reader: str | T_VirtualReader = None,
    reader_options: dict = {},
) -> xr.Dataset:

whose use would look like

vds = open_virtual_dataset('file.tif', reader='tiff')
from virtualizarr.readers import tiff_reader

vds = open_virtual_dataset('file.tif', reader=tiff_reader)
from other.package import third_party_reader

vds = open_virtual_dataset('file.third', reader=third_party_reader)

where reader as a string maps to some readers that are shipped with VirtualiZarr.

TomNicholas avatar Apr 18 '25 21:04 TomNicholas

I would prefer a keep_variables argument over a drop_variables argument (similar to this https://github.com/pydata/xarray/issues/1754).

I don't really have a strong opinion about this, but I do think there will be pros and cons however we handle it.

TomNicholas avatar Apr 18 '25 21:04 TomNicholas

@maxrjones We had a bit of conversation around this while at CNG. @TomNicholas made the great point that (please correct me if I'm paraphrasing incorrectly here Tom), that most Virtualizarr users will likely be more accustomed to lower level details than a typical Xarray user. With that in mind I think we can extend the suggestion from Tom's comment to pass more explicit arguments a bit further.

@TomNicholas example shows using a reader and a uri string, but as we've encountered over the past few weeks, managing the logic to configure obstore store instances with internal logic can be a lot to manage. This complication would likely increase once users also need local caching and custom obstore credential providers. With this in mind I think it might be a good approach to have open_virtual_dataset use an explicit obstore ReadableFile as an argument and defer all of the store instantiation and file uri determination to the user leaving us with

from typing import Callable, TypeAlias

from obstore import ReadableFile
from virtualizarr.manifests import ManifestStore


T_VirtualReader: TypeAlias = Callable[[str], ManifestStore]


def open_virtual_dataset(
    file: ReadableFile,
    reader: str | T_VirtualReader = None,
) -> xr.Dataset:

We can then expand the documentation (or where possible point to improved obstore docs) to show some common usage patterns for

  1. Instantiating an obstore and using from_url.
  2. Reading from remote object storage into an in-memory or local disk cache prior to using the VirtualReader by using obstore.LocalStore or obstore.MemoryStore.

Though this will require a bit more up front work from users I think it provides us a better separation of concerns and we can defer any issues obstore construction upstream to @kylebarron for review 😆 .

If no one has any objections I'll try to kick off a P.R this week to close https://github.com/zarr-developers/VirtualiZarr/issues/498 using this approach.

sharkinsspatial avatar May 05 '25 20:05 sharkinsspatial

@sharkinsspatial thank you for the summary of the CNG discussions! I generally approve of moving towards more user-defined behavior. There are components of your example that do not make sense to me yet.

An example of Callable[str] in the T_VirtualReader type alias would be the HDF reader, right? Why is this interchangeable with a ManifestStore? a file-specific backend and a ManifestStore serve two different purposes.

ManifestStore requires the get_range_async method for loading variables, which is not available via a ReadableFile. How would we go from a ReadableFile to a ManifestStore? Is @kylebarron planning to add a public store property to the ReadableFile class in obstore?

maxrjones avatar May 05 '25 22:05 maxrjones

@sharkinsspatial @TomNicholas I think I understand your proposal now, but think there's a couple minor changes required for it to work with ManifestStore. I updated https://github.com/zarr-developers/VirtualiZarr/pull/568 with a new proposal if you'd be willing to take a look.

maxrjones avatar May 05 '25 22:05 maxrjones

open_virtual_dataset use an explicit obstore ReadableFile

Why do you want to work with ReadableFile? It would be better to work with the raw obstore calls. ReadableFile will only be buffered as long as you don't seek. Whenever you seek the internal buffer is discarded I think.

kylebarron avatar May 08 '25 20:05 kylebarron

Why do you want to work with ReadableFile? It would be better to work with the raw obstore calls.

I'm struggling to imagine what that would look like in this context - could you elaborate? Conceptually a virtualizarr reader accepts a file, obstore/spec is just needed to allow us to access the file remotely / cache it.

TomNicholas avatar May 09 '25 00:05 TomNicholas