VirtualiZarr icon indicating copy to clipboard operation
VirtualiZarr copied to clipboard

Make xarray an optional dependency?

Open TomNicholas opened this issue 9 months ago • 4 comments

I realised that the in-progess ManifestStore refactor would actually allow us to separate concerns so much that we could potentially make xarray an optional dependency, where you only need xarray installed if you want to use its API to manipulate virtual zarr stores (e.g. by concatenating them).

The result could work like this:

# use a virtual reader directly - no xarray needed
ms: ManifestStore = manifeststore_from_hdf('file.nc')

# write to some virtual references format directly - no xarray needed
# this would use `IcechunkStore.set_virtual_refs()` as it currently does
ms.to_icechunk(icechunkstore)

or if you want to work in xarray space you can move to it:

# xarray required to convert to virtual dataset representation
vds: xr.Dataset = ms.to_virtual_dataset(loadable_variables=...)

# (or just go straight there using our existing API)
vds: xr.Dataset = vz.open_virtual_dataset('file.nc', reader=manifeststore_from_hdf, loadable_variables=...)

# xarray required to do manipulating in xarray space
vds_combined: xr.Dataset = xr.concatenate(vds1, vds2, ...)

# write to some virtual references format - xarray required to write the non-virtual variables
# this could convert the virtual variables to a `ManifestStore` first as well as using `Dataset.to_zarr(icechunkstore)` for the loadable variables as it currently does
vds.to_icechunk(icechunkstore)

Advantages:

  1. Total separation of concerns between virtualizing files into the zarr data model and manipulating them using the xarray data model (this would probably be helpful for fill_value and CF-related stuff too),
  2. Can create virtual zarr references for data that xarray cannot even represent (e.g. multiple arrays with non-alignable dimensions in the same group).

Disadvantages:

  1. Might be less clear for non-expert users, because there are now two ways to read and write references. I still think we would present the xarray interface as the standard UI, we would just mention that this is possible in a developers section of the docs, as ManifestStore is only supposed to be developer API anyway.

TomNicholas avatar Mar 31 '25 20:03 TomNicholas

I think namespace packages could help with the separation of concerns, in addition to the optional dependencies approach. For example:

virtualizarr.manifests - In-memory manifest representations (Chunk, Array, Group, Store) virtualizarr.readers - All "canonical" reader functions (HDF5, TIFF, etc.) virtualizarr.xarray - Access to Xarray's manipulation functionality virtualizarr.writers - Write to Icechunk or Kerchunk

Might be less clear for non-expert users, because there are now two ways to read and write references. I still think we would present the xarray interface as the standard UI, we would just mention that this is possible in a developers section of the docs, as ManifestStore is only supposed to be developer API anyway.

The fact that we have two ways to read and write references is due to our use of accessors and the proposed addition of to_icechunk as a class method on ManifestStore rather than as a function. We could instead lean heavily into a functional API, which would make it so that there's only one approach and would eliminate the accessor downsides that @chuckwondo brought up. Of course this would be a much larger change than just making Xarray optional, but now's the time to decide for or against big changes as we're preparing for v2.0.

So the primary user facing functions would be:

def virtualize_to_zarr(filepath: str) -> ManifestStore: # or virtualize_to_store()
    ...
def virtualize_to_xarray(filepath: str) -> Dataset | DataTree: # or virtualize_to_dataset() + virtualize_to_datatree()
    ms = virtualize_to_zarr(filepath)
    return to_xarray(ms)
def to_icechunk(virtual_data: ManifestStore | Dataset | DataTree , store: IcechunkStore):
    ...

Not quite sure where functionality like renaming paths or extracting a coordinate from a filename would go in this model. Totally fine with being shot down on this idea, but thought it could be helpful for discussion.

maxrjones avatar Apr 02 '25 01:04 maxrjones

namespace packages

I like this idea, with the one exception that xarray itself is our namespace for xarray manipulations.

The fact that we have two ways to read and write references is due to our use of accessors and the proposed addition of to_icechunk as a class method on ManifestStore rather than as a function.

You're right that that's the reason we would have two callables, but it's not the reason we would have two codepaths. Two codepaths are inevitable once we have an explicit separation between ManifestStore and virtual Datasets - they each have (slightly) different data models so each require different codepaths to serialize.

We could instead lean heavily into a functional API

to_icechunk(virtual_data: ManifestStore | Dataset | DataTree , store: IcechunkStore):

The reason I'm not a fan of this idea is the same reason I wasn't a fan in #389 - it doesn't dispatch the correct callable for the given type. Serializing ManifestStore vs Dataset vs DataTree all require similar but slightly different functions, with slightly different APIs (e.g. DataTree.to_zarr() has a write_inherited_coords kwarg that makes no sense for Dataset.to_zarr()). Smushing them all into one function like you're suggesting here would lead to a confusing set of overloaded kwarg signatures.

Therefore to do it the functional way you really need one function per type (ManifestStore / DataArray / Dataset / DataTree), whose name is redundant with its argument type (i.e. dataset_to_icechunk(ds: Dataset)), instead of just one method for all of them that has a slightly different API depending on what type you call the method on. I feel like this is a situation in which a method makes more sense than overloaded functions.

TomNicholas avatar Apr 02 '25 14:04 TomNicholas

Regardless it seems like you're a fan of the idea of making xarray an optional dependency @maxrjones ? (With #389 as a fairly separable issue)

TomNicholas avatar Apr 02 '25 14:04 TomNicholas

Regardless it seems like you're a fan of the idea of making xarray an optional dependency @maxrjones ? (With #389 as a fairly separable issue)

Yes, if it's done in combination with namespace packages to make it simpler for users/developers to understand where different optional dependencies should be used.

maxrjones avatar Apr 02 '25 16:04 maxrjones