
Wrapping a `kerchunk.Array` object directly with xarray

Open TomNicholas opened this issue 5 months ago • 3 comments

What is your issue?

In https://github.com/fsspec/kerchunk/issues/377 the idea came up of using the xarray API to concatenate arrays which represent parts of a zarr store - i.e. using xarray to kerchunk a large set of netCDF files instead of using kerchunk.combine.MultiZarrToZarr.

The idea is to make something like this work for kerchunking sets of netCDF files into zarr stores:

import xarray as xr

ds = xr.open_mfdataset(
    '/my/files*.nc',
    engine='kerchunk',  # kerchunk registers an xarray IO backend that returns zarr.Array objects
    combine='nested',  # 'by_coords' would require actually reading coordinate data
    parallel=True,  # would use dask.delayed to generate reference dicts for each file in parallel
)

ds  # now wraps a bunch of zarr.Array / kerchunk.Array objects, no need for dask arrays

ds.kerchunk.to_zarr(store='out.zarr')  # kerchunk defines an xarray accessor that extracts the zarr arrays and serializes them (which could also be done in parallel if writing to parquet)

I had a go at doing this in this notebook, and in doing so discovered a few potential issues with xarray's internals.

For this to work xarray has to:

  • Wrap a kerchunk.Array object that defines barely any array API methods and supports essentially no indexing,
  • Store all the information present in a kerchunked Zarr store without ever loading any data,
  • Not create any indexes by default during dataset construction or during xr.concat,
  • Not try to do anything else that can't be defined for a kerchunk.Array,
  • Possibly have the Lazy Indexing classes support concatenation (https://github.com/pydata/xarray/issues/4628).
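
To make these requirements concrete, here is a rough pure-Python sketch of the kind of object xarray would be wrapping: it knows its shape, dtype, and where each chunk's bytes live, it can be concatenated, and it refuses to be indexed. `ChunkRefArray`, its fields, and `concat` are hypothetical names made up for illustration; this is not kerchunk's or xarray's actual API, and it only handles arrays that are a whole number of chunks long along a non-negative concatenation axis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRefArray:
    """Hypothetical stand-in for a kerchunk-style array: metadata only, no values."""
    shape: tuple
    dtype: str
    chunks: tuple  # chunk shape, identical for every chunk
    refs: dict     # chunk key such as "0.0" -> (path, byte offset, byte length)

    def __getitem__(self, key):
        # Indexing would mean reading bytes, which this object never does.
        raise NotImplementedError("ChunkRefArray holds references only")


def concat(arrays, axis=0):
    """Concatenate along `axis` by renumbering chunk keys; no data is read."""
    first = arrays[0]
    new_refs = {}
    chunk_offset = 0  # number of chunks already placed along `axis`
    total = 0         # running length along `axis`
    for arr in arrays:
        if arr.dtype != first.dtype or arr.chunks != first.chunks:
            raise ValueError("dtype and chunk shape must match")
        if arr.shape[:axis] + arr.shape[axis + 1:] != first.shape[:axis] + first.shape[axis + 1:]:
            raise ValueError("shapes must match off the concatenation axis")
        if arr.shape[axis] % arr.chunks[axis] != 0:
            raise ValueError("sketch only handles whole chunks along the axis")
        for key, ref in arr.refs.items():
            idx = [int(i) for i in key.split(".")]
            idx[axis] += chunk_offset  # shift this array's chunks past earlier ones
            new_refs[".".join(str(i) for i in idx)] = ref
        chunk_offset += arr.shape[axis] // arr.chunks[axis]
        total += arr.shape[axis]
    new_shape = first.shape[:axis] + (total,) + first.shape[axis + 1:]
    return ChunkRefArray(new_shape, first.dtype, first.chunks, new_refs)


# Two single-chunk arrays, each pointing at a byte range in a different file.
a = ChunkRefArray((2, 4), "float32", (2, 4), {"0.0": ("a.nc", 100, 32)})
b = ChunkRefArray((2, 4), "float32", (2, 4), {"0.0": ("b.nc", 100, 32)})
c = concat([a, b], axis=0)
print(c.shape)  # (4, 4)
print(c.refs)   # {'0.0': ('a.nc', 100, 32), '1.0': ('b.nc', 100, 32)}
```

The point of the sketch is that concatenation here is purely bookkeeping on chunk keys, which is why any xarray code path that indexes or builds an index from the wrapped object gets in the way.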

It's an interesting exercise in using xarray as an abstraction, with no access to real numerical values at all.

TomNicholas, Feb 03 '24 22:02

One issue that came up is not being able to avoid creating indexes for 1D coordinates (@benbovy): https://github.com/pydata/xarray/pull/8107#discussion_r1477122555

TomNicholas, Feb 03 '24 22:02

Another question: why does opening a dataset require indexing into it?

If I add a print(indexer) inside Variable.__getitem__ (here: https://github.com/pydata/xarray/blob/c9ba2be2690564594a89eb93fb5d5c4ae7a9253c/xarray/core/variable.py#L812), this happens:

In [3]: xr.tutorial.open_dataset('air_temperature')
Out[3]: BasicIndexer((slice(None, 13, None),))
BasicIndexer((slice(-12, None, None),))
BasicIndexer((slice(None, 14, None),))
BasicIndexer((slice(-13, None, None),))
BasicIndexer((slice(None, 12, None),))
BasicIndexer((slice(-11, None, None),))

Why is any of that indexing happening?

EDIT: Oh it's happening when the ds gets printed...

EDIT2: False alarm about this one 😅
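
For anyone else puzzled by the same output, here is a toy model of the effect. The paired head and tail slices above are what you would expect from a truncated preview being built for display. `LoggingArray` and `summarize` are hypothetical names invented for this sketch; it is not xarray's formatting code, just the same pattern in miniature.

```python
class LoggingArray:
    """Hypothetical lazy array that records every indexer it receives."""

    def __init__(self, values):
        self._values = list(values)
        self.seen = []  # indexers passed to __getitem__, in order

    def __len__(self):
        return len(self._values)

    def __getitem__(self, key):
        self.seen.append(key)
        return self._values[key]


def summarize(arr, edge=3):
    """Build a truncated preview from the first and last `edge` items."""
    head = arr[:edge]    # first indexing call
    tail = arr[-edge:]   # second indexing call
    return "[" + ", ".join(map(str, head)) + ", ..., " + ", ".join(map(str, tail)) + "]"


arr = LoggingArray(range(100))
print(arr.seen)   # [] : constructing the object indexes nothing
text = summarize(arr)
print(text)       # [0, 1, 2, ..., 97, 98, 99]
print(arr.seen)   # [slice(None, 3, None), slice(-3, None, None)]
```

So the indexing comes from displaying the object, not from opening it, which matches the conclusion in the EDIT above.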

TomNicholas, Feb 03 '24 22:02

It's interesting to define a minimal ConcatenatableArray and see what is required to get that to work inside xarray. Again we get problems with indexes being built unexpectedly.

Notebook for ConcatenatableArray here
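
As a rough idea of what "minimal" can mean here (this is a separate sketch, not the notebook's code): a class that carries only shape and dtype and opts in to `np.concatenate` through NumPy's `__array_function__` protocol, refusing every other NumPy function. `ShapeOnlyArray` is a hypothetical name, and the sketch assumes a non-negative `axis`.

```python
import numpy as np


class ShapeOnlyArray:
    """Hypothetical duck array: shape and dtype only, no values."""

    def __init__(self, shape, dtype):
        self.shape = tuple(shape)
        self.dtype = np.dtype(dtype)

    @property
    def ndim(self):
        return len(self.shape)

    def __array_function__(self, func, types, args, kwargs):
        # Only np.concatenate is supported; NumPy raises TypeError for the rest.
        if func is not np.concatenate:
            return NotImplemented
        arrays = args[0]
        axis = args[1] if len(args) > 1 else kwargs.get("axis", 0)
        if not all(isinstance(a, ShapeOnlyArray) for a in arrays):
            return NotImplemented
        first = arrays[0]
        for a in arrays[1:]:
            if a.dtype != first.dtype:
                raise ValueError("dtypes must match")
            if a.shape[:axis] + a.shape[axis + 1:] != first.shape[:axis] + first.shape[axis + 1:]:
                raise ValueError("shapes must match off the concatenation axis")
        # The result is pure bookkeeping: sum the lengths along `axis`.
        length = sum(a.shape[axis] for a in arrays)
        new_shape = first.shape[:axis] + (length,) + first.shape[axis + 1:]
        return ShapeOnlyArray(new_shape, first.dtype)


a = ShapeOnlyArray((10, 4), "float32")
b = ShapeOnlyArray((5, 4), "float32")
c = np.concatenate([a, b], axis=0)
print(type(c).__name__, c.shape)  # ShapeOnlyArray (15, 4)
```

Anything that needs actual values, such as `np.sum(a)`, fails with a `TypeError` here, which is a convenient way to find the places where a wrapper library touches the data.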

TomNicholas, Feb 04 '24 21:02