xarray
Wrapping a `kerchunk.Array` object directly with xarray
What is your issue?
In https://github.com/fsspec/kerchunk/issues/377 the idea came up of using the xarray API to concatenate arrays which represent parts of a zarr store, i.e. using xarray to kerchunk a large set of netCDF files instead of using `kerchunk.combine.MultiZarrToZarr`.
The idea is to make something like this work for kerchunking sets of netCDF files into zarr stores:
ds = xr.open_mfdataset(
    '/my/files*.nc',
    engine='kerchunk',  # kerchunk registers an xarray IO backend that returns zarr.Array objects
    combine='nested',  # 'by_coords' would require actually reading coordinate data
    parallel=True,  # would use dask.delayed to generate reference dicts for each file in parallel
)
ds  # now wraps a bunch of zarr.Array / kerchunk.Array objects, no need for dask arrays
# kerchunk defines an xarray accessor that extracts the zarr arrays and serializes them
# (which could also be done in parallel if writing to parquet)
ds.kerchunk.to_zarr(store='out.zarr')
I had a go at doing this in this notebook, and in doing so discovered a few potential issues with xarray's internals.
For this to work xarray has to:
- Wrap a `kerchunk.Array` object which barely defines any array API methods, including basically not supporting indexing at all,
- Store all the information present in a kerchunked Zarr store but without ever loading any data,
- Not create any indexes by default during dataset construction or during `xr.concat`,
- Not try to do anything else that can't be defined for a `kerchunk.Array`.
- Possibly we need the Lazy Indexing classes to support concatenation https://github.com/pydata/xarray/issues/4628
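To make the requirements above concrete, here is a rough sketch of what a reference-only array could look like. Everything in it (`ChunkRef`, `ReferenceArray`, `concat_refs`) is a hypothetical name invented for illustration, not kerchunk's or xarray's API, and it assumes a regular chunk grid with no partial chunks along the concatenation axis. Concatenation only renumbers chunk indices; no bytes are ever read.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ChunkRef:
    """Hypothetical byte-range reference to one chunk inside a file."""
    path: str    # file that holds the chunk
    offset: int  # byte offset of the chunk within that file
    length: int  # number of bytes in the chunk


@dataclass
class ReferenceArray:
    """Hypothetical stand-in for the `kerchunk.Array` idea: metadata only."""
    shape: tuple
    chunks: tuple  # chunk shape, same rank as `shape`
    dtype: str
    refs: dict = field(default_factory=dict)  # chunk index tuple -> ChunkRef

    def __getitem__(self, key):
        # No data is ever loaded, so indexing cannot be supported.
        raise NotImplementedError("ReferenceArray holds references only")


def concat_refs(arrays, axis=0):
    """Concatenate reference-only arrays by shifting chunk indices along `axis`."""
    first = arrays[0]
    off_axis = first.shape[:axis] + first.shape[axis + 1:]
    new_refs = {}
    chunk_offset = 0  # number of chunks already placed along `axis`
    total = 0         # number of elements already placed along `axis`
    for arr in arrays:
        if arr.dtype != first.dtype or arr.chunks != first.chunks:
            raise ValueError("dtype and chunk shape must match")
        if arr.shape[:axis] + arr.shape[axis + 1:] != off_axis:
            raise ValueError("shapes must match off the concatenation axis")
        if arr.shape[axis] % arr.chunks[axis] != 0:
            # Assumption of this sketch: no partial chunks along `axis`.
            raise ValueError("partial chunks along the concatenation axis")
        for idx, ref in arr.refs.items():
            shifted = idx[:axis] + (idx[axis] + chunk_offset,) + idx[axis + 1:]
            new_refs[shifted] = ref
        chunk_offset += arr.shape[axis] // arr.chunks[axis]
        total += arr.shape[axis]
    new_shape = first.shape[:axis] + (total,) + first.shape[axis + 1:]
    return ReferenceArray(new_shape, first.chunks, first.dtype, new_refs)


# Two made-up files, each contributing two chunks of four float32 values.
a = ReferenceArray((2, 4), (1, 4), "float32",
                   {(0, 0): ChunkRef("a.nc", 0, 16), (1, 0): ChunkRef("a.nc", 16, 16)})
b = ReferenceArray((2, 4), (1, 4), "float32",
                   {(0, 0): ChunkRef("b.nc", 0, 16), (1, 0): ChunkRef("b.nc", 16, 16)})
combined = concat_refs([a, b], axis=0)
print(combined.shape, sorted(combined.refs))
```

The point of the sketch is that shape, dtype, chunk shape and the reference mapping are enough to define concatenation, which is why any step in xarray that indexes into the array or builds an index from its values gets in the way.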
It's an interesting exercise in using xarray as an abstraction, with no access to real numerical values at all.
One issue that came up is not being able to avoid creating indexes for 1D coordinates (@benbovy): https://github.com/pydata/xarray/pull/8107#discussion_r1477122555
Another one is why does opening a dataset require indexing into it?
If I add a `print(indexer)` inside `Variable.__getitem__` here https://github.com/pydata/xarray/blob/c9ba2be2690564594a89eb93fb5d5c4ae7a9253c/xarray/core/variable.py#L812 this happens:
In [3]: xr.tutorial.open_dataset('air_temperature')
Out[3]: BasicIndexer((slice(None, 13, None),))
BasicIndexer((slice(-12, None, None),))
BasicIndexer((slice(None, 14, None),))
BasicIndexer((slice(-13, None, None),))
BasicIndexer((slice(None, 12, None),))
BasicIndexer((slice(-11, None, None),))
Why is any of that indexing happening?
EDIT: Oh, it's happening when the `ds` gets printed...
EDIT2: False alarm about this one 😅
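For anyone hitting the same confusion, the effect is easy to reproduce outside xarray. The sketch below is not xarray's formatting code; `LoggingArray` and `preview` are hypothetical names, and `preview` just mimics a repr that shows the first and last few values. Constructing the wrapper indexes nothing, while printing a summary produces a head slice and a tail slice like the ones in the log above.

```python
import numpy as np


class LoggingArray:
    """Hypothetical wrapper that records every index applied to it."""

    def __init__(self, data):
        self._data = np.asarray(data)
        self.log = []  # every key passed to __getitem__, in order

    @property
    def shape(self):
        return self._data.shape

    def __getitem__(self, key):
        self.log.append(key)
        return self._data[key]


def preview(arr, n=3):
    """Mimic a repr that only shows the first and last `n` values."""
    head = arr[:n]
    tail = arr[-n:]
    return f"{head.tolist()} ... {tail.tolist()}"


x = LoggingArray(np.arange(10))
print(x.log)       # nothing has been indexed yet
print(preview(x))  # building the summary slices the head and the tail
print(x.log)       # now holds the two slices used by preview
```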
It's interesting to define a minimal `ConcatenatableArray` and see what is required to get that to work inside xarray. Again we get problems with indexes being built unexpectedly.
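As a rough idea of what "minimal" can mean here, the sketch below uses NumPy's `__array_function__` protocol so that `np.concatenate` works while indexing and every other NumPy function fail loudly. It is an assumption-laden illustration, not the class from the notebook: for simplicity it wraps a real NumPy array, whereas the kerchunk use case would hold only references.

```python
import numpy as np


class ConcatenatableArray:
    """Hypothetical duck array: supports np.concatenate, refuses indexing."""

    def __init__(self, array):
        # Simplification: wrap real data so the sketch is runnable.
        self._array = np.asarray(array)

    @property
    def shape(self):
        return self._array.shape

    @property
    def dtype(self):
        return self._array.dtype

    @property
    def ndim(self):
        return self._array.ndim

    def __getitem__(self, key):
        raise NotImplementedError("indexing is deliberately unsupported")

    def __array_function__(self, func, types, args, kwargs):
        # Only concatenation is implemented; NumPy raises TypeError for the rest.
        if func is not np.concatenate:
            return NotImplemented
        unwrapped = [a._array for a in args[0]]
        return ConcatenatableArray(np.concatenate(unwrapped, *args[1:], **kwargs))


a = ConcatenatableArray(np.zeros((2, 3)))
b = ConcatenatableArray(np.ones((4, 3)))
c = np.concatenate([a, b], axis=0)
print(type(c).__name__, c.shape)
```

Because `__getitem__` raises, any place where xarray tries to index the wrapped array, for example to build an index for a 1D coordinate, shows up immediately as an error rather than passing silently.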