xarray
Add Index.load() and Index.chunk() methods
- [ ] Closes #xxxx
- [ ] Tests added
- [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
- [ ] New functions/methods are listed in api.rst
As mentioned in #8124, this gives custom Xarray indexes more control over what to do when the corresponding Dataset / DataArray load() and chunk() methods are called.
PandasIndex.load() and PandasIndex.chunk() always return self (no action required).
For a DaskIndex, we might want to return a PandasIndex (or another non-lazy index) from load() and rebuild a DaskIndex object from chunk() (rechunk).
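A minimal sketch of that contract, using plain Python stand-ins rather than the real Xarray classes. ToyIndex, ToyPandasIndex and ToyDaskIndex are hypothetical names, and the signatures (in particular chunk(size)) are assumptions for illustration, not the ones in this PR:

```python
from __future__ import annotations


class ToyIndex:
    """Stand-in for an index base class (hypothetical, not xarray's Index)."""

    def load(self) -> ToyIndex | None:
        # Default: nothing to load, so the index is propagated unchanged.
        return self

    def chunk(self, size: int) -> ToyIndex | None:
        # Default: chunking does not affect the index.
        return self


class ToyPandasIndex(ToyIndex):
    """In-memory index: load() and chunk() are the inherited no-ops."""

    def __init__(self, values: list[float]):
        self.values = list(values)


class ToyDaskIndex(ToyIndex):
    """Lazy index backed by a list of blocks (a stand-in for a chunked array)."""

    def __init__(self, blocks: list[list[float]]):
        self.blocks = [list(block) for block in blocks]

    def _flat(self) -> list[float]:
        return [value for block in self.blocks for value in block]

    def load(self) -> ToyPandasIndex:
        # Materialize every block and return a non-lazy index.
        return ToyPandasIndex(self._flat())

    def chunk(self, size: int) -> ToyDaskIndex:
        # Rebuild the lazy index with a new block size (rechunk).
        flat = self._flat()
        return ToyDaskIndex([flat[i:i + size] for i in range(0, len(flat), size)])
```

With these stand-ins, ToyDaskIndex([[0.0, 1.0], [2.0, 3.0]]).load() gives back a ToyPandasIndex, while calling load() or chunk() on that result returns the same object.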
Index.compute() might be a possible alternative to Index.load() #6837.
How would this work for compute?
For load, I could see
ds.xindexes["foo"].load()
but the pattern for compute is usually:
ds2 = ds.compute()
how would that translate?
Index.load() has different semantics than Dataset.load(): it returns an index object that will replace the existing index when calling Dataset.load(). The returned index may be self (just propagate the index), a new instance, possibly of another type (e.g., convert the index to a PandasIndex), or None (drop the index).
Index.load() (like other core Index API) is not intended to be end-user facing API: it is used internally by Dataset.load(), or by Dataset.compute() via Dataset.load().
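To illustrate the three possible return values, here is a toy sketch of how a Dataset.load()-like caller could consume them. It is not the implementation in this PR: apply_index_load and the three small classes are hypothetical, and the indexes are modeled as a plain dict from coordinate name to index object.

```python
class KeepIndex:
    def load(self):
        return self  # propagate the index unchanged


class InMemoryIndex(KeepIndex):
    pass


class ConvertIndex:
    def load(self):
        return InMemoryIndex()  # replace with an index of another type


class DropIndex:
    def load(self):
        return None  # ask the caller to drop the index


def apply_index_load(indexes):
    """Return the name -> index mapping left after calling load() on each index."""
    results = {}  # id(index) -> load() result, so a shared index is loaded once
    new_indexes = {}
    for name, index in indexes.items():
        if id(index) not in results:
            results[id(index)] = index.load()
        loaded = results[id(index)]
        if loaded is not None:
            new_indexes[name] = loaded
    return new_indexes
```

Caching by id(index) is a design choice for this sketch: a multi-coordinate index registered under both "x" and "y" is loaded once, and both names end up pointing at the same replacement object.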
In general the Index method names were chosen after the Dataset methods in which they are called, but maybe Index.compute() or another name would be less confusing here?
So if I was a user using CoordinateTransformIndex and I wanted to "load" the transformed values into memory, how would I do that?
As an end-user you would only need to do ds.load() or ds.compute() and not care much about anything else.
It is up to the index to define how to "load" the coordinate values and maybe convert itself. For CoordinateTransformIndex I see three options:
- 1D index may be converted into a PandasIndex
- nD index may be dropped, so Dataset.load() will fall back to Variable.load() for loading the index coordinate data
- add a CoordinateTransformIndex.__init__(lazy=True) option that will be used in CoordinateTransformIndex.create_variables() and that will determine the kind of variable to return
Option 3 probably makes the most sense if we still need to keep track of the underlying transform.
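A rough sketch of what option 3 could look like, with toy classes only. ToyTransformIndex and LazyValues are hypothetical, and the constructor arguments (scale, offset, size) are invented for the example; the real CoordinateTransformIndex has a different signature.

```python
class LazyValues:
    """Computes coordinate values on demand from the owning index's transform."""

    def __init__(self, index):
        self.index = index

    def __len__(self):
        return self.index.size

    def __getitem__(self, i):
        if not 0 <= i < self.index.size:
            raise IndexError(i)
        return self.index.forward(i)


class ToyTransformIndex:
    """1D affine transform index with a `lazy` flag (assumed option, see above)."""

    def __init__(self, scale, offset, size, lazy=True):
        self.scale, self.offset, self.size, self.lazy = scale, offset, size, lazy

    def forward(self, i):
        # Map an integer position to a coordinate value.
        return self.offset + self.scale * i

    def create_variables(self):
        # The flag decides which kind of coordinate variable is returned.
        if self.lazy:
            return {"x": LazyValues(self)}
        return {"x": [self.forward(i) for i in range(self.size)]}

    def load(self):
        # Same transform, but coordinate values are now materialized.
        return ToyTransformIndex(self.scale, self.offset, self.size, lazy=False)
```

Here load() returns an index of the same type, so the underlying transform is still available after the coordinate values are in memory.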
I'm not sure we should conflate the two.
For example, I could have a dataset with a bunch of chunked arrays and a CoordinateTransformIndex. I might want to load the data into memory, but not realize the lazy coordinates.
And conversely, I might want to realize the CoordinateTransform values (say I've subset to a small region), but not load any chunked arrays.
I guess (3) is an option, but it's a bit of "action-at-a-distance". What is the most explicit API we can come up with?
# assuming RasterIndex over 'x', 'y' dimensions
ds.xindexes.update({"x": ds.xindexes["x"].load()}) # in-place (seems like it has to be)
I see. Would it be reasonable to add a Dataset.load(load_coords=False) option? And add a Dataset.coords.load() method for the case of loading the coordinates but not the data? This is not the most fine-grained approach but maybe that's enough for most cases?
What is the most explicit API we can come up with?
I'd avoid ds.xindexes.update() since, in the long term, .xindexes might be reduced to a basic mapping of index objects (https://github.com/pydata/xarray/issues/9203#issuecomment-2714774678), whereas "loading" the index should also update the index coordinates.
Alternatively:
loaded_coords = xr.Coordinates.from_xindex(ds.xindexes["x"].load())
ds.coords.update(loaded_coords)
# or
ds = ds.assign_coords(loaded_coords)
I like loaded_coords = xr.Coordinates.from_xindex(ds.xindexes["x"].load()) as the explicit API.
Assuming a multi-coordinate index like RasterIndex over x/y dimensions, ds.xindexes["x"].load() may look confusing: what about "y"?
Some possible ways to make it less confusing:
- In Xarray, update Indexes.__getitem__(self, key) such that key accepts a tuple. This would allow typing ds.xindexes[("x", "y")], which would basically return the same index as ds.xindexes["x"] or ds.xindexes["y"]
- 3rd-party API such as ds.rasterix.raster_index.load() or ds.rasterix.load_raster_coords()
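The first suggestion could behave roughly as follows. This is a toy mapping, not Xarray's Indexes class: ToyIndexes is a hypothetical name, and raising KeyError when the names do not share one index is an assumption about the desired behavior.

```python
class ToyIndexes:
    """Minimal name -> index mapping whose lookup also accepts a tuple of names."""

    def __init__(self, indexes):
        self._indexes = dict(indexes)

    def __getitem__(self, key):
        if isinstance(key, tuple):
            if not key:
                raise KeyError(key)
            # Look up every name; a missing name raises KeyError as usual.
            found = [self._indexes[name] for name in key]
            first = found[0]
            # All names must refer to the very same index object.
            if any(index is not first for index in found):
                raise KeyError(f"{key!r} do not share a single index")
            return first
        return self._indexes[key]
```

For a raster-like index registered under both "x" and "y", indexes[("x", "y")] then returns the same object as indexes["x"], while mixing in a name bound to a different index fails loudly.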
I'm a bit late to this discussion, but some of this reminds me of #8607