
Add Index.load() and Index.chunk() methods

Open benbovy opened this issue 2 years ago • 10 comments

  • [ ] Closes #xxxx
  • [ ] Tests added
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [ ] New functions/methods are listed in api.rst

As mentioned in #8124, this gives custom Xarray indexes more control over what to do when the counterpart Dataset / DataArray load() and chunk() methods are called.

PandasIndex.load() and PandasIndex.chunk() always return self (no action required).

For a DaskIndex, we might want to return a PandasIndex (or another non-lazy index) from load() and rebuild a DaskIndex object from chunk() (rechunk).
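A minimal sketch of the proposed protocol, written in plain Python without importing xarray. The names `EagerIndex` and `LazyIndex` are hypothetical stand-ins for `PandasIndex` and a dask-backed index, and `load()` / `chunk()` are the methods proposed here, not existing Xarray API.

```python
class Index:
    """Toy base class: by default both methods propagate the index unchanged."""

    def load(self):
        return self

    def chunk(self, chunks):
        return self


class EagerIndex(Index):
    """Hypothetical stand-in for PandasIndex: values already live in memory."""

    def __init__(self, values):
        self.values = list(values)


class LazyIndex(Index):
    """Hypothetical stand-in for a dask-backed index: values are stored in blocks."""

    def __init__(self, blocks):
        self.blocks = [list(block) for block in blocks]

    def _flatten(self):
        return [value for block in self.blocks for value in block]

    def load(self):
        # Materialize every block and return a non-lazy index instead of self.
        return EagerIndex(self._flatten())

    def chunk(self, chunks):
        # Rechunk: rebuild a lazy index with blocks of at most `chunks` values.
        flat = self._flatten()
        return LazyIndex([flat[i:i + chunks] for i in range(0, len(flat), chunks)])


lazy = LazyIndex([[0, 1], [2, 3], [4]])
eager = lazy.load()          # an EagerIndex holding all five values
rechunked = lazy.chunk(3)    # a new LazyIndex with blocks of size 3 and 2
```

The eager index inherits the defaults, so `eager.load()` and `eager.chunk(2)` both return `eager` itself.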

benbovy avatar Aug 31 '23 14:08 benbovy

Index.compute() might be a possible alternative to Index.load() #6837.

benbovy avatar Apr 16 '25 06:04 benbovy

How would this work for compute?

For load, I could see

ds.xindexes["foo"].load()

but the pattern for compute is usually:

ds2 = ds.compute()

how would that translate?

dcherian avatar Apr 16 '25 18:04 dcherian

Index.load() has different semantics than Dataset.load(): it returns an index object that replaces the existing index when Dataset.load() is called. The returned index may be self (the index is propagated unchanged), a new instance, possibly of another type (e.g., the index is converted to a PandasIndex), or None (the index is dropped).

Index.load() (like the rest of the core Index API) is not intended to be an end-user-facing API; it is used internally by Dataset.load(), or by Dataset.compute() via Dataset.load().

In general the Index method names were chosen after the Dataset methods in which they are called, but maybe Index.compute() or another name would be less confusing here?
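To illustrate the replacement semantics described above, here is a rough sketch of how a container could consume the return value of `Index.load()`. Everything in it is hypothetical: `load_indexes` is an illustrative helper, not an Xarray function, and the toy classes only model the three possible outcomes (keep, convert, drop).

```python
class Index:
    def load(self):
        return self  # default: propagate the index unchanged


class KeepIndex(Index):
    """Relies on the default and returns itself."""


class EagerIndex(Index):
    """Hypothetical in-memory index, e.g. something like a PandasIndex."""


class ConvertIndex(Index):
    def load(self):
        return EagerIndex()  # replace this index with one of another type


class DropIndex(Index):
    def load(self):
        return None  # drop the index; its coordinate would become unindexed


def load_indexes(indexes):
    """Hypothetical helper mimicking what Dataset.load() might do with its indexes."""
    loaded = {}
    for name, index in indexes.items():
        new_index = index.load()
        if new_index is not None:  # None means "remove this index"
            loaded[name] = new_index
    return loaded


before = {"a": KeepIndex(), "b": ConvertIndex(), "c": DropIndex()}
after = load_indexes(before)
```

Here `after["a"]` is the same object as `before["a"]`, `after["b"]` is an `EagerIndex`, and `"c"` no longer appears in the mapping.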

benbovy avatar Apr 16 '25 18:04 benbovy

So if I was a user using CoordinateTransformIndex and I wanted to "load" the transformed values into memory, how would I do that?

dcherian avatar Apr 16 '25 21:04 dcherian

As an end-user you would only need to do ds.load() or ds.compute() and not care much about anything else.

It is up to the index to define how to "load" the coordinate values and maybe convert itself. For CoordinateTransformIndex I see three options:

  1. 1D index may be converted into a PandasIndex
  2. nD index may be dropped, so Dataset.load() will fall back to Variable.load() for loading the index coordinate data
  3. add a CoordinateTransformIndex.__init__(lazy=True) option that will be used in CoordinateTransformIndex.create_variables() and that will determine the kind of variable to return

Option 3 probably makes the most sense if we still need to keep track of the underlying transform.
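Option 3 could look roughly like the sketch below, which uses a toy affine transform in plain Python. `ToyTransformIndex` and `LazyValues` are hypothetical names, and having `load()` return a copy with `lazy=False` is an assumption about how the flag might be wired up, not something settled in this thread.

```python
class LazyValues:
    """Computes coordinate values on demand from the index's transform."""

    def __init__(self, index):
        self._index = index

    def __len__(self):
        return self._index.size

    def __getitem__(self, i):
        if not 0 <= i < self._index.size:
            raise IndexError(i)
        return self._index.forward(i)


class ToyTransformIndex:
    """Hypothetical 1D stand-in for CoordinateTransformIndex."""

    def __init__(self, scale, offset, size, lazy=True):
        self.scale = scale
        self.offset = offset
        self.size = size
        self.lazy = lazy

    def forward(self, i):
        # The underlying transform: pixel position -> coordinate value.
        return self.offset + self.scale * i

    def create_variables(self):
        # The `lazy` flag decides which kind of variable is returned.
        if self.lazy:
            return LazyValues(self)
        return [self.forward(i) for i in range(self.size)]

    def load(self):
        # Assumption: keep the transform, only switch off laziness.
        return ToyTransformIndex(self.scale, self.offset, self.size, lazy=False)


index = ToyTransformIndex(scale=0.5, offset=10.0, size=4)
lazy_values = index.create_variables()    # nothing materialized yet
loaded = index.load()
eager_values = loaded.create_variables()  # a plain in-memory list
```

The loaded index still carries `scale` and `offset`, so the transform stays available after the values are realized.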

benbovy avatar Apr 17 '25 07:04 benbovy

I'm not sure we should conflate the two.

For example, I could have a dataset with a bunch of chunked arrays and a CoordinateTransformIndex. I might want to load the data into memory, but not realize the lazy coordinates.

And conversely, I might want to realize the CoordinateTransform values (say I've subset to a small region), but not load any chunked arrays.

I guess (3) is an option, but it's a bit of "action-at-a-distance". What is the most explicit API we can come up with?

# assuming RasterIndex over 'x', 'y' dimensions
ds.xindexes.update({"x": ds.xindexes["x"].load()})  # in-place (seems like it has to be)

dcherian avatar Apr 17 '25 13:04 dcherian

I see. Would it be reasonable to add a Dataset.load(load_coords=False) option? And add a Dataset.coords.load() method for the case of loading the coordinates but not the data? This is not the most fine-grained approach but maybe that's enough for most cases?

> What is the most explicit API we can come up with?

I'd avoid ds.xindexes.update() since, in the long term, .xindexes might be reduced to a basic mapping of index objects (https://github.com/pydata/xarray/issues/9203#issuecomment-2714774678), whereas "loading" the index should also update the index coordinates.

Alternatively:

loaded_coords = xr.Coordinates.from_xindex(ds.xindexes["x"].load())

ds.coords.update(loaded_coords)
# or
ds = ds.assign_coords(loaded_coords)

benbovy avatar Apr 17 '25 13:04 benbovy

I like loaded_coords = xr.Coordinates.from_xindex(ds.xindexes["x"].load()) as the explicit API.

dcherian avatar Apr 17 '25 14:04 dcherian

Assuming a multi-coordinate index like RasterIndex over x/y dimensions, ds.xindexes["x"].load() may look confusing: what about "y"?

Some possible ways to make it less confusing:

  • In Xarray, update Indexes.__getitem__(self, key) such that key accepts a tuple. This would allow typing ds.xindexes[("x", "y")], which would basically return the same index as ds.xindexes["x"] or ds.xindexes["y"]

  • 3rd-party API such as ds.rasterix.raster_index.load() or ds.rasterix.load_raster_coords()
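The first idea, tuple keys in `Indexes.__getitem__`, might be sketched as follows. `ToyIndexes` is a hypothetical stand-in for Xarray's `Indexes` mapping, and raising `KeyError` when the names do not share one index is an assumption about the desired behavior.

```python
from collections.abc import Mapping


class ToyIndexes(Mapping):
    """Hypothetical stand-in for xarray's Indexes mapping."""

    def __init__(self, indexes):
        self._indexes = dict(indexes)  # coordinate name -> index object

    def __getitem__(self, key):
        if isinstance(key, tuple):
            if not key:
                raise KeyError(key)
            found = [self._indexes[name] for name in key]
            # A tuple key is only valid when every name maps to one shared index.
            if any(index is not found[0] for index in found[1:]):
                raise KeyError(f"coordinates {key!r} do not share a single index")
            return found[0]
        return self._indexes[key]

    def __iter__(self):
        return iter(self._indexes)

    def __len__(self):
        return len(self._indexes)


raster_index = object()  # placeholder for a RasterIndex over x and y
time_index = object()    # placeholder for an unrelated index
xindexes = ToyIndexes({"x": raster_index, "y": raster_index, "time": time_index})

same = xindexes[("x", "y")] is xindexes["x"]
```

With this layout, `xindexes[("x", "y")]` makes it visible at the call site that the index spans both coordinates, while `xindexes[("x", "time")]` fails instead of silently picking one.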

benbovy avatar Apr 25 '25 09:04 benbovy

I'm a bit late to this discussion, but some of this reminds me of #8607

keewis avatar Jun 13 '25 13:06 keewis