
Automatically create `xindex`?

Open max-sixty opened this issue 1 year ago • 6 comments

Is your feature request related to a problem?

I'm trying to use xindex more. Currently, selecting values by a coordinate that hasn't been explicitly indexed via set_xindex() raises a KeyError:

ds = xr.tutorial.open_dataset("air_temperature").assign_coords(lat2=lambda x: x.lat)

ds
# Output:
<xarray.Dataset> Size: 31MB
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
    lat2     (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
Data variables:
    air      (time, lat, lon) float64 31MB ...
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

# Attempting to select using the unindexed coordinate raises an error:
ds.sel(lat2=75)
# Output:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[20], line 1
----> 1 ds.sel(lat2=75)

File ~/workspace/xarray/xarray/core/dataset.py:3223, in Dataset.sel(self, indexers, method, tolerance, drop, **indexers_kwargs)
   3155 """Returns a new dataset with each array indexed by tick labels
   3156 along the specified dimension(s).
   3157
   (...)
   3220
   3221 """
   3222 indexers = either_dict_or_kwargs(indexers, indexers_kwargs, "sel")
-> 3223 query_results = map_index_queries(
   3224     self, indexers=indexers, method=method, tolerance=tolerance
   3225 )
   3227 if drop:
   3228     no_scalar_variables = {}

File ~/workspace/xarray/xarray/core/indexing.py:186, in map_index_queries(obj, indexers, method, tolerance, **indexers_kwargs)
    183     options = {"method": method, "tolerance": tolerance}
    185 indexers = either_dict_or_kwargs(indexers, indexers_kwargs, "map_index_queries")
--> 186 grouped_indexers = group_indexers_by_index(obj, indexers, options)
    188 results = []
    189 for index, labels in grouped_indexers:

File ~/workspace/xarray/xarray/core/indexing.py:145, in group_indexers_by_index(obj, indexers, options)
    143     grouped_indexers[index_id][key] = label
    144 elif key in obj.coords:
--> 145     raise KeyError(f"no index found for coordinate {key!r}")
    146 elif key not in obj.dims:
    147     raise KeyError(
    148         f"{key!r} is not a valid dimension or coordinate for "
    149         f"{obj.__class__.__name__} with dimensions {obj.dims!r}"
    150     )

KeyError: "no index found for coordinate 'lat2'"

After explicitly setting the index, it works as expected:

ds.set_xindex('lat2').sel(lat2=75)
# Output:
<xarray.Dataset> Size: 1MB
Dimensions:  (time: 2920, lon: 53)
Coordinates:
    lat      float32 4B 75.0
  * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
    lat2     float32 4B 75.0
Data variables:
    air      (time, lon) float64 1MB ...
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

It's a bit annoying: frequently I attempt to select something, realize it doesn't have an index, add the .set_xindex call, try to remember to add each one at object creation, and end up feeling that xarray isn't being as helpful as it could be.

Describe the solution you'd like

Could we instead set the xindex automatically when calling .sel?

Possibly we want to force the user to create this once, rather than paying the cost of creating a new index on each call? But OTOH it seems relatively cheap?

%timeit ds.assign_coords(lat2=ds.lat + 2).set_xindex('lat2')

349 µs ± 6.97 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

(I guess it could be possible to update a cache in place, and then creating a new index from the cache would be very cheap. Though also possibly that's a source of quite confusing behavior if our implementation is in any way wrong / people are sharing objects across threads etc — i.e. the principle of "don't update in place" is useful)
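For illustration, the proposed behavior can be approximated today with a small wrapper. This is only a sketch: `sel_with_temporary_index` is a hypothetical helper, not an xarray API, and it assumes a default index (what `set_xindex` builds when no index class is given) is acceptable for every unindexed coordinate in the query. A small synthetic dataset stands in for the tutorial data so the example runs offline.

```python
import numpy as np
import xarray as xr


def sel_with_temporary_index(obj, **indexers):
    """Select by label, building throwaway indexes where none exist.

    Hypothetical helper, not part of xarray. `obj` itself is never modified.
    """
    # Coordinates named in the query that exist but have no index yet.
    missing = [
        name
        for name in indexers
        if name in obj.coords and name not in obj.xindexes
    ]
    indexed = obj
    for name in missing:
        # set_xindex returns a new object and leaves the original untouched.
        indexed = indexed.set_xindex(name)
    return indexed.sel(**indexers)


# Synthetic stand-in for the tutorial dataset (assumed values).
ds = xr.Dataset(
    {"air": (("lat", "lon"), np.zeros((3, 4)))},
    coords={
        "lat": [75.0, 72.5, 70.0],
        "lon": [200.0, 202.5, 205.0, 207.5],
        "lat2": ("lat", [75.0, 72.5, 70.0]),  # coordinate without an index
    },
)

result = sel_with_temporary_index(ds, lat2=75)
```

The index is rebuilt on every call, so this pays the creation cost each time, and `ds` keeps no index on `lat2` afterwards.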

Describe alternatives you've considered

A set_xindex(...) param (i.e. literally an ellipsis ...) that just creates all the indexes that it can, and folks could call after creating an object?
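A rough sketch of what that alternative could do, written as a standalone function. `set_all_xindexes` is a hypothetical name, and "all the indexes that it can" is assumed here to mean every one-dimensional coordinate that does not already have an index.

```python
import numpy as np
import xarray as xr


def set_all_xindexes(obj):
    """Return a new object with a default index on each eligible coordinate.

    Hypothetical helper, not part of xarray. "Eligible" is an assumption:
    a one-dimensional coordinate that has no index yet.
    """
    names = [
        name
        for name, coord in obj.coords.items()
        if coord.ndim == 1 and name not in obj.xindexes
    ]
    for name in names:
        # Each call returns a new object; the input is left unchanged.
        obj = obj.set_xindex(name)
    return obj


# Synthetic data (assumed values) with one unindexed coordinate, lat2.
ds = xr.Dataset(
    {"air": (("lat", "lon"), np.zeros((3, 4)))},
    coords={
        "lat": [75.0, 72.5, 70.0],
        "lon": [200.0, 202.5, 205.0, 207.5],
        "lat2": ("lat", [75.0, 72.5, 70.0]),
    },
)

indexed = set_all_xindexes(ds)
selected = indexed.sel(lat2=72.5)
```

Calling this once after creating an object would pay the indexing cost a single time, up front.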

Additional context

No response

max-sixty avatar Nov 01 '24 18:11 max-sixty

Somehow I remember that this came up already a year ago or so. But I cannot seem to find the issue...

I think that this would be a great addition.

headtr1ck avatar Nov 01 '24 19:11 headtr1ck

👍 for automatically creating indexes when needed.

I would not modify the xarray object in place. Users can do this if they need the performance gains.

shoyer avatar Nov 01 '24 21:11 shoyer

One quick thought: should we add them when creating the object?

max-sixty avatar Nov 02 '24 19:11 max-sixty

Might be related: https://github.com/pydata/xarray/issues/8028

headtr1ck avatar Nov 03 '24 10:11 headtr1ck

I agree that explicitly setting the index can be a bit annoying sometimes. I'm a little worried about automatically creating (even temporary) indexes, though.

The created index would be a default PandasIndex I guess, which is currently OK since custom Xarray indexes are not yet widely used. I hope the ecosystem will provide many useful kinds of indexes in the future, even though this could make things a bit harder to interpret (like subtle differences in .sel results depending on the index). Implicitly created indexes could make things worse. For example: ds.sel(lon=...) using an explicitly set spatial (periodic) index vs. ds.sel(lon2=...) creating a default (non-periodic) pandas index on the fly.
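To make that difference concrete, here is a small sketch under stated assumptions. The periodic lookup below is hand-rolled with NumPy purely for comparison; it is not an actual xarray index class, and the grid and query value are made up.

```python
import numpy as np
import xarray as xr

lon = np.arange(0.0, 360.0, 2.5)  # 0.0, 2.5, ..., 357.5
da = xr.DataArray(
    np.arange(lon.size),
    dims="lon",
    coords={"lon": lon, "lon2": ("lon", lon)},
)

query = 359.5

# Default pandas-backed index: nearest neighbour on a plain number line.
plain = da.set_xindex("lon2").sel(lon2=query, method="nearest")
plain_lon = float(plain["lon2"])

# Hand-rolled periodic lookup: distances wrap around at 360 degrees.
dist = np.abs(lon - query)
dist = np.minimum(dist, 360.0 - dist)
periodic_lon = float(lon[dist.argmin()])

print(plain_lon, periodic_lon)
```

The non-periodic lookup lands on the last grid point (357.5), while the wrap-around lookup lands on the first (0.0), so the same query returns different data depending on which kind of index answers it.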

What about dimensions with billions of elements? It's an edge case but it has been discussed a few times. It is now possible to create new datasets with no index at all... How can we also avoid the creation of temporary indexes in that case?

(More theoretically, this somewhat contradicts the goal of the index refactor, which was to make indexes "explicitly" part of the Xarray data model.)

> One quick thought: should we add them when creating the object?

I think I'd prefer this. However, we would first need to support selection and alignment with multiple indexes sharing common dimensions.

benbovy avatar Dec 16 '24 23:12 benbovy

We've discussed this with @dcherian, @keewis and @ianhi. I changed my mind: this is actually a good idea :).

> I would not modify the xarray object in place. Users can do this if they need the performance gains.

I agree.

benbovy avatar Dec 10 '25 11:12 benbovy