Invisible differences between arrays using IntervalIndex
What happened:
I have two DataArrays that each have a coordinate constructed with pandas.interval_range. In one case I pass the interval_range directly, in the other case I call .to_numpy() first. The two DataArrays look identical but aren't. This can lead to hard-to-find bugs, because behaviour is not identical: the former supports indexing whereas the latter doesn't.
What you expected to happen:
I expect two arrays that appear identical to behave identically. If they don't behave identically then there should be some way to tell the difference (apart from equals, which tells me they are different but not how).
Minimal Complete Verifiable Example:
import xarray
import pandas
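# da1: the coordinate is passed as a pandas.IntervalIndex
da1 = xarray.DataArray(
    [0, 1, 2],
    dims=("x",),
    coords={"x": pandas.interval_range(0, 2, 3)},
)
# da2: the same intervals, converted to an object-dtype numpy array first
da2 = xarray.DataArray(
    [0, 1, 2],
    dims=("x",),
    coords={"x": pandas.interval_range(0, 2, 3).to_numpy()},
)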
print(repr(da1) == repr(da2))
print(repr(da1.x) == repr(da2.x))
print(da1.x.dtype == da2.x.dtype)
# identical? No:
print(da1.equals(da2))
print(da1.x.equals(da2.x))
# in particular:
da1.sel(x=1) # works
da2.sel(x=1) # fails
Results in:
True
True
True
False
False
Traceback (most recent call last):
File "/data/gholl/miniconda3/envs/py38/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2895, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 1
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "mwe105.py", line 19, in <module>
da2.sel(x=1) # fails
File "/data/gholl/miniconda3/envs/py38/lib/python3.8/site-packages/xarray/core/dataarray.py", line 1143, in sel
ds = self._to_temp_dataset().sel(
File "/data/gholl/miniconda3/envs/py38/lib/python3.8/site-packages/xarray/core/dataset.py", line 2105, in sel
pos_indexers, new_indexes = remap_label_indexers(
File "/data/gholl/miniconda3/envs/py38/lib/python3.8/site-packages/xarray/core/coordinates.py", line 397, in remap_label_indexers
pos_indexers, new_indexes = indexing.remap_label_indexers(
File "/data/gholl/miniconda3/envs/py38/lib/python3.8/site-packages/xarray/core/indexing.py", line 275, in remap_label_indexers
idxr, new_idx = convert_label_indexer(index, label, dim, method, tolerance)
File "/data/gholl/miniconda3/envs/py38/lib/python3.8/site-packages/xarray/core/indexing.py", line 196, in convert_label_indexer
indexer = index.get_loc(label_value, method=method, tolerance=tolerance)
File "/data/gholl/miniconda3/envs/py38/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
raise KeyError(key) from err
KeyError: 1
Additional context
I suppose this happens because under the hood xarray does something clever to support pandas-style indexing even though the coordinate variable appears like a numpy array with an object dtype, and that this cleverness is lost if the object is already converted to a numpy array. But there is, as far as I can see, no way to tell the difference once the objects have been created.
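The suspected mechanism can be sketched with pandas alone. This is a minimal illustration, assuming (not confirmed here) that xarray ends up with a plain object-dtype pandas.Index when it is handed the numpy array; the traceback above, which goes through PyObjectHashTable, is consistent with that. The variable names are illustrative only.

```python
import pandas as pd

# An IntervalIndex knows how to locate a scalar inside one of its intervals.
intervals = pd.interval_range(start=0, end=2, periods=3)

# Forcing object dtype keeps the same Interval objects but yields a plain Index
# (assumption: this mirrors what happens to the .to_numpy() coordinate).
as_objects = pd.Index(intervals.to_numpy(), dtype=object)

# Interval-aware lookup: 1 lies inside the second interval.
loc_interval = intervals.get_loc(1)

# Plain object lookup: 1 is not itself an element, so the hash lookup fails.
try:
    as_objects.get_loc(1)
    object_lookup_failed = False
except KeyError:
    object_lookup_failed = True

print(type(intervals).__name__, type(as_objects).__name__)
print(loc_interval, object_lookup_failed)
```

Both indexes hold equal Interval objects, which would explain why the reprs match while label lookup differs.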
Environment:
Output of xr.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.8.6 | packaged by conda-forge | (default, Oct 7 2020, 19:08:05) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 4.12.14-lp150.12.82-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.16.1
pandas: 1.1.4
numpy: 1.19.4
scipy: 1.5.3
netCDF4: 1.5.4
pydap: None
h5netcdf: 0.8.1
h5py: 3.1.0
Nio: None
zarr: 2.5.0
cftime: 1.2.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.1.7
cfgrib: None
iris: None
bottleneck: None
dask: 2.30.0
distributed: 2.30.1
matplotlib: 3.3.2
cartopy: 0.18.0
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20201009
pip: 20.2.4
conda: installed
pytest: 6.1.2
IPython: 7.19.0
sphinx: 3.3.0
Thanks for the clear issue, @gerritholl. I agree: it's confusing if those two look the same.
Currently, one way of discriminating them:
In [6]: da1.indexes['x']
Out[6]:
IntervalIndex([(0.0, 0.6666666666666666], (0.6666666666666666, 1.3333333333333333], (1.3333333333333333, 2.0]],
closed='right',
name='x',
dtype='interval[float64]')
In [7]: da2.indexes['x']
Out[7]:
Index([ (0.0, 0.6666666666666666],
(0.6666666666666666, 1.3333333333333333],
(1.3333333333333333, 2.0]],
dtype='object', name='x')
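For use in a script rather than an interactive session, the same check could be wrapped in a small helper. This is only a sketch: index_summary is a hypothetical name, not part of xarray, and it relies only on the DataArray.indexes mapping shown above.

```python
import pandas
import xarray


def index_summary(da, name):
    """Return (index class name, index dtype) for the index backing a coordinate."""
    index = da.indexes[name]
    return type(index).__name__, str(index.dtype)


da1 = xarray.DataArray(
    [0, 1, 2],
    dims=("x",),
    coords={"x": pandas.interval_range(0, 2, 3)},
)
da2 = xarray.DataArray(
    [0, 1, 2],
    dims=("x",),
    coords={"x": pandas.interval_range(0, 2, 3).to_numpy()},
)

summary1 = index_summary(da1, "x")
summary2 = index_summary(da2, "x")
print(summary1)
print(summary2)
```

Comparing the two tuples shows how the arrays differ, which equals alone does not report.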
One option is to push the dtype (interval[float64] vs object) or the Index type (IntervalIndex vs Index) into the repr of the array:
In [8]: da1
Out[8]:
<xarray.DataArray (x: 3)>
array([0, 1, 2])
Coordinates:
* x (x) object (0.0, 0.6666666666666666] ... (1.3333333333333333, 2.0]
Could be:
* x (x) interval[float64] (0.0, 0.6666666666666666] ... (1.3333333333333333, 2.0]
What are others' thoughts?
And ref https://github.com/pydata/xarray/projects/1
Perhaps Xarray has been too clever so far regarding how it handles pandas objects passed directly as coordinate data? pandas.MultiIndex objects are handled in a specific way too, which is often hard to deal with.
Expanding on @max-sixty's suggestion, we could:
- treat all coordinate data as duck arrays, i.e., in the example above handle da1 just like da2 (no more special cases for pandas objects)
- provide an xarray.indexes.PandasIntervalIndex wrapper, which would inherit from xarray.indexes.PandasIndex with a few additional options and features, e.g., like the ones @dcherian suggests in https://github.com/pydata/xarray/discussions/6783#discussioncomment-3149033
- build an interval index from an existing coordinate using, e.g., da.set_xindex("x", PandasIntervalIndex, closed="right")
- figure out how to assign both a coordinate and an index from an existing pandas.IntervalIndex object in a convenient but more explicit way
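PandasIntervalIndex above is a proposal and is not assumed to exist. Until something like it is available, a possible workaround (a sketch using only pandas.IntervalIndex and DataArray.assign_coords, and assuming the coordinate values are still Interval objects) is to rebuild the interval index from the object-dtype coordinate and reassign it:

```python
import pandas
import xarray

# Same construction as da2 in the report: intervals stored as an object array.
da2 = xarray.DataArray(
    [0, 1, 2],
    dims=("x",),
    coords={"x": pandas.interval_range(0, 2, 3).to_numpy()},
)

# Rebuild a real IntervalIndex from the object-dtype coordinate values.
restored = pandas.IntervalIndex(da2["x"].values)

# Reassign the coordinate so the index backing "x" is interval-aware again.
da3 = da2.assign_coords(x=restored)

# Label-based selection by a contained value is expected to work again.
selected = da3.sel(x=1)
print(type(da3.indexes["x"]).__name__, selected.item())
```

This relies on the current special handling of pandas objects passed as coordinates, so it would change if that special-casing were removed as suggested in the first bullet.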