xarray
sel by slice not working for multi-index containing float-values
What happened?
da = xr.DataArray(np.random.rand(4), {'x': np.arange(4)})
da = da.assign_coords(y=('x', np.linspace(0, 1, 4)))
da = da.assign_coords(z=('x', np.arange(4) + 4))
da.set_index(x=["y", "z"]).sel(y=slice(None, 0.5))
fails with
TypeError: float() argument must be a string or a real number, not 'slice'
What did you expect to happen?
In v2022.3, this yields the correct sliced selection. In v2022.6, it also works for multi-indexes without float values:
da = xr.DataArray(np.random.rand(4), {'x': np.arange(4)})
da = da.assign_coords(y=('x', np.arange(4)))
da = da.assign_coords(z=('x', np.arange(4) + 4))
da.set_index(x=["y", "z"]).sel(y=slice(None, 2), z=slice(5, None))
(although the resulting coordinates look a bit odd, since they contain slices). Sliced selection on a regular float-based index also works in v2022.6:
da = xr.DataArray(np.random.rand(4), {'x': np.linspace(0, 1, 4)})
da.sel(x=slice(None, 0.5))
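For comparison, here is a minimal pandas-only sketch of the same float slice lookup on a plain (non-multi) index. It only illustrates that pandas can resolve a slice bound such as 0.5 that is not itself a label in a sorted float index; the variable names are illustrative, and whether xarray takes exactly this route internally is an assumption.

```python
import numpy as np
import pandas as pd

# Same float labels as the 'x' coordinate in the example above.
index = pd.Index(np.linspace(0, 1, 4), name="x")

# slice_indexer converts a label-based slice into a positional slice.
# The stop label 0.5 is not in the index, which is fine for a sorted index.
positions = index.slice_indexer(None, 0.5)

# Labels that fall inside the requested range.
selected = index[positions]
print(positions, list(selected))
```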
Minimal Complete Verifiable Example
import numpy as np
import xarray as xr
da = xr.DataArray(np.random.rand(4), {'x': np.arange(4)})
da = da.assign_coords(y=('x', np.linspace(0, 1, 4)))
da = da.assign_coords(z=('x', np.arange(4) + 4))
da.set_index(x=["y", "z"]).sel(y=slice(None, 0.5))
MVCE confirmation
- [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [X] Complete example — the example is self-contained, including all data and the text of any traceback.
- [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
Relevant log output
File c:\mambaforge\envs\dev\lib\site-packages\xarray\core\dataarray.py:1420, in DataArray.sel(self, indexers, method, tolerance, drop, **indexers_kwargs)
1310 def sel(
1311 self: T_DataArray,
1312 indexers: Mapping[Any, Any] = None,
(...)
1316 **indexers_kwargs: Any,
1317 ) -> T_DataArray:
1318 """Return a new DataArray whose data is given by selecting index
1319 labels along the specified dimension(s).
1320
(...)
1418 Dimensions without coordinates: points
1419 """
-> 1420 ds = self._to_temp_dataset().sel(
1421 indexers=indexers,
1422 drop=drop,
1423 method=method,
1424 tolerance=tolerance,
...
197 # see https://github.com/pydata/xarray/issues/5727
--> 198 value = np.asarray(value, dtype=dtype)
199 return value
TypeError: float() argument must be a string or a real number, not 'slice'
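The last frame of the traceback can be reproduced with NumPy alone. This is a minimal sketch, assuming the slice label reaches `np.asarray` unchanged together with the float dtype of the `y` level; the variable names are illustrative.

```python
import numpy as np

# The label passed to .sel() for the float-valued level.
label = slice(None, 0.5)

try:
    # Casting the label to the coordinate dtype fails for a slice object.
    np.asarray(label, dtype=np.float64)
except TypeError as err:
    error_message = str(err)
else:
    error_message = None

print(error_message)
```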
Anything else we need to know?
Maybe related to #6836
Environment
INSTALLED VERSIONS
commit: None
python: 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 06:57:19) [MSC v.1929 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: ('de_DE', 'cp1252')
libhdf5: None
libnetcdf: None

xarray: 2022.6.0
pandas: 1.4.3
numpy: 1.23.1
scipy: 1.8.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.5.2
cartopy: None
seaborn: 0.11.2
numbagg: None
fsspec: None
cupy: None
pint: 0.19.2
sparse: None
flox: None
numpy_groupies: None
setuptools: 63.2.0
pip: 22.2
conda: None
pytest: 7.1.2
IPython: 8.4.0
sphinx: 4.5.0
Thanks for the report @nunupeke. That's definitely a regression. I'm not that surprised actually as the logic behind .sel() is already quite convoluted for the case of (pandas) multi-indexes :-).
> only that the resulting coordinates look a bit weird, containing slices
I'm working on this issue right now and I see this too.
I don't think that passing slice objects for multi-index level coordinates in .sel() is something we've really expected to work so far. We expect scalar values instead, and we explicitly raise a ValueError when multiple values are passed:
da
# <xarray.DataArray (x: 4)>
# array([0.30120807, 0.43951659, 0.19163508, 0.57251755])
# Coordinates:
# * x (x) object MultiIndex
# * y (x) float64 0.0 0.3333 0.6667 1.0
# * z (x) int64 4 5 6 7
da.sel(z=[4, 5])
# ValueError: Vectorized selection is not available along coordinate 'z' (multi-index level)
The fact that it used to work with slices looks like a side effect. For example, providing a slice for only one of the level coordinates drops that coordinate in the resulting dataset (v2022.3.0):
da.sel(y=slice(None, 0.5))
# <xarray.DataArray (z: 2)>
# array([0.4004091 , 0.11179854])
# Coordinates:
# * z (z) int64 4 5
#
# The 'y' coord is missing! It should be still there.
I think that we could support slices in a clean way (maybe any sequence of values too?) by reusing pandas.MultiIndex.get_locs() internally. However, in that case it would be hard to automatically collapse one or more multi-index levels, e.g.,
# note the difference between
da.sel(y=0.0, z=slice(None))
# <xarray.DataArray (x: 1)>
# array([0.08696024])
# Coordinates:
# * x (x) object MultiIndex
# * y (x) float64 0.0
# * z (x) int64 4
# and
da.sel(y=0.0)
# <xarray.DataArray (z: 1)>
# array([0.08696024])
# Coordinates:
# * z (z) int64 4
# y float64 0.0
# the 1st one calls pandas.MultiIndex.get_locs, while
# the 2nd one calls pandas.MultiIndex.get_loc_level
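The difference between the two pandas methods can be seen without xarray. The following is a small sketch on a multi-index built from the same level values as above; the variable names are illustrative, and it shows only the pandas behaviour, not how xarray would wrap it.

```python
import numpy as np
import pandas as pd

# Same level values as the 'y' and 'z' coordinates in the example.
midx = pd.MultiIndex.from_arrays(
    [np.linspace(0, 1, 4), np.arange(4) + 4], names=["y", "z"]
)

# get_locs takes one entry per level (scalars or slices here) and
# returns integer positions; no level is dropped.
locs = midx.get_locs([slice(None, 0.5), slice(None)])

# get_loc_level takes a scalar for one level and returns both an indexer
# and the remaining index with that level dropped.
indexer, remaining = midx.get_loc_level(0.0, level="y")

print(list(locs))
print(remaining.name, list(remaining))
```

Because `get_locs` returns only positions, the caller has to decide separately whether a level selected by a scalar should be collapsed, which is the difficulty mentioned above.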
@nunupeke The TypeError in your example should be fixed in #7004, which also improves how slice objects are handled in general for a multi-index.