sel by slice not working for multi-index containing float-values

Open nunupeke opened this issue 3 years ago • 3 comments

What happened?

da = xr.DataArray(np.random.rand(4), {'x': np.arange(4)})
da = da.assign_coords(y=('x', np.linspace(0, 1, 4)))
da = da.assign_coords(z=('x', np.arange(4) + 4))
da.set_index(x=["y", "z"]).sel(y=slice(None, 0.5))

fails with

TypeError: float() argument must be a string or a real number, not 'slice'

What did you expect to happen?

In v2022.3, this yields the correct sliced selection. In v2022.6, it also still works for multi-indexes without float values:

da = xr.DataArray(np.random.rand(4), {'x': np.arange(4)})
da = da.assign_coords(y=('x', np.arange(4)))
da = da.assign_coords(z=('x', np.arange(4) + 4))
da.set_index(x=["y", "z"]).sel(y=slice(None, 2), z=slice(5, None))

(only that the resulting coordinates look a bit weird, containing slices). Also, the sliced selection for a regular float-based index works in v2022.6:

da = xr.DataArray(np.random.rand(4), {'x': np.linspace(0, 1, 4)})
da.sel(x=slice(None, 0.5))

Minimal Complete Verifiable Example

import numpy as np
import xarray as xr

da = xr.DataArray(np.random.rand(4), {'x': np.arange(4)})
da = da.assign_coords(y=('x', np.linspace(0, 1, 4)))
da = da.assign_coords(z=('x', np.arange(4) + 4))
da.set_index(x=["y", "z"]).sel(y=slice(None, 0.5))

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

File c:\mambaforge\envs\dev\lib\site-packages\xarray\core\dataarray.py:1420, in DataArray.sel(self, indexers, method, tolerance, drop, **indexers_kwargs)
   1310 def sel(
   1311     self: T_DataArray,
   1312     indexers: Mapping[Any, Any] = None,
   (...)
   1316     **indexers_kwargs: Any,
   1317 ) -> T_DataArray:
   1318     """Return a new DataArray whose data is given by selecting index
   1319     labels along the specified dimension(s).
   1320 
   (...)
   1418     Dimensions without coordinates: points
   1419     """
-> 1420     ds = self._to_temp_dataset().sel(
   1421         indexers=indexers,
   1422         drop=drop,
   1423         method=method,
   1424         tolerance=tolerance,
...
    197     # see https://github.com/pydata/xarray/issues/5727
--> 198     value = np.asarray(value, dtype=dtype)
    199 return value

TypeError: float() argument must be a string or a real number, not 'slice'
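
The last frame of the traceback can be reproduced without xarray. The following is a minimal sketch, assuming the failing call receives the `slice` object itself as `value` and a float `dtype` (the variable names here are illustrative and not taken from xarray's source):

```python
import numpy as np

# A label-based slice like the one passed to .sel(y=slice(None, 0.5)).
label = slice(None, 0.5)

try:
    # Coercing a slice object to a float array cannot succeed:
    # NumPy tries to convert the object to a float and gives up.
    np.asarray(label, dtype=np.float64)
except TypeError as err:
    message = str(err)
else:
    message = None

print(message)
```

With an integer level, the same coercion is presumably never attempted on the slice, which would fit the report that only float-valued levels fail.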

Anything else we need to know?

Maybe related to #6836

Environment

INSTALLED VERSIONS

commit: None
python: 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 06:57:19) [MSC v.1929 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: ('de_DE', 'cp1252')
libhdf5: None
libnetcdf: None

xarray: 2022.6.0
pandas: 1.4.3
numpy: 1.23.1
scipy: 1.8.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.5.2
cartopy: None
seaborn: 0.11.2
numbagg: None
fsspec: None
cupy: None
pint: 0.19.2
sparse: None
flox: None
numpy_groupies: None
setuptools: 63.2.0
pip: 22.2
conda: None
pytest: 7.1.2
IPython: 8.4.0
sphinx: 4.5.0

nunupeke avatar Jul 27 '22 15:07 nunupeke

Thanks for the report @nunupeke. That's definitely a regression. I'm not that surprised actually as the logic behind .sel() is already quite convoluted for the case of (pandas) multi-indexes :-).

benbovy avatar Jul 29 '22 12:07 benbovy

> only that the resulting coordinates look a bit weird, containing slices

I'm working on this issue right now and I see this too.

I don't think that passing slice objects for multi-index level coordinates in .sel() is something we have ever really expected to work. We expect scalar values instead, and we explicitly raise a ValueError when multiple values are passed:

da
# <xarray.DataArray (x: 4)>
# array([0.30120807, 0.43951659, 0.19163508, 0.57251755])
# Coordinates:
#   * x        (x) object MultiIndex
#   * y        (x) float64 0.0 0.3333 0.6667 1.0
#   * z        (x) int64 4 5 6 7

da.sel(z=[4, 5])
# ValueError: Vectorized selection is not available along coordinate 'z' (multi-index level)

The fact that it used to work with slices looks like a side effect. For example, providing a slice for only one of the level coordinates drops that coordinate in the resulting dataset (v2022.3.0):

da.sel(y=slice(None, 0.5))
# <xarray.DataArray (z: 2)>
# array([0.4004091 , 0.11179854])
# Coordinates:
#  * z        (z) int64 4 5
# 
# The 'y' coord is missing! It should still be there.

I think that we could support slices in a clean way (maybe any sequence of values too?) by reusing pandas.MultiIndex.get_locs() internally. However, in that case it would be hard to automatically collapse one or more multi-index levels, e.g.,

# note the difference between

da.sel(y=0.0, z=slice(None))
# <xarray.DataArray (x: 1)>
# array([0.08696024])
# Coordinates:
#   * x        (x) object MultiIndex
#   * y        (x) float64 0.0
#   * z        (x) int64 4

# and

da.sel(y=0.0)
# <xarray.DataArray (z: 1)>
# array([0.08696024])
# Coordinates:
#  * z        (z) int64 4
#    y        float64 0.0

# the 1st one calls pandas.MultiIndex.get_locs, while
# the 2nd one calls pandas.MultiIndex.get_loc_level
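
The difference between the two pandas methods can be seen directly on a plain `pandas.MultiIndex`. This is a minimal sketch that rebuilds the same level values as the example above; it only illustrates the pandas behaviour and is not how xarray wires these calls internally:

```python
import numpy as np
import pandas as pd

# Same level values as in the issue: a float level "y" and an integer level "z".
midx = pd.MultiIndex.from_arrays(
    [np.linspace(0, 1, 4), np.arange(4) + 4], names=["y", "z"]
)

# get_locs takes one entry per level (slices are allowed) and returns
# integer positions; every level is kept.
locs = midx.get_locs([slice(None, 0.5), slice(None)])

# get_loc_level takes a scalar key for one level and returns both the
# location and a new index with that level dropped.
loc, remaining = midx.get_loc_level(0.0, level="y")

print(locs)
print(remaining)
```

Because `get_locs` never drops a level, collapsing `y` to a scalar coordinate (as in the second `.sel()` call above) would need extra logic on top of it.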

benbovy avatar Sep 07 '22 13:09 benbovy

@nunupeke The TypeError in your example should be fixed in #7004, which also improves how slice objects are handled in general for a multi-index.
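
For versions that do not yet include that fix, one possible workaround is to skip label-based slicing on the level and select by position instead. The sketch below assumes the level values can be read through `stacked["y"]`; the names `stacked`, `mask` and `subset` are illustrative only:

```python
import numpy as np
import xarray as xr

# Rebuild the array from the report, with the dimension named explicitly.
da = xr.DataArray(np.random.rand(4), coords={"x": np.arange(4)}, dims="x")
da = da.assign_coords(y=("x", np.linspace(0, 1, 4)))
da = da.assign_coords(z=("x", np.arange(4) + 4))
stacked = da.set_index(x=["y", "z"])

# Build a boolean mask from the float level, then index by integer position,
# which avoids passing a slice of float labels to .sel().
mask = stacked["y"].values <= 0.5
subset = stacked.isel(x=np.flatnonzero(mask))

print(subset)
```

This gives up the convenience of `.sel()`, but it does not depend on how slices are handled for multi-index levels.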

benbovy avatar Sep 07 '22 15:09 benbovy