xarray
Using a tuple as a sequence in DataArray.sel no longer supported?
What happened?
Version 2022.6.0 produces an error when I try something like data_array.sel(coordinate=(val1, val2)). This now works only if the sequence values are provided as a list instead.
What did you expect to happen?
In previous versions, tuples could also be supplied. After digging into this a bit, I understand that tuples come with some general limitations (or rather, they are sometimes overloaded). For example, in any version I can't use a tuple as an input coordinate when initializing a DataArray: I get the error Could not convert tuple of form (dims, data[, attrs, encoding]) (this is known). I still wanted to report this as a bug because the behavior in 2022.6.0 differs from previous versions, and to clarify whether not supporting tuples as sel coordinates is expected. This is not clear from the error message or from the docs. The example below works on < 2022.6.0 but raises an error on 2022.6.0.
Minimal Complete Verifiable Example
import xarray as xr
import numpy as np
arr = xr.DataArray(data=np.random.rand(10), coords={"c1": np.arange(10, dtype=np.float64)})
item = arr.sel(c1=(1, 2))
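As a possible stopgap on 2022.6.0, tuple indexers can be converted to lists before they reach .sel. The sketch below is only an illustration: tuples_to_lists is a hypothetical helper, not part of xarray, and it should not be applied to a multi-index coordinate, where a tuple denotes a single key.

```python
def tuples_to_lists(indexers):
    """Return a copy of `indexers` with every tuple value converted to a list.

    Hypothetical helper (not part of xarray). Do not use it for multi-index
    coordinates, where a tuple is a single composite key.
    """
    return {
        name: list(value) if isinstance(value, tuple) else value
        for name, value in indexers.items()
    }


# Tuples become lists; slices and scalars are left untouched.
indexers = tuples_to_lists({"c1": (1, 2), "c2": slice(0, 3), "c3": 5})
print(indexers)
```

With the array from the example above, the call would then read arr.sel(**tuples_to_lists({"c1": (1, 2)})).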
MVCE confirmation
- [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [X] Complete example — the example is self-contained, including all data and the text of any traceback.
- [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
Relevant log output
No response
Anything else we need to know?
No response
Environment
INSTALLED VERSIONS
commit: None python: 3.9.12 (main, Jun 1 2022, 11:38:51) [GCC 7.5.0] python-bits: 64 OS: Linux OS-release: 5.13.0-52-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.12.2 libnetcdf: None
xarray: 2022.6.0 pandas: 1.4.3 numpy: 1.23.0 scipy: 1.8.1 netCDF4: None pydap: None h5netcdf: None h5py: 3.7.0 Nio: None zarr: None cftime: None nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: 2022.6.1 distributed: None matplotlib: 3.5.2 cartopy: None seaborn: None numbagg: None fsspec: 2022.5.0 cupy: None pint: None sparse: None flox: None numpy_groupies: None setuptools: 61.2.0 pip: 21.2.4 conda: None pytest: 7.1.2 IPython: 8.4.0 sphinx: None
Thanks for the report @momchil-flex. That's definitely a regression.
However, I wonder what we should do: deprecate interpreting tuples as sequences and always treat them as "scalar" values, or continue interpreting them differently depending on the case?
For example, tuple indexer values were (and still are) assumed to be single-element values when selecting on a dimension coordinate with a multi-index (although the multi-index dimension coordinate might eventually be deprecated in xarray):
da = xr.DataArray(
data=range(3),
dims="x",
coords={"a": ("x", ["a", "a", "c"]), "b": ("x", [0, 1, 2])},
).set_index(x=["a", "b"])
da
# <xarray.DataArray (x: 3)>
# array([0, 1, 2])
# Coordinates:
# * x (x) object MultiIndex
# * a (x) <U1 'a' 'a' 'c'
# * b (x) int64 0 1 2
da.sel(x=("a", 1))
# <xarray.DataArray ()>
# array(1)
# Coordinates:
# x object ('a', 1)
# a <U1 'a'
# b int64 1
Pros of always treating a tuple as a 1-element indexer value:
- Clearer
- Fewer special cases to maintain internally in Xarray
Cons:
- With flexible indexes, Xarray currently just passes the indexers to the corresponding (custom) indexes, leaving it to those indexes to process them as they want. Although we have some control over the behavior of the PandasIndex and PandasMultiIndex built into Xarray, we have no control over 3rd party indexes, unless we somehow formalize the semantics of the indexer values passed to .sel(). That could be challenging, as there are many kinds of indexers (scalar types, tuples, lists, slices, numpy arrays, xarray Variable or DataArray objects, etc.).
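To make the pass-through behavior concrete, here is a toy sketch in plain Python. ToyIndex, ToyMultiIndex and the sel function are hypothetical stand-ins, not xarray's actual classes or API; they only show how the same tuple can mean different things depending on which index receives it.

```python
class ToyIndex:
    """Hypothetical single-level index: a tuple or list is a sequence of labels."""

    def __init__(self, labels):
        self.labels = list(labels)

    def sel(self, label):
        if isinstance(label, (tuple, list)):
            return [self.labels.index(item) for item in label]
        return self.labels.index(label)


class ToyMultiIndex:
    """Hypothetical multi-level index: a tuple is one composite key."""

    def __init__(self, keys):
        self.keys = [tuple(key) for key in keys]

    def sel(self, label):
        return self.keys.index(tuple(label))


def sel(indexes, **indexers):
    """Pass each indexer value, unmodified, to the index that owns the coordinate."""
    return {name: indexes[name].sel(label) for name, label in indexers.items()}


indexes = {
    "c1": ToyIndex([0.0, 1.0, 2.0]),
    "x": ToyMultiIndex([("a", 0), ("a", 1), ("c", 2)]),
}
# The tuple for "c1" selects two positions; the tuple for "x" selects one.
result = sel(indexes, c1=(1, 2), x=("a", 1))
print(result)  # {'c1': [1, 2], 'x': 1}
```

The dispatching function never inspects the indexer values, which is why their semantics end up being defined by each index rather than by .sel itself.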
I like the idea of just passing tuples through and letting the index deal with it. Just like a MultiIndex, there may be other cases where this makes sense.
For the current PandasIndex maybe we can raise a nicer error in .sel?
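A rough sketch of what such a check could look like, assuming the decision is to reject tuples for a single-level index. The function name reject_tuple_label and the message wording are made up for illustration and do not reflect xarray's actual code or error text.

```python
def reject_tuple_label(coord_name, label):
    """Hypothetical check: raise a descriptive error for a tuple label.

    Intended for a single-level index only; any non-tuple label is
    returned unchanged.
    """
    if isinstance(label, tuple):
        raise TypeError(
            f"cannot use the tuple {label!r} to select along {coord_name!r}: "
            "tuples are reserved for multi-index keys; "
            f"pass a list such as {list(label)!r} instead"
        )
    return label


try:
    reject_tuple_label("c1", (1, 2))
except TypeError as err:
    message = str(err)
    print(message)
```

The message names the coordinate and suggests the list form, so a user hitting the regression would know how to adapt the call.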