xarray
unstack confusing re `Variable` / `IndexVariable`
What happened?
Using unstack on a DataArray generated with the .dt.daysinmonth accessor, with time as a MultiIndex, fails with a ValueError. The mysterious part is that when I build an "identical" DataArray starting from the .data of that same array, it works as expected (see output of example code).
I asked a colleague for help with this, and she said the attached code worked with older versions of xarray but seems to be broken starting at 2023.5.0.
What did you expect to happen?
Expected to get a DataArray (days0) with dimensions ('year', 'month') with sizes (2, 12), which is what I get with the alternate DataArray (called days).
Minimal Complete Verifiable Example
import sys
print(f"python {sys.version}")
import xarray as xr
import numpy as np
import cftime
print(f"numpy: {np.__version__}, xarray: {xr.__version__}, cftime: {cftime.__version__}")
t = np.array([cftime.DatetimeGregorian(1979, 1, 1, 0, 0, 0, 0, has_year_zero=False),
cftime.DatetimeGregorian(1979, 2, 1, 0, 0, 0, 0, has_year_zero=False),
cftime.DatetimeGregorian(1979, 3, 1, 0, 0, 0, 0, has_year_zero=False),
cftime.DatetimeGregorian(1979, 4, 1, 0, 0, 0, 0, has_year_zero=False),
cftime.DatetimeGregorian(1979, 5, 1, 0, 0, 0, 0, has_year_zero=False),
cftime.DatetimeGregorian(1979, 6, 1, 0, 0, 0, 0, has_year_zero=False),
cftime.DatetimeGregorian(1979, 7, 1, 0, 0, 0, 0, has_year_zero=False),
cftime.DatetimeGregorian(1979, 8, 1, 0, 0, 0, 0, has_year_zero=False),
cftime.DatetimeGregorian(1979, 9, 1, 0, 0, 0, 0, has_year_zero=False),
cftime.DatetimeGregorian(1979, 10, 1, 0, 0, 0, 0, has_year_zero=False),
cftime.DatetimeGregorian(1979, 11, 1, 0, 0, 0, 0, has_year_zero=False),
cftime.DatetimeGregorian(1979, 12, 1, 0, 0, 0, 0, has_year_zero=False),
cftime.DatetimeGregorian(1980, 1, 1, 0, 0, 0, 0, has_year_zero=False),
cftime.DatetimeGregorian(1980, 2, 1, 0, 0, 0, 0, has_year_zero=False),
cftime.DatetimeGregorian(1980, 3, 1, 0, 0, 0, 0, has_year_zero=False),
cftime.DatetimeGregorian(1980, 4, 1, 0, 0, 0, 0, has_year_zero=False),
cftime.DatetimeGregorian(1980, 5, 1, 0, 0, 0, 0, has_year_zero=False),
cftime.DatetimeGregorian(1980, 6, 1, 0, 0, 0, 0, has_year_zero=False),
cftime.DatetimeGregorian(1980, 7, 1, 0, 0, 0, 0, has_year_zero=False),
cftime.DatetimeGregorian(1980, 8, 1, 0, 0, 0, 0, has_year_zero=False),
cftime.DatetimeGregorian(1980, 9, 1, 0, 0, 0, 0, has_year_zero=False),
cftime.DatetimeGregorian(1980, 10, 1, 0, 0, 0, 0, has_year_zero=False),
cftime.DatetimeGregorian(1980, 11, 1, 0, 0, 0, 0, has_year_zero=False),
cftime.DatetimeGregorian(1980, 12, 1, 0, 0, 0, 0, has_year_zero=False)])
dss = xr.DataArray(t, dims=['time'], coords={"time":t})
# TWO VERSIONS OF "days":
days0 = dss['time'].dt.daysinmonth
days = xr.DataArray(dss['time'].dt.daysinmonth.data, dims=['time'], coords={'time':dss['time']}, attrs=days0.attrs, name='days_in_month')
print(f"IDENTICAL: {days.identical(days0)}")
year = dss['time'].dt.year.data
month = dss['time'].dt.month.data
# REPEAT SAME STEPS FOR days and days0:
days = days.assign_coords(year=("time", year), month=("time", month))
days = days.set_index(time=['year', 'month'])
days0 = days0.assign_coords(year=("time", year), month=("time", month))
days0 = days0.set_index(time=['year', 'month'])
print(f"IDENTICAL: {days.identical(days0)}")
days = days.unstack('time') # THIS WORKS
print(f"{days.dims = }")
#
days0 = days0.unstack('time') # THIS FAILS
print(f"{days0.dims = }")
MVCE confirmation
- [ ] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [ ] Complete example — the example is self-contained, including all data and the text of any traceback.
- [ ] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- [ ] New issue — a search of GitHub Issues suggests this is not a duplicate.
- [ ] Recent environment — the issue occurs with the latest version of xarray and its dependencies.
Relevant log output
python 3.12.0 | packaged by conda-forge | (main, Oct 3 2023, 08:36:57) [Clang 15.0.7 ]
numpy: 1.26.4, xarray: 2024.5.0, cftime: 1.6.3
IDENTICAL: True
IDENTICAL: True
days.dims = ('year', 'month')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[3], line 55
53 print(f"{days.dims = }")
54 #
---> 55 days0 = days0.unstack('time') # THIS FAILS
56 print(f"{days0.dims = }")
File ~/opt/miniconda3/envs/p12/lib/python3.12/site-packages/xarray/util/deprecation_helpers.py:115, in _deprecate_positional_args.<locals>._decorator.<locals>.inner(*args, **kwargs)
111 kwargs.update({name: arg for name, arg in zip_args})
113 return func(*args[:-n_extra_args], **kwargs)
--> 115 return func(*args, **kwargs)
File ~/opt/miniconda3/envs/p12/lib/python3.12/site-packages/xarray/core/dataarray.py:2950, in DataArray.unstack(self, dim, fill_value, sparse)
2888 @_deprecate_positional_args("v2023.10.0")
2889 def unstack(
2890 self,
(...)
2894 sparse: bool = False,
2895 ) -> Self:
2896 """
2897 Unstack existing dimensions corresponding to MultiIndexes into
2898 multiple new dimensions.
(...)
2948 DataArray.stack
2949 """
-> 2950 ds = self._to_temp_dataset().unstack(dim, fill_value=fill_value, sparse=sparse)
2951 return self._from_temp_dataset(ds)
File ~/opt/miniconda3/envs/p12/lib/python3.12/site-packages/xarray/util/deprecation_helpers.py:115, in _deprecate_positional_args.<locals>._decorator.<locals>.inner(*args, **kwargs)
111 kwargs.update({name: arg for name, arg in zip_args})
113 return func(*args[:-n_extra_args], **kwargs)
--> 115 return func(*args, **kwargs)
File ~/opt/miniconda3/envs/p12/lib/python3.12/site-packages/xarray/core/dataset.py:5663, in Dataset.unstack(self, dim, fill_value, sparse)
5659 result = result._unstack_full_reindex(
5660 d, stacked_indexes[d], fill_value, sparse
5661 )
5662 else:
-> 5663 result = result._unstack_once(d, stacked_indexes[d], fill_value, sparse)
5664 return result
File ~/opt/miniconda3/envs/p12/lib/python3.12/site-packages/xarray/core/dataset.py:5496, in Dataset._unstack_once(self, dim, index_and_vars, fill_value, sparse)
5493 else:
5494 fill_value_ = fill_value
-> 5496 variables[name] = var._unstack_once(
5497 index=clean_index,
5498 dim=dim,
5499 fill_value=fill_value_,
5500 sparse=sparse,
5501 )
5502 else:
5503 variables[name] = var
File ~/opt/miniconda3/envs/p12/lib/python3.12/site-packages/xarray/core/variable.py:1552, in Variable._unstack_once(self, index, dim, fill_value, sparse)
1547 # Indexer is a list of lists of locations. Each list is the locations
1548 # on the new dimension. This is robust to the data being sparse; in that
1549 # case the destinations will be NaN / zero.
1550 data[(..., *indexer)] = reordered
-> 1552 return self._replace(dims=new_dims, data=data)
File ~/opt/miniconda3/envs/p12/lib/python3.12/site-packages/xarray/core/variable.py:957, in Variable._replace(self, dims, data, attrs, encoding)
955 if encoding is _default:
956 encoding = copy.copy(self._encoding)
--> 957 return type(self)(dims, data, attrs, encoding, fastpath=True)
File ~/opt/miniconda3/envs/p12/lib/python3.12/site-packages/xarray/core/variable.py:2625, in IndexVariable.__init__(self, dims, data, attrs, encoding, fastpath)
2623 super().__init__(dims, data, attrs, encoding, fastpath)
2624 if self.ndim != 1:
-> 2625 raise ValueError(f"{type(self).__name__} objects must be 1-dimensional")
2627 # Unlike in Variable, always eagerly load values into memory
2628 if not isinstance(self._data, PandasIndexingAdapter):
ValueError: IndexVariable objects must be 1-dimensional
Anything else we need to know?
No response
Environment
INSTALLED VERSIONS
commit: None
python: 3.12.0 | packaged by conda-forge | (main, Oct 3 2023, 08:36:57) [Clang 15.0.7 ]
python-bits: 64
OS: Darwin
OS-release: 23.5.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: None
LOCALE: (None, 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2

xarray: 2024.5.0
pandas: 2.2.2
numpy: 1.26.4
scipy: 1.13.0
netCDF4: 1.6.5
pydap: None
h5netcdf: 1.3.0
h5py: 3.11.0
zarr: None
cftime: 1.6.3
nc_time_axis: 1.4.1
iris: None
bottleneck: 1.3.8
dask: 2024.5.0
distributed: 2024.5.0
matplotlib: 3.8.4
cartopy: 0.23.0
seaborn: None
numbagg: None
fsspec: 2024.5.0
cupy: None
pint: 0.24.1
sparse: 0.15.1
flox: None
numpy_groupies: None
setuptools: 69.5.1
pip: 24.0
conda: None
pytest: None
mypy: None
IPython: 8.24.0
sphinx: None
Can we strip off much more from the example?
I see days and days0 are quite different. Can we make them any more similar and still see the failure? Does it require using cftime?
This is maybe a more minimal example. It does not require cftime or times in general:
source = xr.DataArray(range(2), dims=["x"], coords=[["a", "b"]])
da = source.x
da = da.assign_coords(y=("x", ["c", "d"]), z=("x", ["e", "f"]))
da = da.set_index(x=["y", "z"])
da.unstack("x")
I think the issue relates to the fact that da.variable is an IndexVariable instead of a Variable. I'd have to do more digging to see if there was a time that this worked.
The v2023.5.0 breakpoint in the original example is maybe a bit of a red herring in that it appears that dss.time.dt.daysinmonth switched from returning a Variable-backed DataArray to an IndexVariable-backed DataArray at that time.
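For anyone who wants to tell the two cases apart on their own arrays, here is a minimal sketch (a workaround, not a fix in xarray itself). It checks the type of `.variable` and then rebuilds the array from its raw `.data`, the same trick the original report used for `days`. The integer coordinate values and the helper name `stack_then_unstack` are assumptions made for illustration only.

```python
import xarray as xr

# A coordinate pulled off a DataArray is backed by an IndexVariable.
source = xr.DataArray(range(2), dims=["x"], coords=[[10, 20]])
da = source.x
original_type = type(da.variable).__name__

# Rebuilding from the raw data gives a DataArray backed by a plain Variable.
rebuilt = xr.DataArray(da.data, dims=da.dims, coords={"x": da.data}, name=da.name)
rebuilt_type = type(rebuilt.variable).__name__


def stack_then_unstack(arr):
    """Hypothetical helper: build a MultiIndex on "x", then unstack it."""
    arr = arr.assign_coords(y=("x", ["c", "d"]), z=("x", ["e", "f"]))
    arr = arr.set_index(x=["y", "z"])
    return arr.unstack("x")


result = stack_then_unstack(rebuilt)
print(original_type, rebuilt_type, result.dims)
```

Calling `stack_then_unstack(da)` on the IndexVariable-backed array is the case that raises the ValueError in the versions discussed in this thread.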
Thanks @spencerkclark -- that's a better minimal example and diagnosis. I couldn't figure out how to tell the difference between days and days0.
Just to confirm, I went through and tested this out quickly and Spencer's example does indeed fail in an older version as well.
It looks like in the particular case Brian presented, dss.time was an IndexVariable, and in older (pre-v2023.5.0) versions dt.daysinmonth returned its result as a Variable. That is why the code worked previously. Now the result is returned as an IndexVariable, and unstack fails because of that. So this may be a case of it accidentally working before.
Yup, this is consistent with what I found. I should clarify that I'm not sure if the current behavior is intentional. It would be nice if the minimal example (and your more real-world use-case) worked.
Thanks @spencerkclark !
I updated the title — feel free to refine a bit more