xarray icon indicating copy to clipboard operation
xarray copied to clipboard

MultiIndex listed multiple times in Dataset.indexes property

Open lukasbindreiter opened this issue 1 year ago • 1 comments

What happened?

When upgrading to 2022.6.0.rc0 from 2022.3.0 I noticed a possible unexpected breaking change in the Dataset.indexes property. MultiIndices are now listed for each dimension they apply for as well as once for the multi index itself when accessing dataset.indexes.

What did you expect to happen?

Same behaviour as before, see example below.

Minimal Complete Verifiable Example

# execute with 2022.3.0 and 2022.6.0.rc0 to see the differences
import pandas
import xarray as xr

def _create_multiindex(**kwargs):
    return pandas.MultiIndex.from_arrays(list(kwargs.values()), names=kwargs.keys())


ds = xr.Dataset()
ds.coords["measurement"] = _create_multiindex(
    observation=["A", "A", "B", "B"],
    wavelength=[0.4, 0.5, 0.6, 0.7],
    stokes=["I", "Q", "I", "I"],
)

for name, idx in ds.indexes.items():
    print(name, idx)

MVCE confirmation

  • [x] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [x] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [x] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [x] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

Output with version 2022.3.0:

measurement MultiIndex([('A', 0.4, 'I'),
            ('A', 0.5, 'Q'),
            ('B', 0.6, 'I'),
            ('B', 0.7, 'I')],
           names=['observation', 'wavelength', 'stokes'])


Output with version 2022.6.0.rc0:

measurement MultiIndex([('A', 0.4, 'I'),
            ('A', 0.5, 'Q'),
            ('B', 0.6, 'I'),
            ('B', 0.7, 'I')],
           name='measurement')
observation MultiIndex([('A', 0.4, 'I'),
            ('A', 0.5, 'Q'),
            ('B', 0.6, 'I'),
            ('B', 0.7, 'I')],
           name='measurement')
wavelength MultiIndex([('A', 0.4, 'I'),
            ('A', 0.5, 'Q'),
            ('B', 0.6, 'I'),
            ('B', 0.7, 'I')],
           name='measurement')
stokes MultiIndex([('A', 0.4, 'I'),
            ('A', 0.5, 'Q'),
            ('B', 0.6, 'I'),
            ('B', 0.7, 'I')],
           name='measurement')

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None python: 3.8.10 (default, Jan 28 2022, 09:41:12) [GCC 9.3.0] python-bits: 64 OS: Linux OS-release: 5.10.102.1-microsoft-standard-WSL2 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: C.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.10.5 libnetcdf: 4.6.3

xarray: 2022.3.0 pandas: 1.4.3 numpy: 1.23.0 scipy: 1.9.0rc1 netCDF4: 1.5.4 pydap: None h5netcdf: None h5py: 3.7.0 Nio: None zarr: None cftime: 1.6.1 nc_time_axis: None PseudoNetCDF: None rasterio: 1.3b3 cfgrib: None iris: None bottleneck: None dask: None distributed: None matplotlib: 3.5.2 cartopy: 0.19.0.post1 seaborn: None numbagg: None fsspec: None cupy: None pint: None sparse: None setuptools: 56.0.0 pip: 21.3.1 conda: None pytest: 7.1.2 IPython: 8.4.0 sphinx: 4.5.0

INSTALLED VERSIONS

commit: None python: 3.8.10 (default, Jan 28 2022, 09:41:12) [GCC 9.3.0] python-bits: 64 OS: Linux OS-release: 5.10.102.1-microsoft-standard-WSL2 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: C.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.10.5 libnetcdf: 4.6.3

xarray: 2022.6.0rc0 pandas: 1.4.3 numpy: 1.23.0 scipy: 1.9.0rc1 netCDF4: 1.5.4 pydap: None h5netcdf: None h5py: 3.7.0 Nio: None zarr: None cftime: 1.6.1 nc_time_axis: None PseudoNetCDF: None rasterio: 1.3b3 cfgrib: None iris: None bottleneck: None dask: None distributed: None matplotlib: 3.5.2 cartopy: 0.19.0.post1 seaborn: None numbagg: None fsspec: None cupy: None pint: None sparse: None flox: None numpy_groupies: None setuptools: 56.0.0 pip: 21.3.1 conda: None pytest: 7.1.2 IPython: 8.4.0 sphinx: 4.5.0

lukasbindreiter avatar Jul 04 '22 18:07 lukasbindreiter

Thanks for trying out our pre-release @lukasbindreiter !

This is an intentional change.

Can you tell us more about why this breaks your code?

dcherian avatar Jul 05 '22 15:07 dcherian

The change is because starting from version 2022.6.0, multi-index level coordinates are no longer "virtual" but now correspond to real coordinates. The .indexes and .xindexes properties are mappings relating coordinates to their index.

There has been some discussions prior to the explicit indexes refactor about whether those properties should return a mapping of a unique vs. non-unique index objects. We choose the latter as it simplifies a lot of things internally (and perhaps externally too).

@lukasbindreiter although it is unlikely that we'll change this in the future, it would be interesting to get your feedback! How does this choice impact your workflow?

Note that both .indexes and .xindexes return an Indexes object, which has a convenient .get_unique() method that returns a list of unique index objects. It also has other convenient methods, although those are not well documented yet.

benbovy avatar Aug 31 '22 15:08 benbovy

We used the .indexes property to implement a workaround for serializing Datasets containing multiindices to netCDF. For this the implementation basically looked like this: (Inspired by and related to this issue: https://github.com/pydata/xarray/issues/1077)

Saving dataset as NetCDF:

  1. Loop over dataset.indexes
  2. Check if index is a multi index
  3. If so, encode it somehow and save it as attribute in the dataset
  4. Reset (remove) the index
  5. Now save this "patched" dataset as NetCDF

And then loading it again:

  1. Load the dataset
  2. Check if this special attribute exists
  3. If so decode the multiindex and set it as index in the dataset

When testing the pre-release version I noticed some of our tests failing, which is why I raised this issue in the first place - in case those changes were unwanted. I was not aware that you were actively working on multi index changes and therefore expecting API changes here. With that in mind I'll probably be able to adapt our code to this new API of indexes and xindexes.

lukasbindreiter avatar Sep 05 '22 08:09 lukasbindreiter

@benbovy I also just tested the get_unique() method that you mentioned and maybe noticed a related issue here, which I'm not sure is wanted / expected.

Taking the above dataset ds, accessing this function results in an error:

> ds.indexes.get_unique()

TypeError: unhashable type: 'MultiIndex'

However, for xindexes it works:

> ds.xindexes.get_unique()

[<xarray.core.indexes.PandasMultiIndex at 0x7f105bf1df20>]

lukasbindreiter avatar Sep 05 '22 08:09 lukasbindreiter

Thanks for the issue report @lukasbindreiter, I opened #6987. As a workaround, you could use Indexes.group_by_index(), which shouldn't have any hash issue and which might be better fitted for your use case.

Regarding (de)serialization (from)to netCDF or other formats, I wonder if building multi-indexes or other custom indexes when opening the dataset couldn't be done via some custom Xarray IO backend (https://docs.xarray.dev/en/stable/internals/how-to-add-new-backend.html). I'm not sure how easy / hard it is to implement a custom backend on top of an existing one, though. For the serialization, Xarray doesn't support custom writable backends (yet), but since multi-index levels are now real coordinates maybe a custom backend is not really needed. Right now Xarray raises NotImplementedError when trying to save a variable wrapping a multi-index, but probably we could just get rid of the multi-index "dimension" coordinate (tuple elements) and save level coordinates like any other variable.

benbovy avatar Sep 05 '22 09:09 benbovy

Thanks for the suggestions, I'll look into this Indexes.group_by_index and see if that is able to resolve our issue.

And with regards to the (de)serialization: I haven't investigated yet how the index changes in 2022.6 affect our initial usecase, maybe an approach such as the suggested with a custom backend may be even a better solution for us then. Though probably we would ideally need a way to override NetCDF4BackendEntrypoint with another Entrypoint as the default for .nc files

As for the original issue discussed here: That can probably be closed then, since it was an intentional change. But finding information about those changes right now was not so easy, is there some resource available where I can read up about the changes to indexes and functions related to them. (e.g. I was unaware of the existence of xindexes or Indexes.group_by_index)?

lukasbindreiter avatar Sep 05 '22 12:09 lukasbindreiter

But finding information about those changes right now was not so easy, is there some resource available where I can read up about the changes to indexes and functions related to them.

Not yet, this still has to be detailed in the documentation (tracked in #6293 along with other todo items related to indexes). The Indexes API already has some basic docstrings, though: https://github.com/pydata/xarray/blob/main/xarray/core/indexes.py#L1008-L1225

benbovy avatar Sep 05 '22 12:09 benbovy

That can probably be closed then, since it was an intentional change.

Yes I think we can close it. Thanks for your feedback and for the issue report!

benbovy avatar Sep 05 '22 12:09 benbovy