
to_base_variable: coerce multiindex data to numpy array

Open benbovy opened this issue 11 months ago • 3 comments

  • [x] Closes #8887, and probably supersedes #8809
  • [x] Tests added
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • ~~New functions/methods are listed in api.rst~~

@slevang this should also make the test case you added in #8809 work. I haven't added it here; instead I added a basic check that should be enough.

I don't really understand why the serialization backends (zarr?) do not seem to work with the PandasMultiIndexingAdapter.__array__() implementation, which should normally coerce the multi-index levels into numpy arrays as needed. Anyway, I guess that coercing it early like in this PR doesn't hurt and may avoid the confusion of a non-indexed, isolated coordinate variable that still wraps a pandas.MultiIndex.
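For readers less familiar with the coercion being discussed, here is a minimal sketch using pandas and numpy only. It does not reproduce xarray's adapter code; it just shows what "coercing the multi-index levels into numpy arrays" means for a plain pandas.MultiIndex. The variable names (`midx`, `as_tuples`, `dim1`) are illustrative, not taken from the PR.

```python
import numpy as np
import pandas as pd

# A small multi-index with the same level names as the example in this thread.
midx = pd.MultiIndex.from_product([[1, 2, 3], [4, 5, 6]], names=["dim1", "dim2"])

# Coercing the whole multi-index gives a 1-D object array of tuples.
as_tuples = np.asarray(midx)
print(as_tuples.dtype, as_tuples.shape)

# Coercing a single level gives a plain, typed numpy array with no
# reference back to the pandas.MultiIndex.
dim1 = midx.get_level_values("dim1").to_numpy()
print(type(dim1).__name__, dim1.tolist())
```

Once a level has been turned into a plain array like `dim1`, nothing downstream (a serialization backend, for instance) needs to know that a multi-index was ever involved, which is the appeal of coercing early.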

benbovy, Mar 29 '24 10:03

Thanks @benbovy, this seems good, but still doesn't fix my original issue in #8809. See comment there for more detail.

slevang, Mar 29 '24 14:03

This consistency check is still broken, though. I pushed it to this branch.

import numpy as np
import xarray as xr

# ND DataArray that gets stacked along a multiindex
da = xr.DataArray(np.ones((3, 3)), coords={"dim1": [1, 2, 3], "dim2": [4, 5, 6]})
da = da.stack(feature=["dim1", "dim2"])

# Extract just the stacked coordinates for saving in a dataset
ds = xr.Dataset(data_vars={"feature": da.feature})
xr.testing.assertions._assert_internal_invariants(ds.reset_index(["feature", "dim1", "dim2"]), check_default_indexes=False) # succeeds
xr.testing.assertions._assert_internal_invariants(ds.reset_index(["feature"]), check_default_indexes=False) # fails, but no warning either

dcherian, Mar 29 '24 14:03

Wow, it took me some time to figure that out:

ds = xr.Dataset(data_vars={"feature": da.feature})

So it detects the multi-index from da.feature, assigns it to the feature variable, auto-promotes the latter to a coordinate, and finally auto-creates coordinates and indexes for the multi-index levels. That's a lot happening under the hood! The internal logic for handling this is complicated, very fragile, and actually still buggy: in this case Xarray wrongly creates two separate Xarray indexes, one for the level coordinates and one for the "feature" dimension coordinate, so reset_index won't work as expected.

This is being addressed / discussed in #8140.
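As a point of comparison with the implicit path described above, here is a hedged sketch that builds the same coordinates explicitly instead of relying on auto-promotion. It assumes a recent xarray that provides `xr.Coordinates.from_pandas_multiindex` (2023.08 or later, as far as I know), and it is not necessarily the approach that #8140 settles on. `dims` is passed explicitly to avoid depending on dimension inference from the coords dict.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Same stacked array as in the example above, with explicit dims.
da = xr.DataArray(
    np.ones((3, 3)),
    dims=("dim1", "dim2"),
    coords={"dim1": [1, 2, 3], "dim2": [4, 5, 6]},
).stack(feature=["dim1", "dim2"])

# The stacked dimension is backed by a pandas.MultiIndex.
midx = da.indexes["feature"]
print(type(midx).__name__)

# Build the dimension and level coordinates in one explicit step,
# instead of passing da.feature as a data variable.
coords = xr.Coordinates.from_pandas_multiindex(midx, "feature")
ds = xr.Dataset(coords=coords)

# Coordinates that carry an index, and how many distinct index objects exist.
print(sorted(ds.xindexes))
print(len(ds.xindexes.get_unique()))
```

The last line is the one to look at when debugging cases like the one above: the dimension coordinate and its level coordinates are meant to share a single index object rather than being split across several.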

benbovy, Mar 29 '24 15:03