xarray
xarray copied to clipboard
Refactor swap dims
- [ ] Attempt at fixing #8646
- [ ] Tests added
- [ ] User visible changes (including notable bug fixes) are documented in
whats-new.rst - [ ] New functions/methods are listed in
api.rst
I've tried here re-implementing swap_dims using rename_dims, drop_indexes and set_xindex. This fixes the example in #8646 but unfortunately this fails at handling the pandas multi-index special case (i.e., a single non-dimension coordinate wrapping a pd.MultiIndex that is promoted to a dimension coordinate in swap-dims auto-magically results in a PandasMultiIndex with both dimension and level coordinates).
the pandas multi-index special case
I've been starting to think of multiindex as a model for a user defined custom index wrapping multiple variables.
Would we want to recreate a user's custom index with this swap_dims op? Probably not right?
Does this fix https://github.com/pydata/xarray/issues/8914 (found by hypothesis)
I've been starting to think of multiindex as a model for a user defined custom index wrapping multiple variables.
Yes this is a good model. That's what causes a lot of trouble since the index refactor as we transitioned from a dimension coordinate - with virtual level coordinates - to the model that you describe here.
Would we want to recreate a user's custom index with this swap_dims op? Probably not right?
No I don't think we want this. In both main and this PR no multi-index is recreated:
import pandas as pd
import xarray as xr
midx = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=["foo", "bar"])
coords = xr.Coordinates.from_pandas_multiindex(midx, "x")
coords["y"] = ("x", [1, 2, 3, 4])
ds = xr.Dataset(coords=coords)
ds.swap_dims(x="y")
# main: multi-index "x", "foo", "bar" is dropped (although the repr still shows "MultiIndex" for "x" values)
# this PR: ValueError, cannot remove index "x" which would corrupt index built from coordinates "x", "foo", "bar"
The error raised in this PR might make sense, or alternatively we can easily get all coordinates ("x", "foo", "bar") and drop the index.
However, like the Dataset constructor and a few other methods, swap_dims supports auto-detecting / auto-promoting pd.MultiIndex objects:
ds2 = xr.Dataset(coords={"y": ("x", midx), "x": [1, 2, 3, 4]})
ds2
# <xarray.Dataset> Size: 64B
# Dimensions: (x: 4)
# Coordinates:
# y (x) object 32B ('a', 0) ('a', 1) ('b', 0) ('b', 1)
# * x (x) int64 32B 1 2 3 4
# Data variables:
# *empty*
ds2.swap_dims(x="y")
# (in main branch)
# <xarray.Dataset> Size: 128B
# Dimensions: (y: 4)
# Coordinates:
# * y (y) object 32B MultiIndex
# * foo (y) object 32B 'a' 'a' 'b' 'b'
# * bar (y) int64 32B 0 1 0 1
# x (y) int64 32B 1 2 3 4
# Data variables:
# *empty*
This feels too magical to me, and makes the internal logic unnecessarily complicated. I'm not sure if there is an easy way to support that with the refactoring in this PR.
I'd be in favor of treating any multi-index passed directly as a single coordinate with tuples (#8140), i.e.,
ds2.swap_dims(x="y")
# Dimensions: (x: 4)
# Coordinates:
# * y (y) object 32B ('a', 0) ('a', 1) ('b', 0) ('b', 1)
# x (y) int64 32B 1 2 3 4
# Data variables:
# *empty*
with a warning raised.
On main i see no warning:
which looks like
or alternatively we can easily get all coordinates ("x", "foo", "bar") and drop the index.
I think this behaviour is good ^: we should preserve all variables, just destroy any fancy structure
swap_dims supports auto-detecting / auto-promoting pd.MultiIndex objects:
I support deprecating and deleting this behaviour
However, like the Dataset constructor and a few other methods,
swap_dimssupports auto-detecting / auto-promotingpd.MultiIndexobjects:ds2 = xr.Dataset(coords={"y": ("x", midx), "x": [1, 2, 3, 4]}) ds2 # <xarray.Dataset> Size: 64B # Dimensions: (x: 4) # Coordinates: # y (x) object 32B ('a', 0) ('a', 1) ('b', 0) ('b', 1) # * x (x) int64 32B 1 2 3 4 # Data variables: # *empty* ds2.swap_dims(x="y") # (in main branch) # <xarray.Dataset> Size: 128B # Dimensions: (y: 4) # Coordinates: # * y (y) object 32B MultiIndex # * foo (y) object 32B 'a' 'a' 'b' 'b' # * bar (y) int64 32B 0 1 0 1 # x (y) int64 32B 1 2 3 4 # Data variables: # *empty*This feels too magical to me, and makes the internal logic unnecessarily complicated. I'm not sure if there is an easy way to support that with the refactoring in this PR.
I've been a fan of MultiIndexes and simple pandas compatibility, but definitely agree that the "a 1D array of tuples magically becomes a MultiIndex" behavior would be fine / good to remove...