xarray icon indicating copy to clipboard operation
xarray copied to clipboard

Refactor swap dims

Open benbovy opened this issue 1 year ago • 5 comments
trafficstars

  • [ ] Attempt at fixing #8646
  • [ ] Tests added
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [ ] New functions/methods are listed in api.rst

I've tried here re-implementing swap_dims using rename_dims, drop_indexes and set_xindex. This fixes the example in #8646 but unfortunately this fails at handling the pandas multi-index special case (i.e., a single non-dimension coordinate wrapping a pd.MultiIndex that is promoted to a dimension coordinate in swap-dims auto-magically results in a PandasMultiIndex with both dimension and level coordinates).

benbovy avatar Apr 05 '24 08:04 benbovy

the pandas multi-index special case

I've been starting to think of multiindex as a model for a user defined custom index wrapping multiple variables.

Would we want to recreate a user's custom index with this swap_dims op? Probably not right?

dcherian avatar Apr 05 '24 15:04 dcherian

Does this fix https://github.com/pydata/xarray/issues/8914 (found by hypothesis)

dcherian avatar Apr 05 '24 15:04 dcherian

I've been starting to think of multiindex as a model for a user defined custom index wrapping multiple variables.

Yes this is a good model. That's what causes a lot of trouble since the index refactor as we transitioned from a dimension coordinate - with virtual level coordinates - to the model that you describe here.

Would we want to recreate a user's custom index with this swap_dims op? Probably not right?

No I don't think we want this. In both main and this PR no multi-index is recreated:

import pandas as pd
import xarray as xr

midx = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=["foo", "bar"])
coords = xr.Coordinates.from_pandas_multiindex(midx, "x")
coords["y"] = ("x", [1, 2, 3, 4])
ds = xr.Dataset(coords=coords)

ds.swap_dims(x="y")
# main: multi-index "x", "foo", "bar" is dropped (although the repr still shows "MultiIndex" for "x" values)
# this PR: ValueError, cannot remove index "x" which would corrupt index built from coordinates "x", "foo", "bar"

The error raised in this PR might make sense, or alternatively we can easily get all coordinates ("x", "foo", "bar") and drop the index.

However, like the Dataset constructor and a few other methods, swap_dims supports auto-detecting / auto-promoting pd.MultiIndex objects:

ds2 = xr.Dataset(coords={"y": ("x", midx), "x": [1, 2, 3, 4]})

ds2
# <xarray.Dataset> Size: 64B
# Dimensions:  (x: 4)
# Coordinates:
#     y        (x) object 32B ('a', 0) ('a', 1) ('b', 0) ('b', 1)
#   * x        (x) int64 32B 1 2 3 4
# Data variables:
#     *empty*


ds2.swap_dims(x="y")
# (in main branch)
# <xarray.Dataset> Size: 128B
# Dimensions:  (y: 4)
# Coordinates:
#   * y        (y) object 32B MultiIndex
#   * foo      (y) object 32B 'a' 'a' 'b' 'b'
#   * bar      (y) int64 32B 0 1 0 1
#     x        (y) int64 32B 1 2 3 4
# Data variables:
#    *empty*

This feels too magical to me, and makes the internal logic unnecessarily complicated. I'm not sure if there is an easy way to support that with the refactoring in this PR.

I'd be in favor of treating any multi-index passed directly as a single coordinate with tuples (#8140), i.e.,

ds2.swap_dims(x="y")
# Dimensions:  (x: 4)
# Coordinates:
#   * y        (y) object 32B ('a', 0) ('a', 1) ('b', 0) ('b', 1)
#     x        (y) int64 32B 1 2 3 4
# Data variables:
#     *empty*

with a warning raised.

benbovy avatar Apr 05 '24 16:04 benbovy

On main i see no warning: image

which looks like

or alternatively we can easily get all coordinates ("x", "foo", "bar") and drop the index.

I think this behaviour is good ^: we should preserve all variables, just destroy any fancy structure

swap_dims supports auto-detecting / auto-promoting pd.MultiIndex objects:

I support deprecating and deleting this behaviour

dcherian avatar Apr 05 '24 16:04 dcherian

However, like the Dataset constructor and a few other methods, swap_dims supports auto-detecting / auto-promoting pd.MultiIndex objects:

ds2 = xr.Dataset(coords={"y": ("x", midx), "x": [1, 2, 3, 4]})

ds2
# <xarray.Dataset> Size: 64B
# Dimensions:  (x: 4)
# Coordinates:
#     y        (x) object 32B ('a', 0) ('a', 1) ('b', 0) ('b', 1)
#   * x        (x) int64 32B 1 2 3 4
# Data variables:
#     *empty*


ds2.swap_dims(x="y")
# (in main branch)
# <xarray.Dataset> Size: 128B
# Dimensions:  (y: 4)
# Coordinates:
#   * y        (y) object 32B MultiIndex
#   * foo      (y) object 32B 'a' 'a' 'b' 'b'
#   * bar      (y) int64 32B 0 1 0 1
#     x        (y) int64 32B 1 2 3 4
# Data variables:
#    *empty*

This feels too magical to me, and makes the internal logic unnecessarily complicated. I'm not sure if there is an easy way to support that with the refactoring in this PR.

I've been a fan of MultiIndexes and simple pandas compatibility, but definitely agree that the "a 1D array of tuples magically becomes a MultiIndex" behavior would be fine / good to remove...

max-sixty avatar Apr 17 '24 16:04 max-sixty