xarray
xarray copied to clipboard
`rename_vars` followed by `swap_dims` and `merge` causes swapped dim to reappear
What happened?
I wanted to rename a dimension coordinate for two datasets before merging: ds = ds.rename_vars(y="z").swap_dims(y="z")
, and the same for the second data set. After merging the datasets, the merged result has the dimension "y" in addition to "z".
Swapping the order of rename_vars
and swap_dims
before merging works in that "y" does not reappear, but then "z" is listed as a non-dimension coordinate.
Doing rename_vars
followed by swap_dims
/after/ merging gives the result I wanted, but if I merge again, the same issue occurs.
My current solution is to only rename dimension coordinates before saving to netCDF.
What did you expect to happen?
Merging two datasets with the same coordinates and dimensions (but different data variables) should result in a single dataset with all of the data variables from the two datasets and exactly the same coordinates and dimensions.
Minimal Complete Verifiable Example
import numpy as np
import xarray as xr
from xarray.core.utils import Frozen
A = np.arange(4).reshape((2, 2))
B = np.arange(4).reshape((2, 2)) + 4
ds1 = xr.Dataset({"A": (["x", "y"], A), "B": (["x", "y"], B)}, coords={"x": ("x", [1, 2]), "y": ("y", [1, 2])})
ds2 = xr.Dataset({"C": (["x", "y"], A), "D": (["x", "y"], B)}, coords={"x": ("x", [1, 2]), "y": ("y", [1, 2])})
assert ds1.dims == Frozen({"x": 2, "y": 2})
assert ds2.dims == Frozen({"x": 2, "y": 2})
ds1_swap = ds1.rename_vars(y="z").swap_dims(y="z")
ds2_swap = ds2.rename_vars(y="z").swap_dims(y="z")
assert ds1_swap.dims == Frozen({"x": 2, "z": 2})
assert ds2_swap.dims == Frozen({"x": 2, "z": 2})
# merging makes the dimension "y" reappear (I would expect this assertion to fail):
assert xr.merge([ds1_swap, ds2_swap]).dims == Frozen({"x": 2, "z": 2, "y": 2})
# renaming and swapping after the merge causes issues later:
ds12 = xr.merge([ds1, ds2]).rename_vars(y="z").swap_dims(y="z")
ds3 = xr.Dataset({"E": (["x", "z"], A), "F": (["x", "z"], B)}, coords={"x": ("x", [1, 2]), "z": ("z", [1, 2])})
# ds12 and ds3 have the same dimensions:
assert ds12.dims == Frozen({"x": 2, "z": 2})
assert ds3.dims == Frozen({"x": 2, "z": 2})
# but merging brings back "y"
ds123 = xr.merge([ds12, ds3])
assert ds123.dims == Frozen({"x": 2, "z": 2, "y": 2})
# as do other operations:
ds12_as = ds12.assign_coords(x=(ds12.x + 1))
assert ds12_as.sizes == Frozen({"x": 2, "z": 2, "y": 2})
MVCE confirmation
- [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [X] Complete example — the example is self-contained, including all data and the text of any traceback.
- [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
- [x] Recent environment — the issue occurs with the latest version of xarray and its dependencies.
Relevant log output
No response
Anything else we need to know?
No response
Environment
The MVCE works in all venvs I've tried including:
INSTALLED VERSIONS
commit: None python: 3.10.13 (main, Nov 10 2023, 15:02:19) [GCC 11.4.0] python-bits: 64 OS: Linux OS-release: 6.5.0-14-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_GB.UTF-8 LOCALE: ('en_GB', 'UTF-8') libhdf5: 1.12.2 libnetcdf: 4.9.3-development
xarray: 2023.11.0 pandas: 1.5.3 numpy: 1.26.2 scipy: 1.11.4 netCDF4: 1.6.5 pydap: None h5netcdf: 1.3.0 h5py: 3.10.0 Nio: None zarr: None cftime: 1.6.3 nc_time_axis: 1.4.1 iris: None bottleneck: None dask: 2023.12.0 distributed: None matplotlib: 3.8.2 cartopy: 0.22.0 seaborn: 0.13.0 numbagg: None fsspec: 2023.12.1 cupy: None pint: None sparse: 0.15.1 flox: None numpy_groupies: None setuptools: 69.0.2 pip: 23.3.1 conda: None pytest: 7.4.3 mypy: None IPython: 8.18.1 sphinx: None /home/brendan/Documents/inversions/.pymc_venv/lib/python3.10/site-packages/_distutils_hack/init.py:33: UserWarning: Setuptools is replacing distutils. warnings.warn("Setuptools is replacing distutils.")
INSTALLED VERSIONS
commit: None python: 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0] python-bits: 64 OS: Linux OS-release: 3.10.0-1160.81.1.el7.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_GB.UTF-8 LOCALE: ('en_GB', 'UTF-8') libhdf5: None libnetcdf: None
xarray: 2024.1.0 pandas: 2.2.0 numpy: 1.26.3 scipy: None netCDF4: None pydap: None h5netcdf: None h5py: None Nio: None zarr: None cftime: None nc_time_axis: None iris: None bottleneck: None dask: None distributed: None matplotlib: None cartopy: None seaborn: None numbagg: None fsspec: None cupy: None pint: None sparse: None flox: None numpy_groupies: None setuptools: 69.0.3 pip: 23.3.2 conda: None pytest: None mypy: None IPython: 8.18.1 sphinx: None
Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!
Thanks for the excellent issue @brendan-m-murphy .
This is really surprising. If we look at these two objects, there's no y
to be found:
[ins] In [3]: ds1_swap
Out[3]:
<xarray.Dataset>
Dimensions: (x: 2, z: 2)
Coordinates:
* x (x) int64 1 2
* z (z) int64 1 2
Data variables:
A (x, z) int64 0 1 2 3
B (x, z) int64 4 5 6 7
[ins] In [4]: ds2_swap
Out[4]:
<xarray.Dataset>
Dimensions: (x: 2, z: 2)
Coordinates:
* x (x) int64 1 2
* z (z) int64 1 2
Data variables:
C (x, z) int64 0 1 2 3
D (x, z) int64 4 5 6 7
But then if we merge them, a y
dimension appears:
[ins] In [2]: xr.merge([ds1_swap, ds2_swap])
Out[2]:
<xarray.Dataset>
Dimensions: (x: 2, z: 2, y: 2) # where does this come from?
Coordinates:
* x (x) int64 1 2
* z (y) int64 1 2
Dimensions without coordinates: y
Data variables:
A (x, z) int64 0 1 2 3
B (x, z) int64 4 5 6 7
C (x, z) int64 0 1 2 3
D (x, z) int64 4 5 6 7
Where does the y
come from? Even if I look through the entries of ds1_swap.__slots__
, I can't find it (though I haven't looked exhaustively)
I think we should be quite worried about this — one of the wonderful things about xarray is that it's not doing surprising things — the repr of the object is effectively a full representation of it. But this violates that — somewhere there's a y
lurking! :)
IIRC @benbovy has helped fix similar things in the past, hope it's OK to tag you in case you have any ideas.
It might be worth looking at the ._indexes
property of the datasets obtained after rename_vars
and/or swap_dims
. If it returns a dict with a "y" item that could be the culprit (it shouldn't be there) causing merge
to recreate a coordinate from it.
I suspect a bug in swap_dims
in this (very?) specific case where for the old dimension ("y") there is an indexed, non-dimension coordinate that has the same name than the new dimension ("z"). Not sure, though. I'll investigate more later.
It might be worth looking at the
._indexes
property of the datasets obtained afterrename_vars
and/orswap_dims
.
Thanks for the response + the idea. It's indeed on the dim
:
[nav] In [32]: ds2_swap._indexes['z'].dim
Out[32]: 'y'
Should the z
& y
there ever be different? Unless we can define them in a single place, should we have a runtime check that they don't differ?
Good catch. A possible fix would be
class Dataset(...):
def swap_dims(self, ...):
...
for current_name, current_variable in self.variables.items():
...
if current_name in result_dims:
...
if current_name in self._indexes:
indexes[current_name] = self._indexes[current_name]
indexes[current_name].dim = dims_dict[current_name] # update the index dim here
variables[current_name] = var
else:
...
else:
...
However, this fix works only in the case of a single PandasIndex
. I suspect that swap_dims might also not work correctly for other edge cases involving more advanced (custom) indexes, so it might be a bit trickier to address properly.
Maybe it's time to deprecate swap_dims
? It has been suggested as TODO for a while... There are now working alternatives that are easier to understand (e.g., using non-dimension indexed coordinate and/or using rename_dims
or rename_vars
that preserve the index).
Is swap_dims
just a composition of rename_dims
and the new-ish set_xindex
and drop_index
, or does it do something else on top? If not, I agree we should deprecate it, or at least its current implementation.
To my knowledge it doesn't allow doing more than what we can do now with those other methods, it only adds some confusion :).
if these two:
ds = xr.Dataset(coords={"x": ("x", [0, 1]), "y": ("x", [-1, 1])})
ds.swap_dims({"x": "y"})
ds.rename_dims({"x": "y"}).set_xindex("y").drop_indexes(["x"])
do the same thing, could we just make the former use the latter as implementation (obviously adapted to allow renaming multiple dimensions at the same time)? I don't think we'd even need a deprecation cycle for that.
Ah yes probably, although not sure that rename_dims
should be always chained with set_xindex
and/or drop_indexes
depending on the cases. I'd be in favor to just drop swap_dims
(after deprecating it).
If you're not chaining set_xindex
and drop_indexes
then it would be better to do rename_dims
directly, no?
In any case, I'm just saying that if we are to drop swap_dims
we should be able to point to something that does exactly the same thing. set_xindex
is a bit limited in that it only allow creating an index for one coordinate per call, so as far as I can tell we'd need to point to something using assign_coords
and xr.Coordinates
. However, with that my worry is that that would be a lot more to write.
In my case, I'm using swap_dims
instead of rename_dims
because of the rename_vars
call before (coord y
to z
precludes renaming dim y
to z
). Also, I think the "reappearing dimension" problem doesn't happen if I only call swap_dims
, e.g. if there is a coordinate lat
with dimension latitude
and I swap the dimension latitude
to lat
, this seems to work fine.
Anyway, it doesn't seem like you'd be able to replace swap_dims
with rename_dims
plus set_xindex
and drop_indexes
in this case.
interestingly, rename_dims
→ rename_vars
works, but not the other way around (it suggests to use swap_dims
instead). This, again, is a relic of how xarray
used to work in earlier versions, and we should probably remove that error.
Anecdotally, I've seen swap_dims
used a bunch so I support reimplementing it in terms of more atomic operations
interestingly,
rename_dims
→rename_vars
works, but not the other way around (it suggests to useswap_dims
instead). This, again, is a relic of howxarray
used to work in earlier versions, and we should probably remove that error.
Ah okay, I probably tried rename_vars
then rename_dims
and followed the suggestion to use swap_dims
.
As you say, for
ds = xr.Dataset(coords={"x": ("x", [0, 1])})
ds1 = ds.rename_dims({"x": "y"}).rename_vars({"x": "y"})
ds2 = ds.rename_vars({"x": "y"}).swap_dims({"x": "y"})
I get the expected result when I merge ds
and ds1
In [11]: xr.merge([ds, ds1])
Out[11]:
<xarray.Dataset>
Dimensions: (x: 2, y: 2)
Coordinates:
* x (x) int64 0 1
* y (y) int64 0 1
Data variables:
*empty*
but the dimension of y
is reverted to x
if I merge ds
and ds2
:
In [10]: xr.merge([ds, ds2])
Out[10]:
<xarray.Dataset>
Dimensions: (x: 2)
Coordinates:
* x (x) int64 0 1
* y (x) int64 0 1
Data variables:
*empty*
(Also I just realised ds.rename({"x": "y"})
would work as well. I'm not sure why I didn't try that.)
I might miss the high-level picture here.
I thought that swap_dims
was mostly used as a workaround to (temporarily?) perform label-based operations (selection, alignment, etc) using another existing coordinate than the dimension coordinate.
This workaround is not needed anymore since we support indexes for non-dimension coordinates. And for renaming a dimension coordinate there is indeed .rename()
that is more explicit.
@dcherian did you see swap_dims l
used for other applications?
I've tried re-implementing swap_dims
it in terms of more atomic operations in #8911, but unfortunately it cannot handle the pandas multi-index special case. I think this cannot be easily done until we fully drop the implicit creation of coordinates from a pd.MultiIndex
(#8140).
Regarding this issue, the other possible fix in https://github.com/pydata/xarray/issues/8646#issuecomment-1910293215 is limited to pandas single indexes.
As an alternative to those fixes and deprecating swap_dims
, I wonder if we shouldn't limit it to the case of dimension coordinates, i.e.,
ds = xr.Dataset(coords={"x": ("x", [0, 1])})
renamed = ds.rename_vars({"x": "y"})
renamed.swap_dims({"x": "y"})
# ValueError: swap_dims only works with dimension coordinates but found coordinate "y" with dimension "x"
This will not fix the issue here but at least it will provide a clear error message instead of weird issues (old dimension appearing from nowhere) later on. Workarounds exist like .rename()
and we should also allow chaining rename_vars
-> rename_dims
(implemented in #8911).