xarray
Coordinate promotion workaround broken
What happened?
Ok so this one is a bit weird. I'm not sure this is a bug, but code that worked before doesn't anymore, so it is some sort of regression.
I have a dataset with one dimension and one coordinate along that dimension, but they have different names. I want to transform this so that the coordinate name becomes the dimension name, making it a proper dimension coordinate (I don't know what to call it). After renaming the dim to the coord's name, everything looks good in the repr, but the coord is still missing an index for that dimension (crd.indexes is empty, see MCVE). There used to be a workaround for this through reset_coords, but it doesn't work anymore.
Instead, the last line of the MCVE downgrades the variable: the final lon no longer has any coords.
What did you expect to happen?
In the MCVE below, I show what the old "workaround" was. I expected lon.indexes to contain the index for lon at the end of the procedure.
Minimal Complete Verifiable Example
import xarray as xr
# A dataset with a 1d variable along a dimension
ds = xr.Dataset({'lon': xr.DataArray([1, 2, 3], dims=('x',))})
# Promote to coord. This still is not a proper crd-dim (different name)
ds = ds.set_coords(['lon'])
# Rename dim:
ds = ds.rename(x='lon')
# Now do we have a proper coord-dim ? No. not yet because:
ds.indexes # is empty
# Workaround that was used up to the last release
lon = ds.lon.reset_coords(drop=True)
# Because of the missing indexes the next line fails on the master
lon - lon.diff('lon')
MVCE confirmation
- [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [X] Complete example — the example is self-contained, including all data and the text of any traceback.
- [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- [x] New issue — a search of GitHub Issues suggests this is not a duplicate.
Relevant log output
No response
Anything else we need to know?
My guess is that this line is causing reset_coords to drop the coordinate from itself: https://github.com/pydata/xarray/blob/c34ef8a60227720724e90aa11a6266c0026a812a/xarray/core/dataarray.py#L866
It would be nice if the renaming was sufficient for the indexes to appear.
My example is weird I know. The real use case is a script where we receive a 2d coordinate but where all lines are the same, so we take the first line and promote it to a proper coord-dim. But the current code fails on the master on the lon - lon.diff('lon') step that happens afterwards.
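For reference, the use case described above could be sketched roughly as follows, using swap_dims rather than rename. This is only an illustration under assumptions: the variable names (`data`, `y`, `x`), the shapes, and the values are hypothetical and do not come from the real script.

```python
import numpy as np
import xarray as xr

# Hypothetical input: a 2D "lon" coordinate on dims (y, x) whose rows are all identical.
lon2d = np.tile([10.0, 20.0, 30.0], (2, 1))
ds = xr.Dataset(
    {"data": (("y", "x"), np.zeros((2, 3)))},
    coords={"lon": (("y", "x"), lon2d)},
)

# Take the first row, which leaves a 1D array along x.
lon1d = ds["lon"].isel(y=0, drop=True)

# Replace the 2D coordinate with the 1D one, then swap the dimension name
# so that "lon" becomes an indexed dimension coordinate.
ds1 = (
    ds.drop_vars("lon")
    .assign_coords(lon=("x", lon1d.values))
    .swap_dims({"x": "lon"})
)

# With an index present, label-aligned operations such as diff work again.
step = ds1.lon.diff("lon")
```

Here `"lon" in ds1.indexes` should hold, which is what the later `lon - lon.diff('lon')` step relies on.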
Environment
INSTALLED VERSIONS
commit: None
python: 3.9.12 | packaged by conda-forge | (main, Mar 24 2022, 23:22:55) [GCC 10.3.0]
python-bits: 64
OS: Linux
OS-release: 5.13.19-2-MANJARO
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: fr_CA.UTF-8
LOCALE: ('fr_CA', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 2022.3.1.dev104+gc34ef8a6
pandas: 1.4.2
numpy: 1.22.2
scipy: 1.8.0
netCDF4: None
pydap: installed
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.5.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2022.02.1
distributed: 2022.2.1
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2022.3.0
cupy: None
pint: None
sparse: 0.13.0
setuptools: 59.8.0
pip: 22.0.3
conda: None
pytest: 7.0.1
IPython: 8.3.0
sphinx: None
This is a known issue, and one that we'd like to clean up (see #4825 for discussion). The short answer is that you should use swap_dims instead of rename:
ds.swap_dims({"x": "lon"})
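Applied to the MCVE from the report, that suggestion could look like the sketch below. It reuses the dataset from the issue; the names `swapped` and `result` are introduced here only for illustration.

```python
import xarray as xr

# Same starting point as the MCVE: a 1D variable along an unindexed dimension "x".
ds = xr.Dataset({"lon": xr.DataArray([1, 2, 3], dims=("x",))})
ds = ds.set_coords(["lon"])

# swap_dims replaces dimension "x" with the coordinate "lon",
# which makes "lon" an indexed dimension coordinate.
swapped = ds.swap_dims({"x": "lon"})

lon = swapped.lon

# The operation from the report: the two operands are aligned on the "lon" index.
result = lon - lon.diff("lon")
```

With this approach no reset_coords workaround should be needed, since the index is created by swap_dims itself.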
@shoyer This was the regression I ran in to. We could raise an error asking the user to switch to swap_dims.
x is unindexed while lon is a coordinate variable. Then
ds = ds.rename(x='lon')
makes lon a dimension coordinate (though there is no entry in ._indexes)
We could raise an error asking the user to switch to swap_dims.
This seems like a good idea
In the long term, we'd like to decouple indexes from coordinates, and make something like the following work:
ds.set_coords(['lon']).rename(x='lon').set_index('lon')
We could raise an error asking the user to switch to swap_dims.
Shouldn't we raise a warning instead? There may be relevant use cases like the example above (at least in the long term) where an index is not really needed?