Duplicate values on unstack

Open znichollscr opened this issue 2 years ago • 3 comments

What happened?

I unstacked a dataset and got values I didn't expect. It turns out that, when unstacking, my dataset had multiple values for the same index. This is clearly a case of user error, but it silently passed.

What did you expect to happen?

A warning or error would be raised to say, "this isn't going to work".

Minimal Complete Verifiable Example

import datetime as dt
import xarray as xr


ds = xr.DataArray(
    [[1, 2, 3], [4, 5, 6]],
    dims=("lat", "time"),
    coords={"lat": [-60, 60], "time": [dt.datetime(2010, 1, d) for d in range(1, 4)]},
    name="test",
).to_dataset()

ds = (
    ds.assign_coords(
        {
            "month": ds["time"].dt.month,
            "year": ds["time"].dt.year,
        }
    )
    .set_index(time=["month", "year"])
)
ds = ds.unstack("time")

# the output only has 2 values, which isn't what I expected
ds["test"].data

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

It's not clear to me where the error is. It might just be that this particular order of operations leads to a case that isn't otherwise caught. Looking at intermediate output, I thought the error was in unstack but maybe it's more complex than that...
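
One way to look at the intermediate state is to build the same index by hand. This is a minimal sketch, assuming the three daily timestamps in the example all map to month 1 and year 2010; the variable name `index` is only illustrative.

```python
import pandas as pd

# Jan 1-3, 2010 all share the same (month, year) pair.
index = pd.MultiIndex.from_arrays(
    [[1, 1, 1], [2010, 2010, 2010]], names=["month", "year"]
)

print(index.is_unique)  # False
print(index.duplicated(keep=False).tolist())  # [True, True, True]
print(len(index.unique()))  # 1
```

With only one distinct (month, year) pair, unstacking leaves a single slot per lat, which is consistent with the two values seen in the output.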

Environment

INSTALLED VERSIONS

commit: e678a1d7884a3c24dba22d41b2eef5d7fe5258e7
python: 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:14) [Clang 12.0.1 ]
python-bits: 64
OS: Darwin
OS-release: 21.5.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8
LOCALE: ('en_AU', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1

xarray: 0.1.dev4312+ge678a1d.d20220928
pandas: 1.5.0
numpy: 1.22.4
scipy: 1.9.1
netCDF4: 1.6.1
pydap: installed
h5netcdf: 1.0.2
h5py: 3.7.0
Nio: None
zarr: 2.13.2
cftime: 1.6.2
nc_time_axis: 1.4.1
PseudoNetCDF: 3.2.2
rasterio: 1.3.1
cfgrib: 0.9.10.1
iris: 3.3.0
bottleneck: 1.3.5
dask: 2022.9.1
distributed: 2022.9.1
matplotlib: 3.6.0
cartopy: 0.21.0
seaborn: 0.12.0
numbagg: 0.2.1
fsspec: 2022.8.2
cupy: None
pint: 0.19.2
sparse: 0.13.0
flox: 0.5.9
numpy_groupies: 0.9.19
setuptools: 65.4.0
pip: 22.2.2
conda: None
pytest: 7.1.3
IPython: 8.5.0
sphinx: None

znichollscr, Sep 29 '22 04:09

Thanks for the report @znichollscr.

Maybe we should check pandas.MultiIndex.is_unique in Dataset.unstack() like in Dataset.from_dataframe()?

df = ds.drop_vars("lat").to_dataframe()

xr.Dataset.from_dataframe(df)
# ValueError: cannot convert a DataFrame with a non-unique MultiIndex into xarray

benbovy, Sep 29 '22 09:09

> Maybe we should check pandas.MultiIndex.is_unique in Dataset.unstack()

Better to check this in PandasMultiIndex.unstack() actually.
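
On versions without such a check, a user-side guard can be sketched along these lines. The helper name `unstack_checked` is hypothetical and not part of xarray; it assumes the dimension is backed by a pandas index that can be reached through `Dataset.indexes`.

```python
import datetime as dt

import xarray as xr


def unstack_checked(ds, dim):
    """Hypothetical helper: refuse to unstack a dimension with a non-unique index."""
    index = ds.indexes[dim]  # pandas Index / MultiIndex backing `dim`
    if not index.is_unique:
        duplicates = index[index.duplicated()].unique()
        raise ValueError(
            f"cannot unstack {dim!r}: index has duplicate entries {list(duplicates)}"
        )
    return ds.unstack(dim)


# Same data as in the report: three days that share one (month, year) pair.
ds = xr.DataArray(
    [[1, 2, 3], [4, 5, 6]],
    dims=("lat", "time"),
    coords={"lat": [-60, 60], "time": [dt.datetime(2010, 1, d) for d in range(1, 4)]},
    name="test",
).to_dataset()
stacked = ds.assign_coords(
    month=ds["time"].dt.month, year=ds["time"].dt.year
).set_index(time=["month", "year"])

try:
    unstack_checked(stacked, "time")
    raised = False
except ValueError as err:
    raised = True
    print(err)
```

The guard fails loudly instead of letting duplicate entries collapse silently, which is the behaviour the report asks for.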

benbovy, Sep 29 '22 09:09

Ok great thanks, solutions sound good

znicholls, Sep 30 '22 02:09

I just stumbled over this and opened #8737. Happy to get a review.

mathause, Feb 12 '24 15:02