Dataset.from_dataframe: deprecate expanding the multi-index

Open benbovy opened this issue 2 years ago • 4 comments

What is your issue?

Let's continue the discussion here about changing the behavior of Dataset.from_dataframe (see https://github.com/pydata/xarray/pull/8140#issuecomment-1712485626).

The current behaviour of Dataset.from_dataframe where it always unstacks feels wrong to me. To me, it seems sensible that Dataset.from_dataframe(df) automatically creates a Dataset with PandasMultiIndex if df has a MultiIndex. The user can then use that or quite easily unstack to a dense or sparse array.
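For reference, here is a minimal sketch of the unstacking behaviour being discussed, using only the existing `from_dataframe` signature. The DataFrame and its values are made up for illustration. The multi-index covers three of the four possible (x, y) pairs, so expanding it to a dense grid introduces one missing value.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Illustrative data: a MultiIndex covering only 3 of the 4 possible (x, y) pairs.
index = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 1)], names=["x", "y"]
)
df_sparse = pd.DataFrame({"foo": [1.0, 2.0, 3.0]}, index=index)

# Existing behaviour: each index level becomes its own dimension, and the
# data is expanded to the full x-by-y grid.
dense = xr.Dataset.from_dataframe(df_sparse)

print(dict(dense.sizes))      # one entry per index level
print(dense["foo"].values)    # the absent ("b", 2) pair shows up as NaN
```

The three rows of the DataFrame become a 2 x 2 array, which is the expansion the issue proposes to deprecate.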

If we no longer unstack the multi-index in Dataset.from_dataframe, are we OK with the "Dataset -> DataFrame -> Dataset" round-trip no longer yielding the original Dataset unless we unstack explicitly?

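import xarray as xr

ds = xr.Dataset(
    {"foo": (("x", "y"), [[1, 2], [3, 4]])},
    coords={"x": ["a", "b"], "y": [1, 2]},
)

df = ds.to_dataframe()  # rows are indexed by a pandas MultiIndex with levels (x, y)

# `dim` is a proposed argument (see #8170), not part of the existing signature:
# it names the single dimension that would carry the multi-index.
ds2 = xr.Dataset.from_dataframe(df, dim="z")

ds2.identical(ds)  # False

ds2.unstack("z").identical(ds)  # True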
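Since `dim=` is only a proposal, the same comparison can be approximated with the existing API. This is a sketch under that assumption: `Dataset.stack` stands in for what a non-unstacking `from_dataframe` would return, and the names `roundtrip`, `stacked` and `restored` are illustrative.

```python
import xarray as xr

ds = xr.Dataset(
    {"foo": (("x", "y"), [[1, 2], [3, 4]])},
    coords={"x": ["a", "b"], "y": [1, 2]},
)

# Existing behaviour: from_dataframe expands the (x, y) MultiIndex,
# so the round-trip recovers the two original dimensions.
df = ds.to_dataframe()
roundtrip = xr.Dataset.from_dataframe(df)

# Stand-in for the proposed behaviour: a single dimension "z" backed by
# a multi-index, which needs an explicit unstack to get x and y back.
stacked = ds.stack(z=("x", "y"))
restored = stacked.unstack("z")

print(dict(roundtrip.sizes))
print(dict(stacked.sizes))
print(dict(restored.sizes))
```

The explicit `unstack("z")` is the extra step the round-trip would require if the automatic expansion were removed.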

cc @max-sixty @dcherian

benbovy, Sep 10 '23 15:09

That's a good point, and these invariants are indeed nice to uphold.

Is there a branch with the dim= code on? Or is it just a mental model atm? (I wrote a message but wasn't sure it was correct, so I removed it; I'll rewrite it with either the code or more thought!)

max-sixty, Sep 10 '23 18:09

Sorry I wasn't very clear in that thread.

I think we should avoid the dim argument for this reason.

We could just use "dim_X" if Index.name is None, and have the user manually rename to a name they like.
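The "rename afterwards" workflow can be sketched with the existing API. Note the assumptions: the "dim_X" fallback is the proposal here, whereas the existing `from_dataframe` uses the name "index" for an unnamed flat index, and "sample" is an arbitrary name chosen for illustration.

```python
import pandas as pd
import xarray as xr

# A DataFrame with a default RangeIndex, so Index.name is None.
df_unnamed = pd.DataFrame({"foo": [1, 2, 3]})

# The existing fallback name for an unnamed flat index is "index".
ds_default = xr.Dataset.from_dataframe(df_unnamed)

# The user renames the automatically chosen dimension to something meaningful.
ds_renamed = ds_default.rename({"index": "sample"})

print(dict(ds_default.sizes))
print(dict(ds_renamed.sizes))
```

With this approach no `dim` argument is needed: the conversion picks a predictable default name and the caller changes it in one extra call.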

dcherian, Sep 11 '23 03:09

> Is there a branch with the dim= code on?

See #8170

benbovy, Sep 11 '23 06:09

Without any magical ideas for maintaining the from_dataframe / to_dataframe round-trip, I would be +1 on deprecating unstacking / expanding the multi-index, to the extent that it helps us finish off the index refactor and fix bugs such as https://github.com/pydata/xarray/issues/8646.

(personally I don't even use from_dataframe, I just do xr.Dataset(df), which doesn't unstack... So this would also have the advantage of unifying that behavior...)

max-sixty, Oct 19 '24 18:10