Dataset.from_dataframe: deprecate expanding the multi-index
What is your issue?
Let's continue the discussion about changing the behavior of Dataset.from_dataframe here (see https://github.com/pydata/xarray/pull/8140#issuecomment-1712485626).
The current behaviour of Dataset.from_dataframe, which always unstacks, feels wrong to me. It seems sensible for Dataset.from_dataframe(df) to automatically create a Dataset with a PandasMultiIndex if df has a MultiIndex. The user can then use that index as is, or quite easily unstack to a dense or sparse array.
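To make the current behaviour concrete, here is a minimal sketch using only the existing Dataset.from_dataframe call. The DataFrame and its values are made up for illustration; the point is that a two-row MultiIndex is expanded into the full x/y product, with the missing combinations filled with NaN.

```python
import pandas as pd
import xarray as xr

# Illustrative data: only two of the four possible (x, y) combinations exist.
index = pd.MultiIndex.from_tuples([("a", 1), ("b", 2)], names=["x", "y"])
df = pd.DataFrame({"foo": [1.0, 2.0]}, index=index)

# The MultiIndex levels become separate dimensions "x" and "y".
ds = xr.Dataset.from_dataframe(df)
print(dict(ds.sizes))

# Combinations absent from the DataFrame show up as NaN in the expanded array.
n_missing = int(ds["foo"].isnull().sum())
print(n_missing)
```

With a sparse multi-index, the expanded array can therefore be much larger than the original table, which is part of why keeping the multi-index by default is attractive.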
If we no longer unstack the multi-index in Dataset.from_dataframe, are we OK with the "Dataset -> DataFrame -> Dataset" round-trip no longer yielding the original Dataset unless we unstack explicitly?
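import xarray as xr

ds = xr.Dataset(
    {"foo": (("x", "y"), [[1, 2], [3, 4]])},
    coords={"x": ["a", "b"], "y": [1, 2]},
)
df = ds.to_dataframe()  # indexed by a MultiIndex with levels "x" and "y"

# `dim` is the argument proposed in the linked discussion;
# it names the single stacked dimension that carries the multi-index.
ds2 = xr.Dataset.from_dataframe(df, dim="z")
ds2.identical(ds)  # False
ds2.unstack("z").identical(ds)  # True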
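For anyone who wants to try the invariant without the proposed dim argument, here is a rough sketch that relies only on existing methods. It assumes ds.stack(z=("x", "y")) is a reasonable stand-in for what a non-unstacking from_dataframe would return, which is my approximation rather than the actual implementation. Integer labels are used for "x" so that the comparison does not depend on string dtype handling.

```python
import xarray as xr

ds = xr.Dataset(
    {"foo": (("x", "y"), [[1, 2], [3, 4]])},
    coords={"x": [10, 20], "y": [1, 2]},
)

# Current behaviour: from_dataframe expands the MultiIndex, so the round-trip holds.
roundtrip = xr.Dataset.from_dataframe(ds.to_dataframe())
print(roundtrip.identical(ds))  # True

# Stand-in for the proposed result: one stacked dimension "z" with a multi-index.
stacked = ds.stack(z=("x", "y"))
print(stacked.identical(ds))  # False

# An explicit unstack recovers the original Dataset.
print(stacked.unstack("z").identical(ds))  # True
```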
cc @max-sixty @dcherian
That's a good point, and these invariants are indeed nice to uphold.
Is there a branch with the dim= code on? Or it's just a mental model atm? (I wrote a message but not sure it's correct so removed it, will rewrite with either the code or more thought!)
Sorry I wasn't very clear in that thread.
I think we should avoid the dim argument for this reason.
We could just use "dim_X" if Index.name is None, and have the user manually rename to a name they like.
> Is there a branch with the dim= code on?
See #8170
Without any magical ideas for maintaining the from_dataframe / to_dataframe round-trip, I would be +1 on deprecating unstacking / expanding the multi-index, to the extent that it helps us finish off the index refactor and fix bugs such as https://github.com/pydata/xarray/issues/8646.
(personally I don't even use from_dataframe, I just do xr.Dataset(df), which doesn't unstack... So this would also have the advantage of unifying that behavior...)