
Cannot export dataset with categorical index in 2025.4.0

Open mancellin opened this issue 6 months ago • 4 comments

What happened?

In 2025.4.0 and on the current master, exporting to netCDF a dataset created from a DataFrame with a categorical index raises the error:

TypeError: Cannot interpret 'CategoricalDtype(categories=['C1', 'C2'], ordered=True, categories_dtype=object)' as a data type

What did you expect to happen?

In 2025.3.1 and before, it was possible to export such a dataset (although the categorical index might be lost in the process).

Minimal Complete Verifiable Example

```python
import pandas as pd
import xarray as xr

df = pd.DataFrame([{"ind": "C1", "val": 1.0}, {"ind": "C2", "val": 2.0}]).set_index("ind")
df.index = df.index.astype(pd.CategoricalDtype(categories=["C1", "C2"], ordered=True))
ds = df.to_xarray()

ds.to_netcdf("foo.nc")
```

MVCE confirmation

  • [x] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [x] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [x] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [x] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [x] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output


Anything else we need to know?

Might be related to #10301.

Arguably, the new behavior is better than silently converting to another type. But then the 2025.4.0 changelog could use a bit more guidance on how to update code for this new behavior.

(Cross-ref: https://github.com/capytaine/capytaine/issues/683)

Environment

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.10.12 (main, Feb 4 2025, 14:57:36) [GCC 11.4.0]
python-bits: 64
OS: Linux
OS-release: 6.8.0-59-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: ('fr_FR', 'UTF-8')
libhdf5: 1.14.2
libnetcdf: 4.9.4-development

xarray: 2025.4.1.dev16+gc8affb3c
pandas: 2.2.3
numpy: 2.2.5
scipy: 1.15.2
netCDF4: 1.7.2
pydap: None
h5netcdf: None
h5py: 3.13.0
zarr: None
cftime: 1.6.4.post1
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.10.0
cartopy: None
seaborn: None
numbagg: None
fsspec: 2025.3.2
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: None
pip: 25.0.1
conda: None
pytest: 8.3.4
mypy: None
IPython: 8.32.0
sphinx: 8.1.3
```

mancellin avatar May 13 '25 08:05 mancellin

Yes, this hasn't been built yet. We could use either netCDF enums or the CF flag-variable convention for this. The latter generalizes across array formats, so it would be good to do that by default, I think.
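For reference, a minimal sketch of what a CF flag-variable encoding of a categorical could look like. This is not an existing xarray API; the `flag_values`/`flag_meanings` attribute names follow the CF conventions, and everything else here is a hypothetical illustration:

```python
import numpy as np
import pandas as pd
import xarray as xr

cat = pd.Categorical(["C1", "C2"], categories=["C1", "C2"], ordered=True)

# Store the integer codes, and record the categories as CF flag attributes.
codes = xr.DataArray(
    cat.codes.astype("int8"),
    dims="ind",
    attrs={
        "flag_values": np.arange(len(cat.categories), dtype="int8"),
        "flag_meanings": " ".join(cat.categories),
    },
)

# Decoding reverses the mapping back to a pandas Categorical.
decoded = pd.Categorical.from_codes(
    codes.values, categories=codes.attrs["flag_meanings"].split()
)
```

The integer codes plus attributes are plain netCDF-friendly data, which is why this encoding would generalize across array formats.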

dcherian avatar May 13 '25 14:05 dcherian

As of #9671, xarray supports extension-array indexes as well. Those now go into the xarray object untouched and are then written (or attempted to be written) to disk, but it seems that netCDF writing lacks support for them.

ilan-gold avatar May 14 '25 08:05 ilan-gold

Previously, these were just thrown into numpy object-dtype containers once they crossed from pandas to xarray, and were then written as fixed-size strings. That was quite a departure from the original data, but it means we now have to handle the original data type.

ilan-gold avatar May 14 '25 08:05 ilan-gold

Use ds.as_numpy() to recover previous behaviour.
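Applied to the MVCE above, that looks like the following sketch: `as_numpy()` coerces all variables and coordinates to plain numpy arrays, so the categorical dtype is dropped before writing, as it was in 2025.3.1 and earlier.

```python
import numpy as np
import pandas as pd
import xarray as xr

df = pd.DataFrame([{"ind": "C1", "val": 1.0}, {"ind": "C2", "val": 2.0}]).set_index("ind")
df.index = df.index.astype(pd.CategoricalDtype(categories=["C1", "C2"], ordered=True))
ds = df.to_xarray()

# Coerce the extension-array index (and all other data) to plain numpy arrays.
ds_np = ds.as_numpy()
assert isinstance(ds_np["ind"].data, np.ndarray)

# ds_np.to_netcdf("foo.nc")  # now writes as before; the categories are lost
```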

dcherian avatar May 27 '25 17:05 dcherian