String type casting error during concatenating

Open Pietervanhalem opened this issue 1 month ago • 2 comments

What happened?

I have a very large number of datasets (6840) that I want to concatenate over 3 dimensions. An example of one of the datasets is shown below:

Image

I concatenate them with:

        (
            xr.concat(dss, "case")
            .set_index(dict(case=["wave_seed", "site_id", "damping_case"]))
            .unstack()
        )

After concatenating, my dataset looks like this:

Image

Note that 'NC100 (OSS)' has turned into 'NC100' and 'base_case' has turned into 'base_c'. The values are truncated because the dtypes also change, from U11 and U9 to U5 and U6 respectively. If I add the following before concatenating, the issue is resolved:

                ds['site_id'] = ds['site_id'].astype("U12")
                ds['damping_case'] = ds['damping_case'].astype("U10")
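To illustrate the truncation described above, here is a minimal NumPy-only sketch. It does not reproduce the xarray bug itself; it only shows that casting to a narrower fixed-width unicode dtype silently drops characters, and that casting to a wider one keeps them. The variable names are hypothetical.

```python
import numpy as np

# Fixed-width unicode arrays: the width is inferred from the longest string.
site = np.array(["NC100 (OSS)"])   # dtype <U11
damping = np.array(["base_case"])  # dtype <U9

# Casting to a narrower width truncates silently, with no warning or error.
narrow_site = site.astype("U5")
narrow_damping = damping.astype("U6")
print(narrow_site[0], narrow_damping[0])

# Casting to a wider width (as in the workaround) preserves the full string.
wide_site = site.astype("U12")
wide_damping = damping.astype("U10")
print(wide_site[0], wide_damping[0])
```

The same silent truncation happens when a longer string is assigned into an already allocated narrower array, which is one way a wrongly computed width could produce the values seen here.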

I cannot reproduce this with a smaller number of datasets, so I am not sure how to reproduce this bug.

What did you expect to happen?

No response

Minimal Complete Verifiable Example

# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "xarray[complete]@git+https://github.com/pydata/xarray.git@main",
# ]
# ///
#
# This script automatically imports the development branch of xarray to check for issues.
# Please delete this header if you have _not_ tested this script with `uv run`!

import xarray as xr
xr.show_versions()
# your reproducer code ...

Steps to reproduce

No response

MVCE confirmation

  • [x] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [ ] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [ ] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [x] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [x] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output


Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.13.3 (tags/v3.13.3:6280bb5, Apr 8 2025, 14:47:33) [MSC v.1943 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 11
machine: AMD64
processor: Intel64 Family 6 Model 186 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('English_United States', '1252')
libhdf5: 1.14.6
libnetcdf: None

xarray: 2025.6.0
pandas: 2.3.0
numpy: 2.2.6
scipy: 1.15.3
netCDF4: None
pydap: None
h5netcdf: 1.6.1
h5py: 3.14.0
zarr: 3.0.8
cftime: None
nc_time_axis: None
iris: None
bottleneck: 1.5.0
dask: 2025.5.1
distributed: 2025.5.1
matplotlib: 3.10.3
cartopy: None
seaborn: 0.13.2
numbagg: 0.9.0
fsspec: 2025.5.1
cupy: None
pint: 0.24.4
sparse: None
flox: 9.11
numpy_groupies: 0.11.3
setuptools: 80.9.0
pip: 25.1.1
conda: None
pytest: 8.0.2
mypy: 1.16.0
IPython: 9.3.0
sphinx: 8.3.0

Pietervanhalem avatar Dec 02 '25 09:12 Pietervanhalem

Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!

welcome[bot] avatar Dec 02 '25 09:12 welcome[bot]

It looks like one of the operations (concat, set_index, unstack) calculates the new string width incorrectly. Could you check whether you still get the same error with concat on its own, or with only 2 stacked variables? Otherwise we might need access to the data files somehow (only the coordinates are really important, though).
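One way to run that check is to print the string dtypes after each step. The sketch below uses small made-up datasets (the `make_ds` and `string_dtypes` helpers and the `value` variable are hypothetical stand-ins, and the coordinates are assumed to be 1-D along `case`); on the real data, `dss` would be the original list of 6840 datasets.

```python
import numpy as np
import xarray as xr

KEYS = ["wave_seed", "site_id", "damping_case"]


def make_ds(seed, site, damping):
    """Build a tiny stand-in dataset with one entry along 'case'."""
    return xr.Dataset(
        {"value": (("case", "time"), np.zeros((1, 3)))},
        coords={
            "wave_seed": ("case", [seed]),
            "site_id": ("case", [site]),
            "damping_case": ("case", [damping]),
        },
    )


def string_dtypes(ds):
    """Return the dtypes of the two string coordinates."""
    return {name: ds[name].dtype for name in ("site_id", "damping_case")}


dss = [
    make_ds(seed, site, damping)
    for seed in (0, 1)
    for site in ("NC100 (OSS)", "NC2")
    for damping in ("base_case", "low")
]

# Apply the three operations one at a time and inspect the dtypes in between.
concatenated = xr.concat(dss, "case")
indexed = concatenated.set_index({"case": KEYS})
unstacked = indexed.unstack()

for label, ds in [
    ("concat", concatenated),
    ("set_index", indexed),
    ("unstack", unstacked),
]:
    print(label, string_dtypes(ds))
```

The first step where a width shrinks below the longest value is the one to look at more closely. Dropping one name from `KEYS` covers the "only 2 stacked variables" variant.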

Either way, I think you should be able to avoid the bug entirely by enforcing np.dtypes.StringDType: np.array(..., dtype=np.dtypes.StringDType()) (or use astype with that dtype). That allows variable-width strings, which should be much easier to use.

keewis avatar Dec 03 '25 23:12 keewis