xarray
Strings in coordinates may be truncated when saving concatenated rasters to zarr
What happened?
I would like to concatenate two dataarrays or two datasets, save the result to zarr, read it back into xarray at a later time, and have all the coordinates intact.
However, if I concatenate two dataarrays or datasets along a string-based coordinate, and the first dataarray in the concatenation has a coordinate with a shorter maximum string length than the second dataarray's, saving the concatenated dataset to zarr can truncate the strings.
This problem only arises when:
- the first dataarray's string coordinate has an `encoding["dtype"]` attribute, and
- the first dataarray's string coordinate values are shorter than the second dataarray's.
What did you expect to happen?
I expect the xarray dataset that was read from the zarr to have the string coordinate the same as the input dataset that I wrote in, with no truncation.
Minimal Complete Verifiable Example
import logging
import tempfile
from pathlib import Path

import dask.array as da
import numpy as np
import xarray as xr

logger = logging.getLogger(__name__)


def mb_raster_with_custom_band_names(bands: list[str]) -> xr.DataArray:
    """
    Generate a multi-band raster with custom band names.

    This function creates a multi-band raster with a specified list of band names. The raster is filled with random
    data and has a shape of (len(bands), 100, 100).

    :param bands: A list of strings specifying the band names for the raster.
    :return: A multi-band raster with custom band names as an xarray DataArray.
    """
    n_x = 100
    n_y = 100
    data_with_custom_band_names = xr.DataArray(
        (da.random.random_sample(size=(len(bands), n_x, n_y)) * 255).astype("float32"),
        coords={"band": bands, "y": np.arange(n_y), "x": np.arange(n_x)},
    )
    return data_with_custom_band_names


def zarr_round_trip(raster: xr.DataArray | xr.Dataset) -> xr.Dataset:
    """Simple function to test the round trip of a raster to zarr and back."""
    with tempfile.TemporaryDirectory() as tmpdirname:
        tmp_zarr_path = Path(tmpdirname) / "tmp_zarr.zarr"
        raster.to_zarr(tmp_zarr_path, mode="w")
        round_trip_raster = xr.open_dataset(tmp_zarr_path, engine="zarr")
        return round_trip_raster


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)

    # Make a multi-band raster with short band names.
    rgb_raster = mb_raster_with_custom_band_names(["blue", "green", "red"])
    # Force the band encoding dtype to be equal to the band value dtype. This is what triggers the band name
    # truncation when saving to zarr.
    rgb_raster.band.encoding["dtype"] = rgb_raster.band.dtype

    # Make another multi-band raster with long band names.
    long_rgb_raster = mb_raster_with_custom_band_names(["the_sky_is_blue", "the_grass_is_green", "the_rose_is_red"])
    long_rgb_raster.band.encoding["dtype"] = long_rgb_raster.band.dtype

    # Concatenate the two rasters with xr.concat (short-band-name raster first).
    concat_raster_short_first = xr.concat([rgb_raster, long_rgb_raster], dim="band")
    logger.info(f"{concat_raster_short_first.band.values=}")
    # array(['blue', 'green', 'red', 'the_sky_is_blue', 'the_grass_is_green', 'the_rose_is_red'], dtype='<U18')
    logger.info(f"{concat_raster_short_first.band.dtype=}")
    # dtype('<U18')
    logger.info(f"{concat_raster_short_first.band.encoding['dtype']=}")
    # dtype('<U5')

    concat_raster_short_first_reread = zarr_round_trip(concat_raster_short_first)
    logger.info(f"{concat_raster_short_first_reread.band.values=}")
    # array(['blue', 'green', 'red', 'the_s', 'the_g', 'the_r'], dtype='<U5')
    logger.info(f"{concat_raster_short_first_reread.band.dtype=}")
    # dtype('<U5')
    logger.info(f"{concat_raster_short_first_reread.band.encoding['dtype']=}")
    # dtype('<U5')
MVCE confirmation
- [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [X] Complete example — the example is self-contained, including all data and the text of any traceback.
- [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
- [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies.
Relevant log output
INFO:__main__:concat_raster_short_first.band.values=array(['blue', 'green', 'red', 'the_sky_is_blue', 'the_grass_is_green',
'the_rose_is_red'], dtype='<U18')
INFO:__main__:concat_raster_short_first.band.dtype=dtype('<U18')
INFO:__main__:concat_raster_short_first.band.encoding['dtype']=dtype('<U5')
INFO:__main__:concat_raster_short_first_reread.band.values=array(['blue', 'green', 'red', 'the_s', 'the_g', 'the_r'], dtype='<U5')
INFO:__main__:concat_raster_short_first_reread.band.dtype=dtype('<U5')
INFO:__main__:concat_raster_short_first_reread.band.encoding['dtype']=dtype('<U5')
Anything else we need to know?
The order of the dataarrays in xr.concat matters. If you run xr.concat([rgb_raster, long_rgb_raster], dim="band"), the truncation happens; but if you run xr.concat([long_rgb_raster, rgb_raster], dim="band"), it does not.
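The asymmetry appears to come from xr.concat carrying over the encoding of the first operand's coordinate. A minimal sketch (hypothetical variable names, no zarr round trip needed) that shows the stale `encoding["dtype"]` without ever touching disk:

```python
import numpy as np
import xarray as xr

short_da = xr.DataArray(np.zeros(2), coords={"band": ["blue", "red"]}, dims="band")
short_da.band.encoding["dtype"] = short_da.band.dtype  # '<U4'

long_da = xr.DataArray(
    np.zeros(2), coords={"band": ["the_sky_is_blue", "the_rose_is_red"]}, dims="band"
)
long_da.band.encoding["dtype"] = long_da.band.dtype  # '<U15'

# concat keeps the encoding of the first operand, not the widest one
short_first = xr.concat([short_da, long_da], dim="band")
long_first = xr.concat([long_da, short_da], dim="band")
print(short_first.band.encoding["dtype"])  # <U4  -> too narrow, truncates on write
print(long_first.band.encoding["dtype"])   # <U15 -> wide enough, no truncation
```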
I attempted to unwrap the call chain when we save a dataset or dataarray to zarr:
- In xarray.DataArray.to_zarr(), the xarray.DataArray is implicitly converted to an xarray.Dataset before it is saved to zarr.
- Because xarray.DataArray.to_dataset() is called with no additional arguments, it calls xarray.DataArray._to_dataset_whole().
- In xarray.DataArray._to_dataset_whole(), the coordinates attribute (a private attr called ._coords) of the dataarray is implicitly copied into the variables of the dataset. So whatever ends up in the dataset variables came from the ._coords private attr of the dataarray.
- The next clue as to how the dataset variables get saved to zarr is in xarray.backends.ZarrStore. Specifically, the dataset variables go through the encode_zarr_variable function, which in turn calls xarray.conventions.encode_cf_variable. That function just uses whatever encoding the variable carries in the dataset, which came from the coordinate encoding of the dataarray.

Looks like a similar issue was handled in 2014 (https://github.com/pydata/xarray/issues/217); this bug appears to be a corner case where the encoding attr of a particular coordinate is left intact during concatenation.
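Once encode_cf_variable applies the stale fixed-width dtype from the encoding, the truncation itself is just NumPy's fixed-width string cast behavior, which can be shown in isolation:

```python
import numpy as np

# Casting to a narrower fixed-width Unicode dtype (as driven by a stale
# encoding["dtype"]) silently truncates longer strings.
values = np.array(["blue", "the_sky_is_blue"])  # dtype('<U15')
encoded = values.astype("<U5")                  # width carried over from the shorter array
print(encoded.tolist())  # ['blue', 'the_s']
```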
Environment
INSTALLED VERSIONS
commit: None
python: 3.11.8 (main, Mar 12 2024, 11:52:02) [GCC 12.2.0]
python-bits: 64
OS: Linux
OS-release: 6.6.26-linuxkit
machine: x86_64
processor:
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.1

xarray: 2024.5.0
pandas: 2.1.4
numpy: 1.23.5
scipy: 1.10.1
netCDF4: 1.6.3
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
zarr: 2.16.1
cftime: 1.6.2
nc_time_axis: None
iris: None
bottleneck: 1.3.7
dask: 2024.4.2
distributed: 2024.4.2
matplotlib: 3.7.1
cartopy: 0.22.0
seaborn: 0.12.2
numbagg: 0.2.2
fsspec: 2024.2.0
cupy: None
pint: None
sparse: 0.13.0
flox: 0.6.10
numpy_groupies: 0.9.20
setuptools: 69.0.3
pip: 23.3.2
conda: None
pytest: 7.4.3
mypy: 1.8.0
IPython: 8.12.0
sphinx: 7.2.6
Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!
That does look confusing & frustrating.
A workaround would be .drop_encoding(). (When I'm using xarray for my own work, I generally always do this, but not sure that's necessarily correct...)
@pydata/xarray do we have a view / plan for how to handle encoding? We've discussed removing it in the past. I think it's one of those issues that experienced people can get around but that trips up new users, which means it's generally underweighted in how the core team & contributors (who are more experienced than average) prioritize issues.
Is there a case for defaulting to dropping encoding by default on read?
This issue is also present when using to_netcdf with the netcdf4 and h5netcdf engines.
@max-sixty's suggestion to drop_encoding() before writing the file to disk also resolves the issue for me.
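A minimal sketch of that workaround (drop_encoding() is available in recent xarray versions; whether dropping all encoding is appropriate depends on your data):

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(np.zeros(2), coords={"band": ["blue", "red"]}, dims="band")
arr.band.encoding["dtype"] = arr.band.dtype  # stale fixed-width dtype

# drop_encoding returns a copy with encoding cleared on the array and its coords,
# so a later to_zarr infers the on-disk string width from the data itself.
clean = arr.drop_encoding()
print(clean.band.encoding)  # {}
```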