xarray
Strings in coordinates may be truncated when saving concatenated rasters to zarr
What happened?
I would like to concatenate two dataarrays or two datasets, save the result to zarr, read it back into xarray at a later time, and have all the coordinates intact.
However, if I concatenate two dataarrays or datasets along a string-based coordinate, and the first dataarray in the concatenation has a coordinate with a shorter maximum string length than the second dataarray's, saving the concatenated dataset to zarr can truncate the strings.
This problem only arises when:
- the first dataarray's string coordinate has an `encoding["dtype"]` attribute, and
- the first dataarray's string coordinate values are shorter than the second dataarray's.
What did you expect to happen?
I expect the xarray dataset that was read from the zarr to have the string coordinate the same as the input dataset that I wrote in, with no truncation.
Minimal Complete Verifiable Example
import logging
import tempfile
from pathlib import Path

import dask.array as da
import numpy as np
import xarray as xr

logger = logging.getLogger(__name__)


def mb_raster_with_custom_band_names(bands: list[str]) -> xr.DataArray:
    """
    Generate a multi-band raster with custom band names.

    This function creates a multi-band raster with a specified list of band names. The raster is filled with random
    data and has a shape of (len(bands), 100, 100).

    :param bands: A list of strings specifying the band names for the raster.
    :return: A multi-band raster with custom band names as an xarray DataArray.
    """
    n_x = 100
    n_y = 100
    data_with_custom_band_names = xr.DataArray(
        (da.random.random_sample(size=(len(bands), n_x, n_y)) * 255).astype("float32"),
        coords={"band": bands, "y": np.arange(n_y), "x": np.arange(n_x)},
    )
    return data_with_custom_band_names


def zarr_round_trip(raster: xr.DataArray | xr.Dataset) -> xr.Dataset:
    """Simple function to test the round trip of a raster to zarr and back."""
    with tempfile.TemporaryDirectory() as tmpdirname:
        tmp_zarr_path = Path(tmpdirname) / "tmp_zarr.zarr"
        raster.to_zarr(tmp_zarr_path, mode="w")
        round_trip_raster = xr.open_dataset(tmp_zarr_path, engine="zarr")
        return round_trip_raster


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)

    # Make a multi-band raster with short band names.
    rgb_raster = mb_raster_with_custom_band_names(["blue", "green", "red"])
    # Force the band encoding dtype to be equal to the band value dtype. This is what triggers the band name
    # truncation when saving to zarr.
    rgb_raster.band.encoding["dtype"] = rgb_raster.band.dtype

    # Make another multi-band raster with long band names.
    long_rgb_raster = mb_raster_with_custom_band_names(["the_sky_is_blue", "the_grass_is_green", "the_rose_is_red"])
    long_rgb_raster.band.encoding["dtype"] = long_rgb_raster.band.dtype

    # Concatenate the two rasters with xr.concat (short-band-name raster first).
    concat_raster_short_first = xr.concat([rgb_raster, long_rgb_raster], dim="band")
    logger.info(f"{concat_raster_short_first.band.values=}")
    # array(['blue', 'green', 'red', 'the_sky_is_blue', 'the_grass_is_green', 'the_rose_is_red'], dtype='<U18')
    logger.info(f"{concat_raster_short_first.band.dtype=}")
    # dtype('<U18')
    logger.info(f"{concat_raster_short_first.band.encoding['dtype']=}")
    # dtype('<U5')

    concat_raster_short_first_reread = zarr_round_trip(concat_raster_short_first)
    logger.info(f"{concat_raster_short_first_reread.band.values=}")
    # array(['blue', 'green', 'red', 'the_s', 'the_g', 'the_r'], dtype='<U5')
    logger.info(f"{concat_raster_short_first_reread.band.dtype=}")
    # dtype('<U5')
    logger.info(f"{concat_raster_short_first_reread.band.encoding['dtype']=}")
    # dtype('<U5')
MVCE confirmation
- [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [X] Complete example — the example is self-contained, including all data and the text of any traceback.
- [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
- [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies.
Relevant log output
INFO:__main__:concat_raster_short_first.band.values=array(['blue', 'green', 'red', 'the_sky_is_blue', 'the_grass_is_green',
'the_rose_is_red'], dtype='<U18')
INFO:__main__:concat_raster_short_first.band.dtype=dtype('<U18')
INFO:__main__:concat_raster_short_first.band.encoding['dtype']=dtype('<U5')
INFO:__main__:concat_raster_short_first_reread.band.values=array(['blue', 'green', 'red', 'the_s', 'the_g', 'the_r'], dtype='<U5')
INFO:__main__:concat_raster_short_first_reread.band.dtype=dtype('<U5')
INFO:__main__:concat_raster_short_first_reread.band.encoding['dtype']=dtype('<U5')
Anything else we need to know?
The order of the dataarrays in xr.concat matters. If you run xr.concat([rgb_raster, long_rgb_raster], dim="band"), the truncation happens; but if you run xr.concat([long_rgb_raster, rgb_raster], dim="band"), it does not.
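The asymmetry appears to come from xr.concat carrying over the encoding of the first operand's coordinate. A minimal sketch (hypothetical variable names, no zarr round trip needed) that shows the stale `encoding["dtype"]` without ever touching disk:

```python
import numpy as np
import xarray as xr

short_da = xr.DataArray(np.zeros(2), coords={"band": ["blue", "red"]}, dims="band")
short_da.band.encoding["dtype"] = short_da.band.dtype  # '<U4'

long_da = xr.DataArray(
    np.zeros(2), coords={"band": ["the_sky_is_blue", "the_rose_is_red"]}, dims="band"
)
long_da.band.encoding["dtype"] = long_da.band.dtype  # '<U15'

# concat keeps the encoding of the first operand, not the widest one
short_first = xr.concat([short_da, long_da], dim="band")
long_first = xr.concat([long_da, short_da], dim="band")
print(short_first.band.encoding["dtype"])  # <U4  -> too narrow, truncates on write
print(long_first.band.encoding["dtype"])   # <U15 -> wide enough, no truncation
```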
I attempted to unwrap the call chain when we save a dataset or dataarray to zarr:
- In xarray.DataArray.to_zarr(), the xarray.DataArray is implicitly converted to an xarray.Dataset before it is saved to zarr.
- Because xarray.DataArray.to_dataset() is called with no additional arguments, it calls xarray.DataArray._to_dataset_whole().
- In xarray.DataArray._to_dataset_whole(), the coordinates attribute (a private attr called ._coords) of the dataarray is implicitly copied into the variables of the dataset. So whatever ends up in the dataset variables came from the ._coords private attr of the dataarray.
- The next clue as to how the dataset variables get saved to zarr is in xarray.backends.ZarrStore. Specifically, the dataset variables go through the encode_zarr_variable function, which in turn calls xarray.conventions.encode_cf_variable. That function just uses whatever encoding the variable carries in the dataset, which came from the coordinate encoding of the dataarray.

Looks like a similar issue was handled in 2014 (https://github.com/pydata/xarray/issues/217); this bug appears to be a corner case where the encoding attr of a particular coordinate is left intact during concatenation.
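Once encode_cf_variable applies the stale fixed-width dtype from the encoding, the truncation itself is just NumPy's fixed-width string cast behavior, which can be shown in isolation:

```python
import numpy as np

# Casting to a narrower fixed-width Unicode dtype (as driven by a stale
# encoding["dtype"]) silently truncates longer strings.
values = np.array(["blue", "the_sky_is_blue"])  # dtype('<U15')
encoded = values.astype("<U5")                  # width carried over from the shorter array
print(encoded.tolist())  # ['blue', 'the_s']
```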
Environment
INSTALLED VERSIONS
commit: None
python: 3.11.8 (main, Mar 12 2024, 11:52:02) [GCC 12.2.0]
python-bits: 64
OS: Linux
OS-release: 6.6.26-linuxkit
machine: x86_64
processor:
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.1

xarray: 2024.5.0
pandas: 2.1.4
numpy: 1.23.5
scipy: 1.10.1
netCDF4: 1.6.3
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
zarr: 2.16.1
cftime: 1.6.2
nc_time_axis: None
iris: None
bottleneck: 1.3.7
dask: 2024.4.2
distributed: 2024.4.2
matplotlib: 3.7.1
cartopy: 0.22.0
seaborn: 0.12.2
numbagg: 0.2.2
fsspec: 2024.2.0
cupy: None
pint: None
sparse: 0.13.0
flox: 0.6.10
numpy_groupies: 0.9.20
setuptools: 69.0.3
pip: 23.3.2
conda: None
pytest: 7.4.3
mypy: 1.8.0
IPython: 8.12.0
sphinx: 7.2.6
Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!
That does look confusing & frustrating.
A workaround would be .drop_encoding(). (When I'm using xarray for my own work, I generally always do this, but not sure that's necessarily correct...)
@pydata/xarray do we have a view / plan for how to handle encoding? We've discussed removing it in the past. I think it's one of those issues that experienced people can get around but that trips up new users, which means it's generally underweighted in how the core team & contributors (who are more experienced than average) prioritize issues.
Is there a case for defaulting to dropping encoding by default on read?
This issue is also present when using to_netcdf with the netcdf4 and h5netcdf engines.
@max-sixty's suggestion to drop_encoding() before writing the file to disk also resolves the issue for me.
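A minimal sketch of that workaround (drop_encoding() is available in recent xarray versions; whether dropping all encoding is appropriate depends on your data):

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(np.zeros(2), coords={"band": ["blue", "red"]}, dims="band")
arr.band.encoding["dtype"] = arr.band.dtype  # stale fixed-width dtype

# drop_encoding returns a copy with encoding cleared on the array and its coords,
# so a later to_zarr infers the on-disk string width from the data itself.
clean = arr.drop_encoding()
print(clean.band.encoding)  # {}
```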