
to_zarr() does not maintain time encoding when appending to an existing store

Open lrntct opened this issue 6 months ago • 6 comments

What happened?

When writing a dataset to a new zarr store, the time encoding is correctly stored and retrieved. However, when appending to an existing store, to_zarr() rejects an explicit encoding argument, and the encoding already recorded in the store is not applied to the appended data.

What did you expect to happen?

I expected the encoding to be consistent between the first write and the subsequent ones.

Minimal Complete Verifiable Example

import tempfile
from datetime import datetime, timedelta
import numpy as np
import xarray as xr
import zarr

with tempfile.TemporaryDirectory() as temp_dir:
    storage = temp_dir
    # Test parameters
    base_time = datetime(year=1, month=1, day=1)
    time_dtype = "datetime64[ms]"
    time_unit = "milliseconds since 1970-01-01T00:00:00"

    time_encoding = {
        "units": time_unit,
        "dtype": time_dtype,
    }

    print(f"Base time: {base_time}")
    print(f"Time encoding: {time_encoding}")

    # Write first timestep
    print("\n--- Writing first timestep ---")
    sim_time1 = base_time + timedelta(minutes=1)
    time_coord1 = np.array([np.datetime64(sim_time1, "ms")], dtype=time_dtype)

    data1 = xr.DataArray(
        data=np.array([[[1.0, 2.0], [3.0, 4.0]]]),
        coords={"time": time_coord1, "y": [0, 1], "x": [0, 1]},
        dims=["time", "y", "x"],
        name="test_var"
    )

    ds1 = xr.Dataset({"test_var": data1})

    ds1.to_zarr(
        storage,
        encoding={"time": time_encoding},  # Encoding provided here
        mode="w",
    )

    print(f"Written: {sim_time1}")

    # Write second timestep using append_dim (this is where the bug occurs)
    print("\n--- Writing second timestep with append_dim ---")
    sim_time2 = base_time + timedelta(minutes=2)
    time_coord2 = np.array([np.datetime64(sim_time2, "ms")], dtype=time_dtype)

    data2 = xr.DataArray(
        data=np.array([[[5.0, 6.0], [7.0, 8.0]]]),
        coords={"time": time_coord2, "y": [0, 1], "x": [0, 1]},
        dims=["time", "y", "x"],
        name="test_var"
    )

    ds2 = xr.Dataset({"test_var": data2})

    ds2.to_zarr(
        storage,
        append_dim="time",
        mode="a",
        # NOTE: Cannot pass encoding={"time": time_encoding} here!
        # xarray raises: "variable 'time' already exists, but encoding was provided"
    )

    print(f"Written: {sim_time2}")

    # Read back and demonstrate the bug
    print("\n--- Reading back data ---")
    ds_read = xr.open_zarr(storage)

    print(f"Time coordinate values: {ds_read['time'].values}")
    print(f"Time dtype: {ds_read['time'].dtype}")

    # Expected vs actual
    expected_times = [
        np.datetime64(sim_time1, "ms"),
        np.datetime64(sim_time2, "ms")
    ]
    actual_times = ds_read['time'].values

    print(f"\nExpected: {expected_times}")
    print(f"Actual:   {actual_times}")

    # Check if bug is present
    assert np.array_equal(expected_times, actual_times)

MVCE confirmation

  • [x] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [x] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [x] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [x] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [x] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

Base time: 0001-01-01 00:00:00
Time encoding: {'units': 'milliseconds since 1970-01-01T00:00:00', 'dtype': 'datetime64[ms]'}

--- Writing first timestep ---
/home/laurent/software/itzi/.venv/lib/python3.12/site-packages/zarr/api/asynchronous.py:228: UserWarning: Consolidated metadata is currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations and may change in the future.
  warnings.warn(
Written: 0001-01-01 00:01:00

--- Writing second timestep with append_dim ---
/home/laurent/software/itzi/.venv/lib/python3.12/site-packages/zarr/api/asynchronous.py:228: UserWarning: Consolidated metadata is currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations and may change in the future.
  warnings.warn(
Written: 0001-01-01 00:02:00

--- Reading back data ---
Time coordinate values: ['0001-01-01T00:01:00.000' '1970-01-01T00:00:00.000']
Time dtype: datetime64[ms]

Expected: [np.datetime64('0001-01-01T00:01:00.000'), np.datetime64('0001-01-01T00:02:00.000')]
Actual:   ['0001-01-01T00:01:00.000' '1970-01-01T00:00:00.000']
Traceback (most recent call last):
  File "/home/laurent/software/itzi/xarray_zarr_time.py", line 86, in <module>
    assert np.array_equal(expected_times, actual_times)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
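
For reference, the arithmetic behind the log above can be sketched with the standard library alone. This assumes the store holds plain integer offsets in the units from the report ("milliseconds since 1970-01-01T00:00:00"); `to_offset` and `from_offset` are hypothetical helper names, not xarray or zarr functions.

```python
from datetime import datetime, timedelta

# Reference date and step taken from the units string in the report:
# "milliseconds since 1970-01-01T00:00:00"
EPOCH = datetime(1970, 1, 1)
STEP = timedelta(milliseconds=1)


def to_offset(t: datetime) -> int:
    """Integer count of milliseconds between EPOCH and t (hypothetical helper)."""
    return (t - EPOCH) // STEP


def from_offset(n: int) -> datetime:
    """Inverse of to_offset (hypothetical helper)."""
    return EPOCH + n * STEP


base_time = datetime(year=1, month=1, day=1)
sim_time1 = base_time + timedelta(minutes=1)
sim_time2 = base_time + timedelta(minutes=2)

print(to_offset(sim_time1))  # -62135596740000
print(to_offset(sim_time2))  # -62135596680000
print(from_offset(to_offset(sim_time2)))  # 0001-01-01 00:02:00
```

Under that assumption, the second value read back (1970-01-01T00:00:00.000) is exactly the reference date, which corresponds to a stored offset of 0 rather than -62135596680000.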

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.12.3 (main, Jun 18 2025, 17:59:45) [GCC 13.3.0]
python-bits: 64
OS: Linux
OS-release: 6.14.0-27-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.2
libnetcdf: 4.9.4-development

xarray: 2025.7.1
pandas: 2.3.1
numpy: 2.3.2
scipy: 1.16.1
netCDF4: 1.7.2
pydap: 3.5.5
h5netcdf: 1.6.4
h5py: 3.14.0
zarr: 3.1.1
cftime: 1.6.4.post1
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2025.7.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 80.9.0
pip: None
conda: None
pytest: 8.4.1
mypy: None
IPython: None
sphinx: None

lrntct avatar Aug 13 '25 17:08 lrntct

Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!

welcome[bot] avatar Aug 13 '25 17:08 welcome[bot]

Thank you for reporting @lrntct! I agree that this is a bug. I tried setting the encoding directly on the time variable in the data array:


import numpy as np
import xarray as xr

storage = "foo"

time1 = "2001-01-01 00:01"
time2 = "2001-01-01 00:02"

time_encoding = {
    "dtype": "datetime64[ms]",
    "units": "milliseconds since 1970-01-01T00:00:00",
}

data1 = xr.DataArray(
    data=np.array([[[1.0, 2.0], [3.0, 4.0]]]),
    coords={
        "time": np.array([np.datetime64(time1, "ms")]),
        "y": [0, 1],
        "x": [0, 1]
    },
    dims=["time", "y", "x"],
    name="test_var"
)
data1.time.encoding = time_encoding

ds1 = xr.Dataset({"test_var": data1})

ds1.to_zarr(
    storage,
    mode="w",
)

data2 = xr.DataArray(
    data=np.array([[[5.0, 6.0], [7.0, 8.0]]]),
    coords={"time": np.array([np.datetime64(time2, "ms")]), "y": [0, 1], "x": [0, 1]},
    dims=["time", "y", "x"],
    name="test_var"
)
data2.time.encoding = time_encoding

ds2 = xr.Dataset({"test_var": data2})

ds2.to_zarr(
    storage,
    append_dim="time",
    mode="a",
)

ds_read = xr.open_zarr(storage)

print(f"Time coordinate values: {ds_read['time'].values}")

output:

Time coordinate values: ['2001-01-01T00:01:00.000' '1970-01-01T00:00:00.000']

I think the issue is that it's not clear whether it is xarray or zarr's responsibility to handle time encoding like this, but I'm not quite sure.

jsignell avatar Aug 26 '25 18:08 jsignell

Thank you for looking at this @jsignell! If it helps, my workaround for now is to write directly with zarr (this assumes the appended time coordinate holds only one value):

def _zarr_append(self, store, dataset: xr.Dataset) -> None:
    """Zarr append using direct indexing."""
    # Open the zarr group
    z_group = zarr.open_group(store, mode="r+")

    # Get the new time value
    new_time = dataset["time"].values[0]

    # Append time coordinate
    current_time_size = z_group["time"].shape[0]
    z_group["time"].resize(current_time_size + 1)
    z_group["time"][current_time_size] = new_time

    # Append data for each variable
    for var_name, data_array in dataset.data_vars.items():
        current_shape = z_group[var_name].shape
        new_shape = (current_shape[0] + 1,) + current_shape[1:]
        z_group[var_name].resize(new_shape)
        # Use direct assignment
        z_group[var_name][current_shape[0]] = data_array.values[0]

lrntct avatar Aug 26 '25 18:08 lrntct

This might be related to a zarr bug where it was (and sometimes still is) ignoring the config that was passed to it when appending: https://github.com/zarr-developers/zarr-python/issues/2979

I could see that being related to the encoding.

ianhi avatar Aug 27 '25 20:08 ianhi

I wonder if #9154 and #3942 are related?

lrntct avatar Nov 03 '25 13:11 lrntct

Yeah I was wondering if those ones were related too, but they feel kind of different I think. In this case there is no need for xarray/zarr-python to try to figure out the right encoding for the data. It should already know it!
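
One way the observed value could arise is sketched here, under assumptions that are not confirmed anywhere in this thread. Suppose the appended chunk were encoded with freshly chosen units (the `fresh_units` string is made up for illustration) and the resulting integer were then written into a store whose metadata still declares the original units. The helpers `parse_units`, `encode`, and `decode` are hypothetical and use only the standard library; they are not xarray or zarr functions.

```python
import re
from datetime import datetime, timedelta

# Hypothetical helpers for illustration; not xarray or zarr functions.
STEPS = {
    "milliseconds": timedelta(milliseconds=1),
    "seconds": timedelta(seconds=1),
    "minutes": timedelta(minutes=1),
}


def parse_units(units):
    """Split '<unit> since <ISO date>' into (step, reference datetime)."""
    match = re.fullmatch(r"(\w+) since (\S+)", units)
    if match is None or match.group(1) not in STEPS:
        raise ValueError(f"unsupported units: {units!r}")
    return STEPS[match.group(1)], datetime.fromisoformat(match.group(2))


def encode(t, units):
    """Integer number of steps between the reference date and t."""
    step, reference = parse_units(units)
    return (t - reference) // step


def decode(n, units):
    """Datetime that lies n steps after the reference date."""
    step, reference = parse_units(units)
    return reference + n * step


stored_units = "milliseconds since 1970-01-01T00:00:00"  # from the report
# Assumed for illustration: units chosen afresh for the appended chunk.
fresh_units = "minutes since 0001-01-01T00:02:00"

sim_time2 = datetime(1, 1, 1, 0, 2)

# Encoding with one set of units and decoding with another loses the value.
raw = encode(sim_time2, fresh_units)
print(raw)                        # 0
print(decode(raw, stored_units))  # 1970-01-01 00:00:00

# Reusing the units already recorded in the store round-trips correctly.
consistent = encode(sim_time2, stored_units)
print(decode(consistent, stored_units))  # 0001-01-01 00:02:00
```

If something like this is what happens, the fix would be for the append path to reuse the units already recorded in the store instead of choosing new ones, which matches the point that the encoding is already known.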

jsignell avatar Nov 06 '25 20:11 jsignell