Storing np.datetime64 attributes in zarr files

Open CarlAndersson opened this issue 1 year ago • 5 comments

What happened?

I have a dataset with an attribute which is a time, stored as a np.datetime64 value with nanosecond precision. Saving this to a zarr store and loading the dataset again drops the type of this attribute and loads it as an integer.

Example dataset:

<xarray.DataArray (x: 5)> Size: 20B
array([0, 1, 2, 3, 4])
Dimensions without coordinates: x
Attributes:
    time:     2024-10-02T07:39:39.000000000

gets loaded back as

<xarray.DataArray (x: 5)> Size: 20B
[5 values with dtype=int32]
Dimensions without coordinates: x
Attributes:
    time:     1727854779000000000

Using second precision for the datetime64 (instead of nanosecond) raises an error during JSON serialization, because the value gets converted into a datetime.datetime object at some point.
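
Both outcomes are consistent with how NumPy converts datetime64 scalars to plain Python objects. The sketch below shows that conversion using only NumPy. That the zarr backend relies on `.item()` (or an equivalent conversion) for attribute values is an assumption here, not something confirmed in this report.

```python
import datetime

import numpy as np

# Same instant as in the example dataset, at two different precisions.
ns_value = np.datetime64("2024-10-02T07:39:39", "ns")
s_value = np.datetime64("2024-10-02T07:39:39", "s")

# .item() converts a NumPy scalar to the closest built-in Python object.
# datetime.datetime only has microsecond resolution, so a nanosecond value
# comes back as an integer count of nanoseconds since the epoch.
ns_item = ns_value.item()

# A second-precision value fits into datetime.datetime, which the json
# module cannot serialize without a custom encoder.
s_item = s_value.item()

print(type(ns_item).__name__, ns_item)
print(type(s_item).__name__, s_item)
```

If that assumption holds, the integer is JSON-serializable and is written silently, while the datetime.datetime object triggers the TypeError shown in the log output.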

What did you expect to happen?

The time gets stored and read back as a np.datetime64 value.

Minimal Complete Verifiable Example

import xarray as xr
import numpy as np

# Nanosecond precision: writing succeeds, but the attribute is read back as an integer.
arr = xr.DataArray(
    np.arange(5),
    dims="x",
    attrs={"time": np.datetime64("now", "ns")},
)
print(arr)
arr.to_zarr("temp.zarr", mode="w")
print(xr.open_dataarray("temp.zarr", engine="zarr"))

# Second precision: writing raises the TypeError shown in the log output below.
arr = xr.DataArray(
    np.arange(5),
    dims="x",
    attrs={"time": np.datetime64("now", "s")},
)
print(arr)
arr.to_zarr("temp.zarr", mode="w")

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

Traceback (most recent call last):
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\xarray\backends\zarr.py", line 395, in _put_attrs
    zarr_obj.attrs.put(attrs)
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\zarr\attrs.py", line 124, in put
    self._write_op(self._put_nosync, d)
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\zarr\attrs.py", line 83, in _write_op
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\zarr\attrs.py", line 150, in _put_nosync
    self.store[self.key] = json_dumps(d)
                           ^^^^^^^^^^^^^
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\zarr\util.py", line 69, in json_dumps
    return json.dumps(
           ^^^^^^^^^^^
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\json\__init__.py", line 238, in dumps
    **kw).encode(obj)
          ^^^^^^^^^^^
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\json\encoder.py", line 202, in encode
    chunks = list(chunks)
             ^^^^^^^^^^^^
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\json\encoder.py", line 432, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\json\encoder.py", line 406, in _iterencode_dict
    yield from chunks
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\json\encoder.py", line 439, in _iterencode
    o = _default(o)
        ^^^^^^^^^^^
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\zarr\util.py", line 64, in default
    return json.JSONEncoder.default(self, o)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\json\encoder.py", line 180, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type datetime is not JSON serializable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\xarray\core\dataarray.py", line 4355, in to_zarr
    return to_zarr(  # type: ignore[call-overload,misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\xarray\backends\api.py", line 1784, in to_zarr
    dump_to_store(dataset, zstore, writer, encoding=encoding)
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\xarray\backends\api.py", line 1467, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\xarray\backends\zarr.py", line 720, in store
    self.set_variables(
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\xarray\backends\zarr.py", line 831, in set_variables
    zarr_array = _put_attrs(zarr_array, encoded_attrs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\xarray\backends\zarr.py", line 397, in _put_attrs
    raise TypeError("Invalid attribute in Dataset.attrs.") from e
TypeError: Invalid attribute in Dataset.attrs.

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.12.6 | packaged by conda-forge | (main, Sep 30 2024, 17:48:58) [MSC v.1941 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('Swedish_Sweden', '1252')
libhdf5: None
libnetcdf: None

xarray: 2024.9.0
pandas: 2.2.3
numpy: 2.1.1
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
zarr: 2.18.3
cftime: None
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: None
pip: None
conda: None
pytest: None
mypy: None
IPython: None
sphinx: None

CarlAndersson avatar Oct 02 '24 07:10 CarlAndersson

Do we want to have arbitrary python objects stored in attrs? We serialize to json so arguably need to constrain ourselves to types that are JSON-compatible...

max-sixty avatar Oct 02 '24 08:10 max-sixty

The question is, would zarr be able to store datetimes without encoding? If so, I believe it may be possible to extend the zarr backend specifically to allow this (though I'm not sure whether that would make the encoding machinery too complicated).

keewis avatar Oct 02 '24 08:10 keewis

We could of course serialize and deserialize into our own proprietary format. But I'm not sure what the interface would be?

max-sixty avatar Oct 02 '24 09:10 max-sixty

In this case I was just wondering whether we can get away with not serializing datetimes at all (but only for the zarr backend, if the zarr format supports this).

I agree that serializing attributes might be useful (see the many CRS representations, for example) but potentially too complex at this point. Also, a custom format convention both means a lot of work and won't be compatible with other libraries, especially from other languages.

keewis avatar Oct 02 '24 09:10 keewis

For datetime64 arrays in data_vars or coords, they get encoded as integers with a "custom" format, with e.g. "seconds since 1970-01-01" written to .zattrs/units. So in some sense there is already a format convention for datetimes; it's just not used for attrs.

CarlAndersson avatar Oct 07 '24 06:10 CarlAndersson
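
As an illustration of the last comment, here is a minimal sketch of how a similar "integer plus units" convention might be applied by hand to a single attribute before writing. The helper names `encode_time_attr` and `decode_time_attr`, the dictionary layout, and the fixed "nanoseconds since 1970-01-01" units string are all hypothetical. None of this is xarray or zarr API, and xarray does not apply it to attrs.

```python
import json

import numpy as np

# Reference epoch and units string; both are assumptions made for this sketch.
EPOCH = np.datetime64("1970-01-01T00:00:00", "ns")
UNITS = "nanoseconds since 1970-01-01"


def encode_time_attr(value):
    """Hypothetical helper: turn a np.datetime64 into a JSON-compatible dict."""
    offset = value.astype("datetime64[ns]") - EPOCH
    return {"value": int(offset.astype("int64")), "units": UNITS}


def decode_time_attr(encoded):
    """Hypothetical inverse of encode_time_attr."""
    if encoded["units"] != UNITS:
        raise ValueError(f"unsupported units: {encoded['units']!r}")
    return EPOCH + np.timedelta64(encoded["value"], "ns")


original = np.datetime64("2024-10-02T07:39:39", "s")

# The encoded form contains only an int and a str, so json.dumps accepts it.
stored = json.dumps(encode_time_attr(original))
restored = decode_time_attr(json.loads(stored))

print(stored)
print(restored)
```

The decoded value always has nanosecond precision regardless of the precision of the input, so the original unit is not preserved. A real convention would also have to decide how readers in other libraries or languages recognize such an attribute as a time.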