Storing np.datetime64 attributes in zarr files
What happened?
I have a dataset with an attribute which is a time, stored as a np.datetime64 value with nanosecond precision. Saving this to a zarr store and loading the dataset again drops the type of this attribute and loads it as an integer.
Example dataset:
<xarray.DataArray (x: 5)> Size: 20B
array([0, 1, 2, 3, 4])
Dimensions without coordinates: x
Attributes:
time: 2024-10-02T07:39:39.000000000
gets loaded back as
<xarray.DataArray (x: 5)> Size: 20B
[5 values with dtype=int32]
Dimensions without coordinates: x
Attributes:
time: 1727854779000000000
Using second precision for the datetime64 (instead of nanosecond) raises an error during JSON serialization, because the value gets converted into a datetime.datetime at some point.
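The difference between the two precisions can be reproduced with NumPy alone. The sketch below assumes that the attribute is converted to a Python scalar with `.item()` somewhere before it reaches the JSON encoder; that is an assumption about the code path, not something confirmed here. It only shows that such a conversion would produce exactly the two outcomes described above.

```python
import datetime
import json

import numpy as np

stamp = "2024-10-02T07:39:39"

# Nanosecond precision: .item() returns a plain int (nanoseconds since the epoch).
ns_value = np.datetime64(stamp, "ns").item()
# Second precision: .item() returns a datetime.datetime instead.
s_value = np.datetime64(stamp, "s").item()

print(type(ns_value).__name__, ns_value)
print(type(s_value).__name__, s_value)

# The int serializes silently, which is why the type information is lost.
ns_json = json.dumps({"time": ns_value})

# The datetime is rejected by the standard JSON encoder.
try:
    json.dumps({"time": s_value})
    error_message = None
except TypeError as err:
    error_message = str(err)

print(ns_json)
print(error_message)
```
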
What did you expect to happen?
The time gets stored and read back properly.
Minimal Complete Verifiable Example
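import numpy as np
import xarray as xr

# Case 1: nanosecond precision. Writing succeeds, but the attribute
# is read back as a plain integer.
arr = xr.DataArray(
    np.arange(5),
    dims="x",
    attrs={"time": np.datetime64("now", "ns")},
)
print(arr)
arr.to_zarr("temp.zarr", mode="w")
print(xr.open_dataarray("temp.zarr", engine="zarr"))

# Case 2: second precision. Writing raises a TypeError during
# JSON serialization of the attributes.
arr = xr.DataArray(
    np.arange(5),
    dims="x",
    attrs={"time": np.datetime64("now", "s")},
)
print(arr)
arr.to_zarr("temp.zarr", mode="w")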
MVCE confirmation
- [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [X] Complete example — the example is self-contained, including all data and the text of any traceback.
- [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
- [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies.
Relevant log output
Traceback (most recent call last):
File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\xarray\backends\zarr.py", line 395, in _put_attrs
zarr_obj.attrs.put(attrs)
File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\zarr\attrs.py", line 124, in put
self._write_op(self._put_nosync, d)
File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\zarr\attrs.py", line 83, in _write_op
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\zarr\attrs.py", line 150, in _put_nosync
self.store[self.key] = json_dumps(d)
^^^^^^^^^^^^^
File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\zarr\util.py", line 69, in json_dumps
return json.dumps(
^^^^^^^^^^^
File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\json\__init__.py", line 238, in dumps
**kw).encode(obj)
^^^^^^^^^^^
File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\json\encoder.py", line 202, in encode
chunks = list(chunks)
^^^^^^^^^^^^
File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\json\encoder.py", line 432, in _iterencode
yield from _iterencode_dict(o, _current_indent_level)
File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\json\encoder.py", line 406, in _iterencode_dict
yield from chunks
File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\json\encoder.py", line 439, in _iterencode
o = _default(o)
^^^^^^^^^^^
File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\zarr\util.py", line 64, in default
return json.JSONEncoder.default(self, o)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\json\encoder.py", line 180, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type datetime is not JSON serializable
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\xarray\core\dataarray.py", line 4355, in to_zarr
return to_zarr( # type: ignore[call-overload,misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\xarray\backends\api.py", line 1784, in to_zarr
dump_to_store(dataset, zstore, writer, encoding=encoding)
File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\xarray\backends\api.py", line 1467, in dump_to_store
store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\xarray\backends\zarr.py", line 720, in store
self.set_variables(
File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\xarray\backends\zarr.py", line 831, in set_variables
zarr_array = _put_attrs(zarr_array, encoded_attrs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\xarray\backends\zarr.py", line 397, in _put_attrs
raise TypeError("Invalid attribute in Dataset.attrs.") from e
TypeError: Invalid attribute in Dataset.attrs.
Anything else we need to know?
No response
Environment
INSTALLED VERSIONS
commit: None
python: 3.12.6 | packaged by conda-forge | (main, Sep 30 2024, 17:48:58) [MSC v.1941 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('Swedish_Sweden', '1252')
libhdf5: None
libnetcdf: None

xarray: 2024.9.0
pandas: 2.2.3
numpy: 2.1.1
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
zarr: 2.18.3
cftime: None
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: None
pip: None
conda: None
pytest: None
mypy: None
IPython: None
sphinx: None
Do we want to have arbitrary Python objects stored in attrs? We serialize to JSON, so arguably we need to constrain ourselves to types that are JSON-compatible...
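For anyone hitting this today, one way to stay within JSON-compatible types is to store the timestamp as an ISO 8601 string and parse it back after loading. This is a minimal user-side sketch, not an xarray feature; it round-trips through the standard `json` module only, on the assumption that plain string attributes survive a zarr write and read unchanged.

```python
import json

import numpy as np

original = np.datetime64("2024-10-02T07:39:39", "ns")

# str() on a datetime64 scalar gives an ISO 8601 string, which is JSON-compatible.
attrs = {"time": str(original)}
payload = json.dumps(attrs)

# After loading, the caller parses the string back into a datetime64.
restored = np.datetime64(json.loads(payload)["time"])

print(payload)
print(restored == original)
```

The drawback is that the reader has to know which attributes hold timestamps, since nothing in the stored string marks it as one.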
The question is: would zarr be able to store datetimes without encoding? If so, I believe it may be possible to extend the zarr backend specifically to allow this (though I'm not sure whether that would make the encoding machinery too complicated).
We could of course serialize and deserialize into our own proprietary format, but I'm not sure what the interface would be.
In this case I was just wondering whether we can get away with not serializing datetimes at all (but only for the zarr backend, if the zarr format supports this).
I agree that serializing attributes might be useful (see the many CRS representations, for example) but potentially too complex at this point. Also, a custom format convention both means a lot of work and won't be compatible with other libraries, especially from other languages.
For datetime64 arrays in data_vars or coords, they get encoded as integers with a "custom" format, with e.g. "seconds since 1970-01-01" written to .zattrs/units. So in some sense there is already a format convention for datetimes; it's just not used for attrs.
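To make that idea concrete, here is a rough sketch of what applying an integer-plus-units convention to a single attribute could look like. The helper names `encode_time_attr` and `decode_time_attr`, the units string, and the `{"value": ..., "units": ...}` layout are all hypothetical and chosen only for illustration; none of this is an existing xarray or zarr API.

```python
import json

import numpy as np

# Hypothetical units string, written in the style used for time variables.
UNITS = "nanoseconds since 1970-01-01"


def encode_time_attr(value: np.datetime64) -> dict:
    """Encode a datetime64 scalar as a JSON-compatible dict (illustrative layout)."""
    ns = value.astype("datetime64[ns]").astype("int64")
    return {"value": int(ns), "units": UNITS}


def decode_time_attr(encoded: dict) -> np.datetime64:
    """Invert encode_time_attr; only the one units string above is understood."""
    if encoded["units"] != UNITS:
        raise ValueError(f"unsupported units: {encoded['units']!r}")
    return np.datetime64(encoded["value"], "ns")


original = np.datetime64("2024-10-02T07:39:39", "s")
encoded = encode_time_attr(original)

# Round-trip through JSON text, as an attribute store would.
stored = json.dumps({"time": encoded})
restored = decode_time_attr(json.loads(stored)["time"])

print(stored)
print(restored == original)
```

Note that in this sketch the value always comes back with nanosecond precision, so the original unit of the datetime64 is not preserved, and dates outside the range representable in nanoseconds would need a different units choice.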