compatibility with zarr dtypes refactor

Open d-v-b opened this issue 6 months ago • 1 comments

What is your issue?

This is an issue to track compatibility between xarray and the in-progress zarr-python data types refactoring effort.

We are working on a new data type model for zarr-python. Why? Zarr-python 2 used numpy dtypes internally, and zarr v2 (the format) also used the numpy data type model. Fitting the spec heavily to numpy proved problematic for zarr implementations in other languages.

Zarr v3 introduced a new data type model that looks much less like numpy dtypes. The v3 spec defines fewer dtypes than numpy supports, for example, and the v3 dtypes model doesn't track endianness. So we shipped zarr-python 3 with zarr v3 support for only the data types described in the zarr v3 spec, which left out some important numpy data types:

| type | string code | zarr v3 spec |
| --- | --- | --- |
| fixed-length ascii strings | S | PR |
| fixed-length unicode strings | U | PR |
| datetime64 | M | numpy.datetime64 |
| timedelta64 | m | numpy.timedelta64 |
| fixed-length raw bytes | V | None yet |
| structured data types | V | None yet |
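For readers less familiar with the numpy side, here is a minimal sketch of how the string codes in the table above map onto numpy's `dtype.kind` attribute, and of the byte-order information that numpy dtypes carry but the v3 dtype model does not track. It uses only numpy; the specific lengths and field names (`S4`, `V8`, `x`, `y`) are arbitrary choices for illustration, not anything taken from zarr-python.

```python
import numpy as np

# One example numpy dtype per row of the table; lengths and field names are illustrative.
examples = {
    "fixed-length ascii strings": np.dtype("S4"),
    "fixed-length unicode strings": np.dtype("U4"),
    "datetime64": np.dtype("datetime64[ns]"),
    "timedelta64": np.dtype("timedelta64[ns]"),
    "fixed-length raw bytes": np.dtype("V8"),
    "structured data types": np.dtype([("x", "i4"), ("y", "f8")]),
}

# dtype.kind is the one-character code shown in the "string code" column.
kinds = {name: dt.kind for name, dt in examples.items()}
for name, kind in kinds.items():
    print(f"{name}: {kind}")

# numpy distinguishes big-endian and little-endian variants of the same type.
big = np.dtype(">i4")
little = np.dtype("<i4")
print(big == little)
```

Raw bytes and structured dtypes share the kind code `V` in numpy, which is why both rows of the table list the same code.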

Support for these missing numpy data types is being added in this PR against zarr-python. It's turned into quite an effort. In parallel with the zarr-python implementation, we are also writing up language-agnostic specs for these data types, so that other zarr implementations can easily support them. See the third column of the table.

I opened a compatibility PR against xarray that sources zarr-python from the new dtypes branch. When the compatibility PR indicates that all tests are passing, and when we are satisfied that there are no remaining questions relating to the impact of zarr-python's new dtype model and xarray, then we can close this issue.

We are looking to release this functionality in zarr-python 3.1, but I can't give a timeline for that yet. Until then, I'm happy to answer any questions people have about this effort.

d-v-b avatar May 19 '25 08:05 d-v-b

haha I didn't see that @ianhi already had an issue open about this. happy to keep this open if anyone has high-level questions about the zarr-python changes, but we could also close this if devs think it's redundant.

d-v-b avatar May 19 '25 11:05 d-v-b

Thanks for this great update! When trying to get xarray 2025.7.1 working with Zarr 3.1.0 using zarr_format=3, we get various UnstableSpecificationWarnings (see below).

Am I right that these warnings are there as a caution notice until the Zarr extension for the types added in https://github.com/zarr-developers/zarr-python/pull/2874 is finalized, but that otherwise these types are expected to work without issues or warnings in the near-ish future?

Otherwise everything seems to be working smoothly when saving (to_zarr()) and opening (open()) relatively complex DataTrees.

Example warnings:

  • When we try to save "U#" type coordinates and data arrays, warnings like these are issued:
zarr.core.dtype.common.UnstableSpecificationWarning: The data type (FixedLengthUTF32(length=29, endianness='little')) 
does not have a Zarr V3 specification. That means that the representation of arrays saved with this data type may change 
without warning in a future version of Zarr Python. Arrays stored with this data type may be unreadable by other Zarr 
libraries. Use this data type at your own risk!
Check https://github.com/zarr-developers/zarr-extensions/tree/main/data-types for the status of data type specifications for 
Zarr V3.
  • If we convert those arrays to "S#" types, then we get warnings like:
UnstableSpecificationWarning: The data type (NullTerminatedBytes(length=2)) does not have a Zarr V3 specification. That 
means that the representation of arrays saved with this data type may change without warning in a future version of Zarr 
Python. Arrays stored with this data type may be unreadable by other Zarr libraries. Use this data type at your own risk! 
Check https://github.com/zarr-developers/zarr-extensions/tree/main/data-types for the status of data type specifications for 
Zarr V3.
  • If as an alternative we try to convert them to np.dtypes.StringDType, we'd still get:
StringDType:
UserWarning: The codec `vlen-utf8` is currently not part in the Zarr format 3 specification. It may not be supported by other 
zarr implementations and may change in the future.
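Until specs for these data types land, one possible way to quiet the noise is Python's standard `warnings` filters. The sketch below is an assumption-laden illustration, not a zarr-python recommendation: it matches on the message text quoted above rather than importing the warning class, and `emit_fake_warning` is a hypothetical stand-in for the call that would actually trigger the warning (for example a `to_zarr()` call). Silencing the warning does not change the underlying caveat that other Zarr libraries may not be able to read these arrays.

```python
import warnings

# Assumption: filter on the message text quoted in the warnings above.
# The regex is matched against the start of the warning message.
ZARR_UNSTABLE_SPEC_PATTERN = r".*does not have a Zarr V3 specification"


def emit_fake_warning():
    # Hypothetical stand-in for the real call that triggers the warning.
    warnings.warn(
        "The data type (FixedLengthUTF32(length=29, endianness='little')) "
        "does not have a Zarr V3 specification.",
        UserWarning,
    )


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Ignore only the unstable-specification message; other warnings still surface.
    warnings.filterwarnings("ignore", message=ZARR_UNSTABLE_SPEC_PATTERN)
    emit_fake_warning()
    warnings.warn("unrelated warning", UserWarning)

remaining = [str(w.message) for w in caught]
print(remaining)
```

Scoping the filter inside `warnings.catch_warnings()` keeps it local to the block, so the warning still appears elsewhere in the program.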

FedeMPouzols avatar Jul 17 '25 17:07 FedeMPouzols

yes, those warnings are extremely annoying. the way we can make them go away is by adding specs for the zarr equivalent of the numpy U and S dtypes over in the zarr-extensions repo. For example, I think we can remove the warning about the vlen-utf8 codec, because there's a published spec for that. I will open a PR that removes the warning.

d-v-b avatar Jul 17 '25 17:07 d-v-b