Can't create big endian dtypes in V3 array
This works with V2 data:

```python
import zarr

zarr.create(shape=10, dtype=">i2", zarr_version=2)
# -> <Array memory://4413530368 shape=(10,) dtype=>i2>
```

But raises a `KeyError` for V3:

```python
zarr.create(shape=10, dtype=">i2", zarr_version=3)
```
```pytb
File ~/gh/zarr-developers/zarr-python/src/zarr/codecs/__init__.py:40, in _get_default_array_bytes_codec(np_dtype)
     37 def _get_default_array_bytes_codec(
     38     np_dtype: np.dtype[Any],
     39 ) -> BytesCodec | VLenUTF8Codec | VLenBytesCodec:
---> 40     dtype = DataType.from_numpy(np_dtype)
     41     if dtype == DataType.string:
     42         return VLenUTF8Codec()

File ~/gh/zarr-developers/zarr-python/src/zarr/core/metadata/v3.py:599, in DataType.from_numpy(cls, dtype)
    581     return DataType.bytes
    582 dtype_to_data_type = {
    583     "|b1": "bool",
    584     "bool": "bool",
   (...)
    597     "<c16": "complex128",
    598 }
--> 599 return DataType[dtype_to_data_type[dtype.str]]

KeyError: '>i2'
```
In the V3 spec, endianness is now handled by a codec: https://zarr-specs.readthedocs.io/en/latest/v3/codecs/bytes/v1.0.html
Xarray tests create data with big endian dtypes, and Zarr needs to know how to handle them.
If the codecs are unspecified, then I think we could automatically parametrize the BytesCodec based on the dtype. If the codecs are specified and the BytesCodec endianness doesn't match the endianness of the data, then we raise an exception.
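As a rough sketch of that proposal (not zarr's actual implementation; `infer_endianness` and `check_codec_endianness` are hypothetical helper names), the dtype's byte order could be mapped to a `BytesCodec`-style endian value, and a mismatch with a user-specified codec could raise:

```python
import sys
import numpy as np

def infer_endianness(dtype: np.dtype) -> str:
    """Map a numpy dtype's byte order to a codec endian value.
    '=' (native) and '|' (not applicable) fall back to the platform order."""
    order = dtype.byteorder
    if order == ">":
        return "big"
    if order == "<":
        return "little"
    return sys.byteorder  # 'little' or 'big'

def check_codec_endianness(dtype: np.dtype, codec_endian: str) -> None:
    """Hypothetical validation: reject a codec whose endianness
    contradicts the requested dtype."""
    expected = infer_endianness(dtype)
    if codec_endian != expected:
        raise ValueError(
            f"codec endian={codec_endian!r} does not match dtype {dtype.str!r}"
        )

check_codec_endianness(np.dtype(">i2"), "big")  # ok
# check_codec_endianness(np.dtype(">i2"), "little")  # would raise ValueError
```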
But a bigger problem is that, by making endianness a serialization detail, the zarr dtype model has diverged from the numpy dtype model. If our array object uses zarr v3 data type semantics, then `zarr.create(..., dtype=">i2")` will return an array with dtype `<i2` plus a special bytes codec. From the point of view of code that consumes zarr arrays as array-likes, this zarr array will not have its "real" dtype; users might be surprised to see that `zarr.create(..., dtype=">i2")` and `zarr.create(..., dtype="<i2")` return arrays with the same dtype. I don't see an easy solution to this.
One solution could be to always translate the endianness of the on-disk data to the endianness of the in-memory data. This could be done within BytesCodec. However, it would be hard, since endianness is not part of ArraySpec.
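To illustrate that idea in isolation (this is plain numpy, not zarr's `BytesCodec`), a decode step could byte-swap stored data into the platform's native order:

```python
import numpy as np

def decode_to_native(raw: bytes, stored_dtype: np.dtype) -> np.ndarray:
    """Hypothetical decode step: interpret raw bytes using the on-disk
    byte order, then convert to the platform's native order."""
    arr = np.frombuffer(raw, dtype=stored_dtype)
    return arr.astype(stored_dtype.newbyteorder("="))

# round-trip: big-endian bytes come back as native-order values
raw = np.arange(4, dtype=">i2").tobytes()
out = decode_to_native(raw, np.dtype(">i2"))
```

The hard part the comment above points at is that the stored byte order would have to reach this function somehow, and `ArraySpec` does not carry it.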
Looks like this either needs resolving, or documenting as a breaking change at https://github.com/zarr-developers/zarr-python/pull/2596 for zarr 3
Should we put endianness in the new runtime ArrayConfig? We could parse the dtype to set it.
I've moved this to "After 3.0.0" and will be adding this to the work in progress section of the v3 migration docs.
I'm running into this too - just to check, is this something that is going to be fixed in the 3.0.x series of releases, or is it a breaking change that will not be changed that we should adjust existing code to?
I think we intend to fix this, but it will force us to revise the semantics of the Array.dtype attribute. The alternative to handling endianness the way users expect is unacceptable IMO.
this will be fixed in #2874
this was resolved by #2874:
```python
# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "zarr @ git+https://github.com/zarr-developers/zarr-python.git@27615fd0",
# ]
# ///
import zarr

store = {}
z_write = zarr.create_array(store=store, shape=10, dtype=">i2", zarr_format=3)
print(z_write.metadata.data_type)
# Int16(endianness='big')
print(z_write.dtype)
# >i2

# The data type will change to platform endianness when reading the array.
z_read = zarr.open_array(store=store, mode="r", zarr_format=3)
print(z_read.metadata.data_type)
# Int16(endianness='little')
print(z_read.dtype)
# int16
```
It might be surprising that the data type's endianness differs when you read the array back. This is because the zarr v3 spec does not define the endianness of decoded arrays, so there is currently no place in the metadata to store this information.
If this is a problem, we could consider changes at the spec level, or in zarr-python.
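In the meantime, callers who need the original byte order in memory can convert the decoded array explicitly with plain numpy (a sketch; `native` stands in for the result of `z_read[:]`):

```python
import numpy as np

# data read from a v3 array arrives in platform byte order; convert back
# explicitly if downstream code requires the original big-endian layout
native = np.arange(5, dtype=np.int16)  # stands in for z_read[:]
big = native.astype(">i2")
```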
@astrofrog @rabernat does this solve your problem?
I'll close since this seems to be fixed - feel free to open a new issue or re-open this one if it's not fixed.