
Can't create big endian dtypes in V3 array

Open rabernat opened this issue 1 year ago • 10 comments

This works with V2 data:

zarr.create(shape=10, dtype=">i2", zarr_version=2)
# -> <Array memory://4413530368 shape=(10,) dtype=>i2>

But it raises for V3:

zarr.create(shape=10, dtype=">i2", zarr_version=3)
File ~/gh/zarr-developers/zarr-python/src/zarr/codecs/__init__.py:40, in _get_default_array_bytes_codec(np_dtype)
     37 def _get_default_array_bytes_codec(
     38     np_dtype: np.dtype[Any],
     39 ) -> BytesCodec | VLenUTF8Codec | VLenBytesCodec:
---> 40     dtype = DataType.from_numpy(np_dtype)
     41     if dtype == DataType.string:
     42         return VLenUTF8Codec()

File ~/gh/zarr-developers/zarr-python/src/zarr/core/metadata/v3.py:599, in DataType.from_numpy(cls, dtype)
    581     return DataType.bytes
    582 dtype_to_data_type = {
    583     "|b1": "bool",
    584     "bool": "bool",
   (...)
    597     "<c16": "complex128",
    598 }
--> 599 return DataType[dtype_to_data_type[dtype.str]]

KeyError: '>i2'

In the V3 spec, endianness is now handled by a codec: https://zarr-specs.readthedocs.io/en/latest/v3/codecs/bytes/v1.0.html
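As a rough illustration of what "endianness handled by a codec" means, the sketch below uses only the Python standard library to show that byte order changes the encoded bytes but not the decoded values. It is not zarr's implementation; `encode_int16` and `decode_int16` are hypothetical helper names introduced for this example.

```python
import struct


def _prefix(endian):
    """Map an endianness name to the struct-module byte-order prefix."""
    if endian == "big":
        return ">"
    if endian == "little":
        return "<"
    raise ValueError(f"endian must be 'big' or 'little', got {endian!r}")


def encode_int16(values, endian):
    """Serialize signed 16-bit integers using the requested byte order."""
    return struct.pack(f"{_prefix(endian)}{len(values)}h", *values)


def decode_int16(buf, endian):
    """Deserialize signed 16-bit integers; the byte order must match the encoder."""
    return list(struct.unpack(f"{_prefix(endian)}{len(buf) // 2}h", buf))


values = [1, 258, -2]
big = encode_int16(values, "big")
little = encode_int16(values, "little")

# The logical values round-trip either way; only the stored bytes differ.
roundtrip_big = decode_int16(big, "big")
roundtrip_little = decode_int16(little, "little")
```

Under this view, the data type only says "16-bit signed integer", and the byte order is a parameter of the encoding step.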

Xarray tests create data with big endian dtypes, and Zarr needs to know how to handle them.

rabernat avatar Oct 09 '24 14:10 rabernat

If the codecs are unspecified, then I think we could automatically parametrize the BytesCodec based on the dtype. If the codecs are specified and the BytesCodec endianness doesn't match the endianness of the data, then we raise an exception.
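The rule proposed above could be sketched roughly as follows. This is an assumption-laden illustration in plain NumPy, not zarr-python code: `dtype_endianness` and `resolve_bytes_endian` are hypothetical names, and the real `BytesCodec` is not touched here.

```python
import sys

import numpy as np


def dtype_endianness(dtype):
    """Return "big", "little", or None (byte order not applicable) for a NumPy dtype."""
    dtype = np.dtype(dtype)
    if dtype.byteorder == "|":  # single-byte types such as bool or uint8
        return None
    if dtype.byteorder == "=":  # native order: ask the platform
        return sys.byteorder
    return "big" if dtype.byteorder == ">" else "little"


def resolve_bytes_endian(dtype, codec_endian=None):
    """Infer the codec endianness from the dtype, or validate an explicit one."""
    inferred = dtype_endianness(dtype)
    if codec_endian is None:
        # Codecs unspecified: parametrize from the dtype.
        return inferred
    if inferred is not None and codec_endian != inferred:
        # Codecs specified but inconsistent with the requested dtype.
        raise ValueError(
            f"codec endianness {codec_endian!r} does not match dtype "
            f"{np.dtype(dtype).str!r} ({inferred!r})"
        )
    return codec_endian
```

For example, `resolve_bytes_endian(">i2")` would select big-endian encoding, while `resolve_bytes_endian(">i2", codec_endian="little")` would raise.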

But a bigger problem is that, by making endianness a serialization detail, the zarr dtype model has diverged from the numpy dtype model. If our array object uses zarr v3 data type semantics, then zarr.create(..., dtype=">i2") will return an array with dtype <i2 plus a special bytes codec. From the POV of functions like np.array_like, this zarr array will not have its "real" dtype; users might be surprised to see that zarr.create(..., dtype=">i2") and zarr.create(..., dtype="<i2") return arrays with the same dtype. I don't see an easy solution to this.
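The divergence can be seen in NumPy alone. The short sketch below (plain NumPy, no zarr involved) shows that NumPy treats byte order as part of the dtype, so two arrays with equal values can still have unequal dtypes.

```python
import numpy as np

big = np.arange(3, dtype=">i2")
little = np.arange(3, dtype="<i2")

# The values compare equal element by element...
same_values = np.array_equal(big, little)

# ...but NumPy considers the dtypes different, and the raw bytes differ too.
same_dtype = big.dtype == little.dtype
same_bytes = big.tobytes() == little.tobytes()

# Dropping the byte order (here by forcing little-endian) makes the dtypes equal,
# which is the information a byte-order-free data type model discards.
normalized_equal = big.dtype.newbyteorder("<") == little.dtype
```

A data type model without byte order cannot distinguish `big.dtype` from `little.dtype`, which is where the surprise described above comes from.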

d-v-b avatar Oct 09 '24 15:10 d-v-b

One solution could be to always translate the endianness of the on-disk data to the endianness of the in-memory data. This could be done within BytesCodec. However, it would be hard, since endianness is not part of ArraySpec.
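A minimal sketch of that translation in plain NumPy, assuming the stored byte order is known at decode time; `decode_to_native` is a hypothetical helper, not part of `BytesCodec`:

```python
import numpy as np


def decode_to_native(buf, stored_dtype):
    """Interpret `buf` using the stored dtype, then convert to native byte order."""
    stored = np.frombuffer(buf, dtype=np.dtype(stored_dtype))
    native_dtype = stored.dtype.newbyteorder("=")
    # astype byte-swaps when needed, so values are preserved and the
    # result always uses the platform's native byte order.
    return stored.astype(native_dtype)


on_disk = np.array([1, 2, 3], dtype=">i2").tobytes()
in_memory = decode_to_native(on_disk, ">i2")
```

The cost of this approach is that the byte order requested at creation time is no longer visible on the in-memory array.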

rabernat avatar Oct 12 '24 12:10 rabernat

Looks like this either needs resolving, or documenting as a breaking change at https://github.com/zarr-developers/zarr-python/pull/2596 for zarr 3

dstansby avatar Dec 30 '24 17:12 dstansby

Should we put endianness in the new runtime ArrayConfig? We could parse the dtype to set it.
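One way to picture this suggestion: split the requested dtype into a byte-order-free type name plus a runtime setting. Everything below is hypothetical and for illustration only; `RuntimeArrayConfig` and `split_dtype` are invented names and do not describe the actual `ArrayConfig` class.

```python
import sys
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass(frozen=True)
class RuntimeArrayConfig:
    """Hypothetical runtime-only settings that are not written to metadata."""

    endian: Optional[str]  # "big", "little", or None when not applicable


def split_dtype(dtype):
    """Parse a dtype into (byte-order-free type name, runtime config)."""
    dt = np.dtype(dtype)
    if dt.byteorder == "|":
        endian = None
    elif dt.byteorder == "=":
        endian = sys.byteorder
    else:
        endian = "big" if dt.byteorder == ">" else "little"
    # dt.name omits the byte order, e.g. "int16" for both ">i2" and "<i2".
    return dt.name, RuntimeArrayConfig(endian=endian)


name, config = split_dtype(">i2")
```

Here `name` carries what the stored metadata would describe, and `config.endian` carries what the user asked for in memory.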

normanrz avatar Jan 07 '25 18:01 normanrz

I've moved this to "After 3.0.0" and will be adding this to the work in progress section of the v3 migration docs.

jhamman avatar Jan 08 '25 00:01 jhamman

I'm running into this too - just to check, is this something that is going to be fixed in the 3.0.x series of releases, or is it a breaking change that will not be changed that we should adjust existing code to?

astrofrog avatar Feb 01 '25 14:02 astrofrog

I think we intend to fix this, but it will force us to revise the semantics of the Array.dtype attribute. The alternative to handling endianness the way users expect is unacceptable IMO.

d-v-b avatar Feb 01 '25 14:02 d-v-b

this will be fixed in #2874

d-v-b avatar Mar 24 '25 16:03 d-v-b

this was resolved by #2874:

# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "zarr @ git+https://github.com/zarr-developers/zarr-python.git@27615fd0",
# ]
# ///

import zarr

store = {}

z_write = zarr.create_array(store=store, shape=10, dtype=">i2", zarr_format=3)

print(z_write.metadata.data_type)
# Int16(endianness='big')

print(z_write.dtype)
# >i2

# The data type will change to platform endianness when reading the array.
z_read = zarr.open_array(store=store, mode="r", zarr_format=3)

print(z_read.metadata.data_type)
# Int16(endianness='little')

print(z_read.dtype)
# int16

It might be surprising that the endianness of the data type is not the same when you read the array again. This is because the zarr v3 spec does not define the endianness of decoded arrays, so there is currently no place to store this information in the metadata.

If this is a problem, we could consider changes at the spec level, or in zarr-python.
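For code that depends on a specific in-memory byte order, one possible caller-side workaround is to convert explicitly after reading. The sketch below uses plain NumPy with a stand-in array in place of an actual read; `restore_byte_order` is a hypothetical helper, not a zarr-python function.

```python
import numpy as np


def restore_byte_order(data, dtype):
    """Return `data` as `dtype`, copying only when the dtypes differ."""
    return np.asarray(data).astype(np.dtype(dtype), copy=False)


# Stand-in for data read back from an array: native-endian int16 values.
read_back = np.arange(10, dtype="int16")

# Re-apply the byte order that was requested when the array was created.
big = restore_byte_order(read_back, ">i2")
```

This requires the caller to remember the intended byte order, since it is not recoverable from the stored metadata.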

d-v-b avatar Jun 26 '25 15:06 d-v-b

@astrofrog @rabernat does this solve your problem?

d-v-b avatar Jun 26 '25 15:06 d-v-b

I'll close since this seems to be fixed - feel free to open a new issue or re-open this one if it's not fixed.

dstansby avatar Aug 05 '25 09:08 dstansby