Subarray dtypes get lost on serialization / cast to void type
Zarr version
v3.1.3
Numcodecs version
v0.15.1
Python Version
3.12.10
Operating System
Linux
Installation
uv / pip
Description
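Subarray dtypes are not serialized correctly: a subarray field is cast to raw bytes (a void dtype) on serialization. Subsequent access therefore does not yield arrays with the proper shape or element type. For example, a field declared as ('b', '<f4', (5, 5)) comes back as ('b', 'V100').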
Output:
zarr/core/dtype/npy/structured.py:318: UnstableSpecificationWarning: The data type (Structured(fields=(('a', Int32(endianness='little')), ('b', RawBytes(length=100))))) does not have a Zarr V3 specification. That means that the representation of arrays saved with this data type may change without warning in a future version of Zarr Python. Arrays stored with this data type may be unreadable by other Zarr libraries. Use this data type at your own risk! Check https://github.com/zarr-developers/zarr-extensions/tree/main/data-types for the status of data type specifications for Zarr V3.
v3_unstable_dtype_warning(self)
zarr/core/dtype/npy/bytes.py:785: UnstableSpecificationWarning: The data type (RawBytes(length=100)) does not have a Zarr V3 specification. That means that the representation of arrays saved with this data type may change without warning in a future version of Zarr Python. Arrays stored with this data type may be unreadable by other Zarr libraries. Use this data type at your own risk! Check https://github.com/zarr-developers/zarr-extensions/tree/main/data-types for the status of data type specifications for Zarr V3.
v3_unstable_dtype_warning(self)
Original dtype: [('a', '<i4'), ('b', '<f4', (5, 5))]
Array created with dtype: [('a', '<i4'), ('b', 'V100')]
Accessed item dtype: [('a', '<i4'), ('b', 'V100')]
zarr.json:
{
"shape": [
10
],
"data_type": {
"name": "structured",
"configuration": {
"fields": [
[
"a",
"int32"
],
[
"b",
{
"name": "raw_bytes",
"configuration": {
"length_bytes": 100
}
}
]
]
}
},
"chunk_grid": {
"name": "regular",
"configuration": {
"chunk_shape": [
10
]
}
},
"chunk_key_encoding": {
"name": "default",
"configuration": {
"separator": "/"
}
},
"fill_value": "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=",
"codecs": [
{
"name": "bytes"
},
{
"name": "zstd",
"configuration": {
"level": 0,
"checksum": false
}
}
],
"attributes": {},
"zarr_format": 3,
"node_type": "array",
"storage_transformers": []
}
Steps to reproduce
# /// script
# requires-python = ">=3.11"
# dependencies = [
# "zarr@git+https://github.com/zarr-developers/zarr-python.git@main",
# ]
# ///
#
# This script automatically imports the development branch of zarr to check for issues
import zarr
from zarr.storage import LocalStore
import numpy as np
# your reproducer code
# zarr.print_debug_info()
DTYPE = np.dtype([('a', 'i4'), ('b', 'f4', (5,5))])
store = LocalStore('bug.zarr')
arr = zarr.create_array(store, name='test', shape=(10,), dtype=DTYPE, fill_value=bytes(DTYPE.itemsize))
print('Original dtype:', DTYPE)
print('Array created with dtype:', arr.dtype)
print('Accessed item dtype: ', arr[0].dtype)
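A possible stopgap, sketched with NumPy only: since the original and the degraded dtype describe the same 104 bytes per record, data read back with the void field can be reinterpreted with the original dtype via `ndarray.view`. The names `ORIGINAL`, `DEGRADED`, `expected`, and `read_back` are illustrative, and `read_back` is only a stand-in for what the Zarr array returns; this has not been checked against Zarr itself.

```python
import numpy as np

# The dtype from the reproducer, and the dtype the array reports after creation.
ORIGINAL = np.dtype([('a', 'i4'), ('b', 'f4', (5, 5))])
DEGRADED = np.dtype([('a', 'i4'), ('b', 'V100')])

# Both layouts occupy the same number of bytes per record,
# which is what makes a zero-copy reinterpretation possible.
assert ORIGINAL.itemsize == DEGRADED.itemsize == 104

# Stand-in for data read back from the Zarr array (assumption: the read
# returns a NumPy array with the degraded dtype, as the output suggests).
expected = np.zeros(10, dtype=ORIGINAL)
expected['a'] = np.arange(10)
expected['b'] = np.arange(25, dtype='f4').reshape(5, 5)
read_back = expected.view(DEGRADED)

# Reinterpret the raw bytes with the original dtype; no data is copied.
restored = read_back.view(ORIGINAL)
print(restored['b'].shape)  # (10, 5, 5)
```

This only works if the caller still knows the original dtype, because the stored metadata no longer records the subarray shape or element type.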
Additional output
No response
I think a fundamental issue here is that the structured data type is parameterized by its inner data types, and we don't have an inner data type that can express "a 5x5 array of 32-bit floats". This wasn't a problem for Zarr Python 2.x because it only supported Zarr V2, which uses NumPy's data type model wholesale. But Zarr Python 3.x has to support Zarr V2 and V3 arrays with the same data type classes, so to resolve this issue we need a Zarr V3 data type that can express fixed-size arrays of other data types.
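To make the missing piece concrete, here is a small NumPy-only sketch of what the subarray field carries on the NumPy side: a base dtype plus a fixed shape, exposed through `dtype.subdtype`. Any Zarr V3 data type for fixed-size arrays would presumably need to record these two pieces of information; the variable names are illustrative and nothing here is an existing Zarr API.

```python
import numpy as np

DTYPE = np.dtype([('a', 'i4'), ('b', 'f4', (5, 5))])

# dtype.fields maps each field name to (field dtype, byte offset).
field_dtype, offset = DTYPE.fields['b']

# For a subarray field, .subdtype is (base dtype, shape); it is None otherwise.
base, shape = field_dtype.subdtype
print(base, shape)                    # float32 (5, 5)
print(field_dtype.itemsize)           # 100
print(DTYPE.fields['a'][0].subdtype)  # None
```

The 100-byte item size is all that survives in the `raw_bytes` entry of `zarr.json`; the base dtype and the `(5, 5)` shape are what get dropped.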
@rabernat would the arrow prototype you are working on be helpful here?