Cannot create metadata for structured dtype containing subarray dtype
Zarr version
v3.1.3
Numcodecs version
Python Version
3.12
Operating System
Linux
Installation
uv
Description
Output:
Traceback (most recent call last):
File "bug.py", line 19, in <module>
arr = zarr.create_array(store, name='test', shape=(10,), dtype=DTYPE)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "zarr/api/synchronous.py", line 962, in create_array
sync(
File "zarr/core/sync.py", line 159, in sync
raise return_result
File "zarr/core/sync.py", line 119, in _runner
return await coro
^^^^^^^^^^
File "zarr/core/array.py", line 4933, in create_array
return await init_array(
^^^^^^^^^^^^^^^^^
File "zarr/core/array.py", line 4747, in init_array
meta = AsyncArray._create_metadata_v3(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "zarr/core/array.py", line 772, in _create_metadata_v3
fill_value_parsed = dtype.default_scalar()
^^^^^^^^^^^^^^^^^^^^^^
File "zarr/core/dtype/npy/structured.py", line 419, in default_scalar
return self._cast_scalar_unchecked(0)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "zarr/core/dtype/npy/structured.py", line 373, in _cast_scalar_unchecked
res = np.array([data], dtype=na_dtype)[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: a bytes-like object is required, not 'int'
Case #2: fill_value=(0, np.nan)
File "zarr/core/dtype/npy/structured.py", line 371, in _cast_scalar_unchecked
res = np.array([tuple(data)], dtype=na_dtype)[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: a bytes-like object is required, not 'float'
Case #3: fill_value=0
TypeError: a bytes-like object is required, not 'int'
Case #4: fill_value=None
TypeError: a bytes-like object is required, not 'int'
Case #5: fill_value={}
File "zarr/core/metadata/v3.py", line 236, in __init__
fill_value_parsed = data_type.cast_scalar(fill_value)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "zarr/core/dtype/npy/structured.py", line 406, in cast_scalar
raise TypeError(msg)
TypeError: Cannot convert object {} with type <class 'dict'> to a scalar compatible with the data type Structured(fields=(('a', Int32(endianness='little')), ('b', RawBytes(length=1600)))).
Case #6: fill_value={'a': 1, 'b': np.nan}
This was possible in zarr v2.x.
File "zarr/core/dtype/npy/structured.py", line 406, in cast_scalar
raise TypeError(msg)
TypeError: Cannot convert object {'a': 1, 'b': nan} with type <class 'dict'> to a scalar compatible with the data type Structured(fields=(('a', Int32(endianness='little')), ('b', RawBytes(length=1600)))).
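The `RawBytes(length=1600)` in the error messages above suggests that the subarray field `b` is being treated as an opaque 1600-byte blob rather than as a 20x20 block of `float32`. The following NumPy-only sketch shows where that number comes from; it does not touch zarr, and the link to zarr's internal handling is an assumption based on the error text.

```python
import numpy as np

DTYPE = np.dtype([("a", "i4"), ("b", "f4", (20, 20))])

# The dtype of field "b" is a subarray dtype: a base dtype plus a shape.
field_b = DTYPE.fields["b"][0]
base, shape = field_b.subdtype

print(base)              # float32
print(shape)             # (20, 20)
print(field_b.kind)      # 'V' (void), the same kind NumPy uses for raw bytes
print(field_b.itemsize)  # 1600 == 20 * 20 * 4 bytes
```

Because a subarray dtype reports kind `'V'`, code that only inspects `kind` and `itemsize` cannot tell it apart from a plain fixed-length bytes field, which would be consistent with the "a bytes-like object is required" errors.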
Steps to reproduce
# /// script
# requires-python = ">=3.11"
# dependencies = [
# "zarr@git+https://github.com/zarr-developers/zarr-python.git@main",
# ]
# ///
#
# This script automatically imports the development branch of zarr to check for issues
import zarr
from zarr.storage import LocalStore
import numpy as np
DTYPE = np.dtype([('a', 'i4'), ('b', 'f4', (20,20))])
store = LocalStore('bug.zarr')
arr = zarr.create_array(store, name='test', shape=(10,), dtype=DTYPE)
Expected behavior: same as np.empty (preferred) or np.zeros when using a structured dtype
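For comparison, this is the NumPy behavior referred to as the expected result. It is a plain NumPy sketch with the same dtype as the reproducer and makes no claim about how zarr should implement it.

```python
import numpy as np

DTYPE = np.dtype([("a", "i4"), ("b", "f4", (20, 20))])

# np.zeros fills every field, including the subarray field, with zeros.
zeros = np.zeros((10,), dtype=DTYPE)
print(zeros.shape)       # (10,)
print(zeros["a"].shape)  # (10,)
print(zeros["b"].shape)  # (10, 20, 20)

# np.empty allocates the same layout without initializing the contents.
empty = np.empty((10,), dtype=DTYPE)
print(empty.dtype == DTYPE)  # True
```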
Additional output
No response
Thanks for this report! I'd love to reach feature parity with zarr-python 2.x here. That being said, I'm not sure when I will have time to look into this, so the fastest solution might be to dig into the code yourself. I will definitely review any fix quickly!
@d-v-b Thanks for the quick reply. I would gladly help on this issue, as structured dtypes are high priority for our data handling (we need to keep the inode count low). Do you have any pointers on where to get started, since I am not too familiar with the internal details of zarr?
For a primer on our internal model of data types, I would look at this complete example that demonstrates how to create a new custom data type from scratch. And please let us know if anything about that example can be improved. That basically shows all the moving parts of the data type interface.
Next, you would need to figure out the best way to model the missing data type (an N-dimensional array of other data types) as a Zarr data type. If you create such a data type and register it, the structured dtype should just work. But let us know if it doesn't!
For reference purposes, this is the .zarray metadata created with v2.18.5 for the following test code:
dtype = np.dtype([('a', 'f4', (2, 2)), ('b', 'i4')])
arr = zarr.create(store='dtype.zarr', shape=(10,), dtype=dtype, zarr_version=2, fill_value=0)
{
"chunks": [
10
],
"compressor": {
"blocksize": 0,
"clevel": 5,
"cname": "lz4",
"id": "blosc",
"shuffle": 1
},
"dtype": [
[
"a",
"<f4",
[
2,
2
]
],
[
"b",
"<i4"
]
],
"fill_value": "AAAAAAAAAAAAAAAAAAAAAAAAAAA=",
"filters": null,
"order": "C",
"shape": [
10
],
"zarr_format": 2
}
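The `fill_value` string in this metadata is consistent with the Zarr v2 convention of base64-encoding the raw bytes of a structured fill value. A minimal sketch, using only NumPy and the standard library, that rebuilds the encoding of an all-zero scalar for the same dtype:

```python
import base64

import numpy as np

dtype = np.dtype([("a", "f4", (2, 2)), ("b", "i4")])

# One all-zero scalar of the structured dtype: 16 bytes for "a" + 4 for "b".
zero_scalar = np.zeros((), dtype=dtype)
print(dtype.itemsize)  # 20

# Base64-encode the raw bytes, as the v2 metadata does for structured fills.
encoded = base64.standard_b64encode(zero_scalar.tobytes()).decode("ascii")
print(encoded)

# Decoding recovers a scalar whose subarray field keeps its (2, 2) shape.
decoded = np.frombuffer(base64.standard_b64decode(encoded), dtype=dtype)[0]
print(decoded["a"].shape)  # (2, 2)
```

So passing `fill_value=0` in v2.18.5 appears to have been broadcast across every field, including each element of the subarray.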
My ZDType implementation is mostly done, but backporting it for v2 metadata support is a bit tricky. In particular, if I am not mistaken, the metadata JSON above is not really supported at the moment: Structured assumes that the second element of each field entry is either a string or a nested Structured dtype.
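To illustrate the shape of the problem, here is a rough sketch of turning the v2 `dtype` JSON, including the optional third element that carries the subarray shape, into a NumPy dtype. The helper name `v2_json_to_numpy_descr` is hypothetical and is not part of zarr; it only shows the parsing logic under the assumption that each field entry is `[name, dtype]` or `[name, dtype, shape]`.

```python
import numpy as np


def v2_json_to_numpy_descr(spec):
    """Hypothetical helper: convert a v2 JSON dtype spec to a NumPy descr."""
    if isinstance(spec, str):
        # Plain dtype string such as "<f4".
        return spec
    fields = []
    for name, base, *shape in spec:
        # The base may itself be a nested list of fields.
        descr = v2_json_to_numpy_descr(base)
        if shape:
            # Optional third element: the subarray shape, e.g. [2, 2].
            fields.append((name, descr, tuple(shape[0])))
        else:
            fields.append((name, descr))
    return fields


v2_dtype_json = [["a", "<f4", [2, 2]], ["b", "<i4"]]
dtype = np.dtype(v2_json_to_numpy_descr(v2_dtype_json))
print(dtype)

# Nested structured fields with a subarray shape follow the same pattern.
nested = np.dtype(v2_json_to_numpy_descr([["x", [["y", "<i4"]], [3]]]))
print(nested.itemsize)  # 12
```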
Using a subarray dtype directly in v2.18.5 produces a "flattened" zarr array:
dtype = np.dtype(('f4', (2, 2)))
arr = zarr.create(store='dtype.zarr', shape=(10,), dtype=dtype, zarr_version=2, fill_value=0)
{
"chunks": [
10,
2,
2
],
"compressor": {
"blocksize": 0,
"clevel": 5,
"cname": "lz4",
"id": "blosc",
"shuffle": 1
},
"dtype": "<f4",
"fill_value": 0.0,
"filters": null,
"order": "C",
"shape": [
10,
2,
2
],
"zarr_format": 2
}
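This flattening matches what NumPy itself does when a bare subarray dtype is used to create an array: the subarray shape is appended to the array shape and the dtype collapses to the base type. A short NumPy-only sketch of the difference between the bare and the structured case:

```python
import numpy as np

sub = np.dtype(("f4", (2, 2)))

# Bare subarray dtype: the (2, 2) shape is absorbed into the array shape.
arr = np.zeros(10, dtype=sub)
print(arr.shape)  # (10, 2, 2)
print(arr.dtype)  # float32

# Inside a structured dtype the subarray stays part of each element.
structured = np.zeros(10, dtype=np.dtype([("a", sub)]))
print(structured.shape)       # (10,)
print(structured["a"].shape)  # (10, 2, 2)
```

The subarray shape therefore only survives as dtype information when it is a field of a structured dtype, which is the case the v2 metadata above has to encode.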