zarr-python

Can not create Metadata for structured dtype containing subarray dtype

Open sehoffmann opened this issue 1 month ago • 4 comments

Zarr version

v3.1.3

Numcodecs version

Python Version

3.12

Operating System

Linux

Installation

uv

Description

Output:

 Traceback (most recent call last):
  File "bug.py", line 19, in <module>
    arr = zarr.create_array(store, name='test', shape=(10,), dtype=DTYPE)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "zarr/api/synchronous.py", line 962, in create_array
    sync(
  File "zarr/core/sync.py", line 159, in sync
    raise return_result
  File "zarr/core/sync.py", line 119, in _runner
    return await coro
           ^^^^^^^^^^
  File "zarr/core/array.py", line 4933, in create_array
    return await init_array(
           ^^^^^^^^^^^^^^^^^
  File "zarr/core/array.py", line 4747, in init_array
    meta = AsyncArray._create_metadata_v3(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "zarr/core/array.py", line 772, in _create_metadata_v3
    fill_value_parsed = dtype.default_scalar()
                        ^^^^^^^^^^^^^^^^^^^^^^
  File "zarr/core/dtype/npy/structured.py", line 419, in default_scalar
    return self._cast_scalar_unchecked(0)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "zarr/core/dtype/npy/structured.py", line 373, in _cast_scalar_unchecked
    res = np.array([data], dtype=na_dtype)[0]
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: a bytes-like object is required, not 'int'

Case #2 fill_value=(0, np.nan):

zarr/core/dtype/npy/structured.py", line 371, in _cast_scalar_unchecked
    res = np.array([tuple(data)], dtype=na_dtype)[0]
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: a bytes-like object is required, not 'float'

Case #3 fill_value=0 TypeError: a bytes-like object is required, not 'int'

Case #4 fill_value=None TypeError: a bytes-like object is required, not 'int'

Case #5: fill_value={}

zarr/core/metadata/v3.py", line 236, in __init__
    fill_value_parsed = data_type.cast_scalar(fill_value)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "zarr/core/dtype/npy/structured.py", line 406, in cast_scalar
    raise TypeError(msg)
TypeError: Cannot convert object {} with type <class 'dict'> to a scalar compatible with the data type Structured(fields=(('a', Int32(endianness='little')), ('b', RawBytes(length=1600)))).

Case #6: fill_value={'a': 1, 'b': np.nan} This was possible before in zarr v2.x

zarr/core/dtype/npy/structured.py", line 406, in cast_scalar
    raise TypeError(msg)
TypeError: Cannot convert object {'a': 1, 'b': nan} with type <class 'dict'> to a scalar compatible with the data type Structured(fields=(('a', Int32(endianness='little')), ('b', RawBytes(length=1600)))).

Steps to reproduce

# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "zarr@git+https://github.com/zarr-developers/zarr-python.git@main",
# ]
# ///
#
# This script automatically imports the development branch of zarr to check for issues

import zarr
from zarr.storage import LocalStore
import numpy as np

DTYPE = np.dtype([('a', 'i4'), ('b', 'f4', (20,20))])

store = LocalStore('bug.zarr')
arr = zarr.create_array(store, name='test', shape=(10,), dtype=DTYPE)

Expected behavior: same as np.empty (preferred) or np.zeros when using a structured dtype
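For comparison, a minimal NumPy-only sketch of the behavior being asked for. It uses the same DTYPE as the reproduction script and no zarr calls; the variable names `zeros` and `empty` are illustrative.

```python
import numpy as np

# Same structured dtype as in the reproduction script:
# an int32 field plus a (20, 20) float32 subarray field.
DTYPE = np.dtype([("a", "i4"), ("b", "f4", (20, 20))])

# NumPy accepts this dtype without an explicit fill value.
zeros = np.zeros((10,), dtype=DTYPE)   # every field zero-initialized
empty = np.empty((10,), dtype=DTYPE)   # contents left uninitialized

print(zeros.shape)       # (10,)
print(zeros["b"].shape)  # (10, 20, 20)
print(zeros[0]["a"])     # 0
```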

Additional output

No response

sehoffmann, Nov 17 '25 17:11

Thanks for this report! I'd love to reach feature parity with zarr-python 2.x here. That being said, I'm not sure when I will have time to look into this, so the fastest solution might be to dig into the code yourself. I will definitely review any fix quickly!

d-v-b, Nov 17 '25 18:11

@d-v-b Thanks for the quick reply. I would gladly help with this issue, as structured dtypes are a high priority for our data handling (we need to keep the inode count low). Do you have any pointers on where to get started, since I am not too familiar with the internal details of zarr?

sehoffmann, Nov 17 '25 22:11

For a primer on our internal model of data types, I would look at this complete example that demonstrates how to create a new custom data type from scratch. And please let us know if anything about that example can be improved. That basically shows all the moving parts of the data type interface.

Next, you would need to figure out the best way to model the missing data type (an N-dimensional array of other data types) as a Zarr data type. If you create such a data type and register it, the structured dtype should just work. But let us know if it doesn't!
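As a starting point for that modeling step, here is a small NumPy-only sketch of how the subarray field looks from NumPy's side. It does not touch zarr internals. The 1600-byte size it prints matches the RawBytes(length=1600) in the error messages above, which suggests (an inference, not confirmed in this thread) that the subarray field currently falls back to a raw-bytes type.

```python
import numpy as np

DTYPE = np.dtype([("a", "i4"), ("b", "f4", (20, 20))])

# dtype.fields maps each field name to (field dtype, byte offset).
field_b = DTYPE.fields["b"][0]

# For a subarray dtype, .subdtype is (base dtype, shape); otherwise it is None.
base, shape = field_b.subdtype
print(base, shape)        # float32 (20, 20)

# 20 * 20 elements * 4 bytes each.
print(field_b.itemsize)   # 1600

# A plain scalar field has no subdtype.
print(DTYPE.fields["a"][0].subdtype)  # None
```

A Zarr data type for this field would need to carry at least the same two pieces of information: the base data type and the shape.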

d-v-b, Nov 17 '25 22:11

For reference purposes, this is the .zarray metadata created with v2.18.5 for the following test code:

dtype = np.dtype([('a', 'f4', (2, 2)), ('b', 'i4')])
arr = zarr.create(store='dtype.zarr', shape=(10,), dtype=dtype, zarr_version=2, fill_value=0)
{
    "chunks": [
        10
    ],
    "compressor": {
        "blocksize": 0,
        "clevel": 5,
        "cname": "lz4",
        "id": "blosc",
        "shuffle": 1
    },
    "dtype": [
        [
            "a",
            "<f4",
            [
                2,
                2
            ]
        ],
        [
            "b",
            "<i4"
        ]
    ],
    "fill_value": "AAAAAAAAAAAAAAAAAAAAAAAAAAA=",
    "filters": null,
    "order": "C",
    "shape": [
        10
    ],
    "zarr_format": 2
}

My ZDType implementation is mostly done, but backporting it for v2 metadata support is a bit tricky. In particular, if I am not mistaken, the metadata JSON above is not really supported at the moment: Structured assumes that the second element of each field entry is either a string or a nested Structured dtype.
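The three-element field entry in the v2 metadata above has the same layout as NumPy's own `dtype.descr`, and the `fill_value` string is consistent with the Base64 encoding of an all-zero scalar of that dtype (the v2 format stores fill values for structured dtypes as Base64). A quick NumPy-only check; the names `fill` and `encoded` are illustrative:

```python
import base64

import numpy as np

dtype = np.dtype([("a", "<f4", (2, 2)), ("b", "<i4")])

# descr lists subarray fields as (name, base dtype string, shape).
print(dtype.descr)     # [('a', '<f4', (2, 2)), ('b', '<i4')]

# 4 float32 values plus 1 int32 value = 20 bytes per element.
print(dtype.itemsize)  # 20

# An all-zero scalar, serialized to raw bytes and Base64-encoded.
fill = np.zeros((), dtype=dtype)
encoded = base64.standard_b64encode(fill.tobytes()).decode("ascii")
print(encoded)
```

The printed string should be 27 `A` characters followed by `=`, which is what 20 zero bytes encode to.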

Using a subarray dtype directly in v2.18.5 produces a "flattened" zarr array:

dtype = np.dtype(('f4', (2, 2)))
arr = zarr.create(store='dtype.zarr', shape=(10,), dtype=dtype, zarr_version=2, fill_value=0)
{
    "chunks": [
        10,
        2,
        2
    ],
    "compressor": {
        "blocksize": 0,
        "clevel": 5,
        "cname": "lz4",
        "id": "blosc",
        "shuffle": 1
    },
    "dtype": "<f4",
    "fill_value": 0.0,
    "filters": null,
    "order": "C",
    "shape": [
        10,
        2,
        2
    ],
    "zarr_format": 2
}
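This flattening matches what NumPy itself does when a bare subarray dtype is used to create an array: the subarray shape is appended to the array shape and the dtype collapses to the base type. A minimal NumPy-only illustration:

```python
import numpy as np

sub = np.dtype(("f4", (2, 2)))
print(sub.subdtype)          # (dtype('float32'), (2, 2))

# The subarray dimensions are absorbed into the array's shape.
arr = np.zeros((10,), dtype=sub)
print(arr.shape, arr.dtype)  # (10, 2, 2) float32
```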

sehoffmann, Nov 19 '25 15:11