zarr-python

Subarray dtypes get lost on serialization / cast to void type

Open • sehoffmann opened this issue 1 month ago • 1 comment

Zarr version

v3.1.3

Numcodecs version

v0.15.1

Python Version

3.12.10

Operating System

Linux

Installation

uv / pip

Description

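Structured dtypes with subarray fields are not serialized faithfully: when the array is created, the subarray field is converted to a raw-bytes (void) dtype of the same byte length. Reading the array back therefore yields void fields instead of arrays with the original shape and element type.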

Output:

zarr/core/dtype/npy/structured.py:318: UnstableSpecificationWarning: The data type (Structured(fields=(('a', Int32(endianness='little')), ('b', RawBytes(length=100))))) does not have a Zarr V3 specification. That means that the representation of arrays saved with this data type may change without warning in a future version of Zarr Python. Arrays stored with this data type may be unreadable by other Zarr libraries. Use this data type at your own risk! Check https://github.com/zarr-developers/zarr-extensions/tree/main/data-types for the status of data type specifications for Zarr V3.
  v3_unstable_dtype_warning(self)
zarr/core/dtype/npy/bytes.py:785: UnstableSpecificationWarning: The data type (RawBytes(length=100)) does not have a Zarr V3 specification. That means that the representation of arrays saved with this data type may change without warning in a future version of Zarr Python. Arrays stored with this data type may be unreadable by other Zarr libraries. Use this data type at your own risk! Check https://github.com/zarr-developers/zarr-extensions/tree/main/data-types for the status of data type specifications for Zarr V3.
  v3_unstable_dtype_warning(self)
Original dtype: [('a', '<i4'), ('b', '<f4', (5, 5))]
Array created with dtype: [('a', '<i4'), ('b', 'V100')]
Accessed item dtype:  [('a', '<i4'), ('b', 'V100')]

zarr.json:

{
  "shape": [
    10
  ],
  "data_type": {
    "name": "structured",
    "configuration": {
      "fields": [
        [
          "a",
          "int32"
        ],
        [
          "b",
          {
            "name": "raw_bytes",
            "configuration": {
              "length_bytes": 100
            }
          }
        ]
      ]
    }
  },
  "chunk_grid": {
    "name": "regular",
    "configuration": {
      "chunk_shape": [
        10
      ]
    }
  },
  "chunk_key_encoding": {
    "name": "default",
    "configuration": {
      "separator": "/"
    }
  },
  "fill_value": "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=",
  "codecs": [
    {
      "name": "bytes"
    },
    {
      "name": "zstd",
      "configuration": {
        "level": 0,
        "checksum": false
      }
    }
  ],
  "attributes": {},
  "zarr_format": 3,
  "node_type": "array",
  "storage_transformers": []
}

Steps to reproduce

# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "zarr@git+https://github.com/zarr-developers/zarr-python.git@main",
# ]
# ///
#
# This script automatically imports the development branch of zarr to check for issues

import numpy as np
import zarr
from zarr.storage import LocalStore

DTYPE = np.dtype([('a', 'i4'), ('b', 'f4', (5,5))])

store = LocalStore('bug.zarr')
arr = zarr.create_array(store, name='test', shape=(10,), dtype=DTYPE, fill_value=bytes(DTYPE.itemsize))

print('Original dtype:', DTYPE)
print('Array created with dtype:', arr.dtype)
print('Accessed item dtype: ', arr[0].dtype)
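
A possible stopgap, sketched with NumPy only: because the void field has the same byte length as the original subarray (5 * 5 * 4 = 100 bytes), the data read back can be reinterpreted with the original dtype via `ndarray.view`. This assumes the stored bytes are the unmodified, unpadded float32 payload, which this report does not confirm. `RAW_DTYPE` and the simulated round trip below are illustrative stand-ins, not zarr behavior.

```python
import numpy as np

# Original dtype from the report: field 'b' is a (5, 5) float32 subarray.
DTYPE = np.dtype([("a", "i4"), ("b", "f4", (5, 5))])
# Illustrative stand-in for the dtype the array came back with.
RAW_DTYPE = np.dtype([("a", "<i4"), ("b", "V100")])

original = np.zeros(10, dtype=DTYPE)
original["b"] = np.arange(25, dtype="f4").reshape(5, 5)

# Simulate the lossy round trip by reinterpreting the same memory
# (assumption: the bytes themselves are preserved unchanged).
raw = original.view(RAW_DTYPE)

# Reinterpret the void field with the original dtype again.
restored = raw.view(DTYPE)
print(restored["b"].shape)
```

With a real array, the same idea would be applied to the NumPy array returned by indexing, for example `arr[:].view(DTYPE)`, under the same assumptions.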

Additional output

No response

sehoffmann, Nov 17 '25 16:11

I think a fundamental issue here is that the structured data type is parameterized by its inner data types, and we don't have an inner data type that can express "a 5x5 array of 32-bit floats". This wasn't a problem for Zarr Python 2.x because it only supported Zarr V2, which uses NumPy's data type model wholesale. But Zarr Python 3.x has to support Zarr V2 and V3 arrays with the same data type classes, so to resolve this issue we need a Zarr V3 data type that can express fixed-size arrays of other data types.
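
To illustrate what such a data type would need to carry, here is a small NumPy-only sketch of how the subarray field from the report looks on the NumPy side. The element type and shape live in `subdtype`, while the field as a whole reports a void kind with a 100-byte item size. That this is why the field ends up as `RawBytes(length=100)` is an assumption, not something confirmed here; the `summary` dict is purely illustrative.

```python
import numpy as np

# The dtype from the report: field 'b' is a (5, 5) subarray of float32.
DTYPE = np.dtype([("a", "i4"), ("b", "f4", (5, 5))])

# `fields` maps each field name to a (dtype, byte offset) tuple.
field_b = DTYPE.fields["b"][0]

# NumPy records the element type and shape of the subarray here.
base, shape = field_b.subdtype

summary = {
    "base": base.name,             # element type of the subarray
    "shape": shape,                # subarray shape
    "kind": field_b.kind,          # kind code of the field as a whole
    "itemsize": field_b.itemsize,  # total bytes per element
}
print(summary)
```

A fixed-size array data type for Zarr V3 would need to record at least the `base` and `shape` pieces above, which the current `raw_bytes` metadata (only `length_bytes`) cannot express.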

@rabernat, would the Arrow prototype you are working on be helpful here?

d-v-b, Nov 17 '25 19:11