Zarr v3 Struct Support & Tensorstore

Open tasansal opened this issue 5 months ago • 3 comments

Hi @jbms @laramiel

Zarr-Python 3.1.0 was just released, and v3 arrays now support flexible data types. One implementation (not yet marked stable) stores NumPy structured dtypes the same way they were stored in Zarr v2. The example below uses Xarray to create a Zarr v3 store that holds two variables (one structured, one plain) and uses sharding.

My understanding is that the v3 driver in TensorStore doesn't support NumPy structs yet. However, I believe adding support should be straightforward, because the binary encoding didn't change between Zarr v2 and v3; only the metadata definition of the struct changed. Could we make the v3 driver parse the Zarr metadata and reuse the v2 driver's logic to read the structured fields?

What are your recommendations for implementation?

import numpy as np
import xarray as xr

dtype = np.dtype(
    {
        "names": ["foo", "bar"],
        "formats": ["int32", "int64"],
    }
)

encoding = {
    "headers": {"chunks": (128, 128)},
    "seismic": {"chunks": (16, 16, 16), "shards": (128, 128, 128)}
}
seis = xr.DataArray(
    name="seismic",
    dims=["inline", "crossline", "depth"],
    data=np.zeros((512, 512, 512), dtype="float32"),
)
hdr = xr.DataArray(
    name="headers",
    dims=["inline", "crossline"],
    data=np.zeros((512, 512), dtype=dtype),
)

ds = xr.Dataset({"seismic": seis, "headers": hdr})
ds.to_zarr("tmp", mode="w", zarr_format=3, encoding=encoding)
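
The point about the bytes being unchanged can be illustrated with NumPy alone. This is a minimal sketch that assumes an uncompressed, C-order chunk and does not call Zarr or TensorStore; it only shows that the packed record bytes can be decoded once the field names, types, and offsets are known, which is the part the metadata has to supply.

```python
import numpy as np

# Same struct as in the example above: two fields, no alignment padding.
dtype = np.dtype({"names": ["foo", "bar"], "formats": ["int32", "int64"]})

records = np.zeros(4, dtype=dtype)
records["foo"] = np.arange(4)
records["bar"] = np.arange(4) * 10

# Each element is 4 + 8 = 12 packed bytes. This stands in for a chunk
# payload before any compression codec runs (an assumption of this sketch).
raw = records.tobytes()

# Decoding needs only the field layout, not anything format-version specific.
decoded = np.frombuffer(raw, dtype=dtype)
print(dtype.itemsize, dtype.fields["bar"][1], decoded["bar"])
```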

tasansal, Jul 15 '25 18:07

Hi all, we are very eager to upgrade from Zarr v2 to the shiny v3; however, our use case requires structured data arrays. If there is not enough bandwidth to implement support in the driver, we are happy to take a look and contribute back to the community. Any guidance would be greatly appreciated!

BrianMichell, Sep 04 '25 15:09

Is there a reason that you can't store those two fields as separate arrays?

Tensorstore supports this for zarr v2 but exposes each field as a separate array. For zarr v3, in principle we could support this also but the implementation is non-trivial since tensorstore's own internal DataType representation does not support structs --- therefore we would need to modify the codec pipeline to handle a logical "array" represented as a list of separate per-field arrays.
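
To make the "list of separate per-field arrays" idea concrete, here is a NumPy-only sketch. The helpers `split_fields` and `join_fields` are hypothetical names made up for illustration; this is not TensorStore code and says nothing about how the codec pipeline is actually structured. It only shows the de-interleave and re-interleave steps that such a representation implies.

```python
import numpy as np

dtype = np.dtype({"names": ["foo", "bar"], "formats": ["int32", "int64"]})

# A small packed array of records, as it would be laid out in a decoded chunk.
packed = np.zeros((2, 3), dtype=dtype)
packed["foo"] = 7
packed["bar"] = np.arange(6).reshape(2, 3)


def split_fields(arr):
    """Hypothetical helper: one contiguous plain array per struct field."""
    return {name: np.ascontiguousarray(arr[name]) for name in arr.dtype.names}


def join_fields(fields, dtype):
    """Hypothetical helper: interleave per-field arrays back into records."""
    shape = next(iter(fields.values())).shape
    out = np.empty(shape, dtype=dtype)
    for name, values in fields.items():
        out[name] = values
    return out


fields = split_fields(packed)          # read path: records -> per-field arrays
restored = join_fields(fields, dtype)  # write path: per-field arrays -> records
```

Each per-field array has an ordinary scalar dtype, which is why this shape of solution fits a DataType system without structs.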

jbms, Sep 04 '25 16:09

> Is there a reason that you can't store those two fields as separate arrays?

Our full use case stores upwards of 90 discrete fields per element in the example hdr array (~240 bytes/element). We do extract the commonly accessed fields as their own arrays; however, doing that for all fields would make the Xarray Dataset difficult to understand and would hurt performance.

> Tensorstore supports this for zarr v2 but exposes each field as a separate array.

This was a major pain point for us that we set out to resolve last year, and I'm happy to take another stab at it. We are more than happy to deal with raw packed bytes and handle the interpretation ourselves in cases where manipulating a Store field by field does not make sense.
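
As a rough illustration of what "raw packed bytes" handling could look like on our side, the sketch below assumes the data arrives as a uint8 array with one extra trailing axis of length `itemsize`. That shape is an assumption made for this example, not something any driver currently returns; the array `source` is just a stand-in for data read from a store.

```python
import numpy as np

dtype = np.dtype({"names": ["foo", "bar"], "formats": ["int32", "int64"]})

# Stand-in for bytes read from a store: shape (rows, cols, itemsize), uint8.
source = np.zeros((2, 3), dtype=dtype)
source["foo"] = np.arange(6).reshape(2, 3)
raw = source.view(np.uint8).reshape(2, 3, dtype.itemsize)

# Reinterpret the trailing byte axis as a single record, then drop that axis.
records = np.ascontiguousarray(raw).view(dtype)[..., 0]
print(records.shape, records["foo"])
```

The reinterpretation is a zero-copy view when the byte array is already contiguous, so the cost of interpreting the struct ourselves would be small.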

> For zarr v3, in principle we could support this also but the implementation is non-trivial since tensorstore's own internal DataType representation does not support structs --- therefore we would need to modify the codec pipeline to handle a logical "array" represented as a list of separate per-field arrays.

I believe that we could reuse much of the Zarr v2 implementation, though I haven't yet looked at the v3 implementation. Again, we are happy to make an attempt at this with some guidance and contribute back to the community.

BrianMichell, Sep 05 '25 00:09