zarr-python

[v3] Fixed-width unicode string support in zarr v3

Open TomAugspurger opened this issue 1 year ago • 0 comments

Zarr version

v3

Numcodecs version

na

Python Version

na

Operating System

na

Installation

na

Description

As mentioned in https://github.com/zarr-developers/zarr-python/pull/2323#issuecomment-2407566652, we currently can't create an array with a fixed-width string dtype in zarr v3.

In [1]: import zarr

In [2]: arr = zarr.create(shape=(3,), dtype="U3")

In [3]: arr[:] = ['a', 'bb', 'ccc']

In [4]: arr[:]
Out[4]: array(['a', 'bb', 'ccc'], dtype=StringDType())

We want the NumPy dtype of that array to be U3, a fixed-width unicode string dtype, and we want to support it in addition to the variable-width strings currently in use. Some initial questions I don't know the answer to:

  1. What data_type shows up in the metadata?
  2. What codecs are needed?
  3. How are the actual bytes stored? For comparison, in Parquet fixed_len_byte_array is one of the primitive types.
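
As background for question 3, here is a minimal sketch of how NumPy itself lays out a U3 array in memory. It uses only NumPy and shows the in-memory representation. Whether zarr v3 should store these bytes unchanged or pass them through a codec is exactly the open question, so nothing here is a proposal for the on-disk format.

```python
import numpy as np

# A fixed-width unicode array: each element holds up to 3 code points.
arr = np.array(["a", "bb", "ccc"], dtype="<U3")

# NumPy stores each code point as 4 bytes (UTF-32), so U3 is 12 bytes wide.
itemsize = arr.dtype.itemsize

# The raw buffer is 3 elements x 12 bytes; short strings are null-padded.
raw = arr.tobytes()
first = raw[:itemsize]  # little-endian "a" followed by padding

# Because every element has the same width, the buffer can be
# reinterpreted directly without any length prefixes or offsets.
roundtrip = np.frombuffer(raw, dtype="<U3")

print(itemsize, len(raw), first, roundtrip)
```

One consequence of this layout is that the byte order matters (the "<" in "<U3"), so any stored representation would presumably need to record or fix the endianness.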

Steps to reproduce

.

Additional output

No response

TomAugspurger, Oct 12 '24 19:10