default filters for v2 object dtype are wrong

Open d-v-b opened this issue 1 year ago • 3 comments

This example does not work on main:

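import numpy as np
from zarr import create

def test_x() -> None:
    # Create a Zarr v2 array with object dtype, relying on the default filters
    array = create(
        store={},
        path='foo',
        dtype='O',
        zarr_format=2,
        shape=(3,),
    )
    # On main, this write of str objects fails
    array[:] = np.array(['a', 'b', 'c'], dtype='O')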

The problem is that zarr chooses the wrong default filter for O dtype arrays: it picks VlenBytes when it should pick VlenUTF8. Since the structure of the default codecs is likely to change soon, the fix should probably be made in the context of #2463
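
To illustrate why the choice of filter matters, here is a minimal stdlib-only sketch. The two functions below are hypothetical toy stand-ins, not the numcodecs or zarr implementations, and the length-prefixed layout is an assumption made for illustration. They only show the general idea: a bytes-only variable-length encoder rejects str items, while a string-aware one encodes them to UTF-8 first.

```python
import struct


def encode_vlen_bytes(items):
    """Toy bytes-only variable-length encoder (hypothetical, not numcodecs code)."""
    out = [struct.pack("<I", len(items))]  # header: number of items
    for item in items:
        if not isinstance(item, bytes):
            raise TypeError(f"expected bytes, found {item!r}")
        out.append(struct.pack("<I", len(item)))  # per-item length prefix
        out.append(item)
    return b"".join(out)


def encode_vlen_utf8(items):
    """Toy string-only variable-length encoder (hypothetical, not numcodecs code)."""
    encoded = []
    for item in items:
        if not isinstance(item, str):
            raise TypeError(f"expected str, found {item!r}")
        encoded.append(item.encode("utf-8"))
    # Once the strings are UTF-8 bytes, the bytes layout can be reused.
    return encode_vlen_bytes(encoded)


strings = ["a", "b", "c"]

# The string-aware encoder handles str items.
utf8_payload = encode_vlen_utf8(strings)

# The bytes-only encoder rejects the same items.
try:
    encode_vlen_bytes(strings)
    bytes_codec_ok = True
except TypeError:
    bytes_codec_ok = False
```

In this sketch the string-aware encoder is just as strict in the other direction: it raises on anything that is not a str, so neither toy encoder can handle arbitrary Python objects.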

cc @dcherian

d-v-b · Jan 02 '25 20:01

Seems pretty major, so I've added to the 3.0 milestone - we should at least document this as a known issue!

dstansby · Jan 03 '25 10:01

This is still broken on main. I can "fix" it by associating the dtype string O with the VlenUTF8 codec instead of VlenBytes, but that shouldn't work in general for O dtype arrays (and it breaks some tests).

Knowing little about how zarr encodes arbitrary Python objects, I checked how this works in v2, and it seems that zarr.create(dtype='O') would raise an error if an object_codec was not provided. Not sure we want to emulate that. The original bug report came from xarray's integration tests with zarr main; I can look into exactly how those tests are calling the zarr APIs.

d-v-b · Jan 04 '25 23:01

I'm adding this to the known issues section in the v3 migration guide and marking it as After 3.0.0.

jhamman · Jan 08 '25 00:01