TypeError when passing old numcodecs to zarr v3

Open TomNicholas opened this issue 9 months ago • 13 comments

Zarr version

v3.0.6

Numcodecs version

v0.16.0

Python Version

3.13

Operating System

mac

Installation

uv

Description

Passing the old style of numcodecs codec to zarr raises a TypeError, when this scenario could be detected and upcast into the zarr-v3-compatible version of that codec instead.

This has been reported by a lot of xarray users (https://github.com/pydata/xarray/issues/10032) as well as here https://github.com/zarr-developers/zarr-python/issues/2710#issuecomment-2600974549.

Traceback (most recent call last):
  File "/Users/tom/Documents/Work/Code/experimentation/bugs/blosc/pure_zarr_mve.py", line 25, in <module>
    za = zarr.create_array(
        store,
    ...<4 lines>...
        compressors=compressors,
    )
  File "/Users/tom/.cache/uv/environments-v2/pure-zarr-mve-2145b34a8fc90dca/lib/python3.13/site-packages/zarr/api/synchronous.py", line 879, in create_array
    sync(
    ~~~~^
        zarr.core.array.create_array(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<19 lines>...
        )
        ^
    )
    ^
  File "/Users/tom/.cache/uv/environments-v2/pure-zarr-mve-2145b34a8fc90dca/lib/python3.13/site-packages/zarr/core/sync.py", line 163, in sync
    raise return_result
  File "/Users/tom/.cache/uv/environments-v2/pure-zarr-mve-2145b34a8fc90dca/lib/python3.13/site-packages/zarr/core/sync.py", line 119, in _runner
    return await coro
           ^^^^^^^^^^
  File "/Users/tom/.cache/uv/environments-v2/pure-zarr-mve-2145b34a8fc90dca/lib/python3.13/site-packages/zarr/core/array.py", line 4146, in create_array
    result = await init_array(
             ^^^^^^^^^^^^^^^^^
    ...<16 lines>...
    )
    ^
  File "/Users/tom/.cache/uv/environments-v2/pure-zarr-mve-2145b34a8fc90dca/lib/python3.13/site-packages/zarr/core/array.py", line 3961, in init_array
    array_array, array_bytes, bytes_bytes = _parse_chunk_encoding_v3(
                                            ~~~~~~~~~~~~~~~~~~~~~~~~^
        compressors=compressors,
        ^^^^^^^^^^^^^^^^^^^^^^^^
    ...<2 lines>...
        dtype=dtype_parsed,
        ^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/Users/tom/.cache/uv/environments-v2/pure-zarr-mve-2145b34a8fc90dca/lib/python3.13/site-packages/zarr/core/array.py", line 4330, in _parse_chunk_encoding_v3
    out_bytes_bytes = tuple(_parse_bytes_bytes_codec(c) for c in maybe_bytes_bytes)
  File "/Users/tom/.cache/uv/environments-v2/pure-zarr-mve-2145b34a8fc90dca/lib/python3.13/site-packages/zarr/core/array.py", line 4330, in <genexpr>
    out_bytes_bytes = tuple(_parse_bytes_bytes_codec(c) for c in maybe_bytes_bytes)
                            ~~~~~~~~~~~~~~~~~~~~~~~~^^^
  File "/Users/tom/.cache/uv/environments-v2/pure-zarr-mve-2145b34a8fc90dca/lib/python3.13/site-packages/zarr/registry.py", line 184, in _parse_bytes_bytes_codec
    raise TypeError(f"Expected a BytesBytesCodec. Got {type(data)} instead.")
TypeError: Expected a BytesBytesCodec. Got <class 'numcodecs.blosc.Blosc'> instead.

Steps to reproduce

# /// script
# requires-python = ">=3.13"
# dependencies = [
#     "numpy",
#     "zarr>=3",
# ]
# ///
import numpy as np
import zarr
import numcodecs

print(zarr.__version__)
print(numcodecs.__version__)

store = "/tmp/foo.zarr"
shape = (1024 * 1024 * 1024,)
chunks = (1024 * 1024 * 16,)
dtype = np.float64
fill_value = np.nan

# cname = "blosclz"
cname = "lz4"
compressors = [numcodecs.Blosc(cname="lz4")]

za = zarr.create_array(
    store,
    shape=shape,
    chunks=chunks,
    dtype=dtype,
    fill_value=fill_value,
    compressors=compressors,
)

Additional output

No response

TomNicholas avatar Apr 07 '25 16:04 TomNicholas

@normanrz - you know this part of the code best. Do you think it's reasonable for us to cast vanilla numcodecs codecs to zarr3 codecs? Seems like we have everything we need to make the right decisions here.

jhamman avatar Apr 07 '25 16:04 jhamman

Note that if I change the compressors line to this then it works

compressors = [zarr.codecs.BloscCodec(cname="zstd", clevel=3, shuffle="shuffle")]

TomNicholas avatar Apr 07 '25 16:04 TomNicholas

@normanrz - you know this part of the code best. Do you think it's reasonable for us to cast vanilla numcodecs codecs to zarr3 codecs? Seems like we have everything we need to make the right decisions here.

Yes, upcasting is certainly possible. Whether to do that here in zarr or in numcodecs invokes the usual cyclic dependency issue. My gut feeling would be that a to_zarr3 function in numcodecs.zarr3 would be better placed, though.
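As a rough illustration of what such an upcast might involve for Blosc, here is a minimal sketch. It is not the proposed to_zarr3 function and not an existing API in zarr-python or numcodecs: the helper name blosc_v2_config_to_v3_kwargs is hypothetical. It assumes the input is the dict returned by numcodecs.Blosc.get_config() (keys id, cname, clevel, shuffle, blocksize) and produces keyword arguments for zarr.codecs.BloscCodec, which takes the shuffle mode as a string rather than an integer.

```python
# Hypothetical helper for illustration only; not part of zarr-python or numcodecs.

# numcodecs.Blosc stores the shuffle mode as an integer constant
# (NOSHUFFLE=0, SHUFFLE=1, BITSHUFFLE=2), whereas zarr.codecs.BloscCodec
# expects one of the strings below.
_SHUFFLE_NAMES = {0: "noshuffle", 1: "shuffle", 2: "bitshuffle"}


def blosc_v2_config_to_v3_kwargs(config):
    """Translate a numcodecs.Blosc config dict into BloscCodec keyword arguments."""
    if config.get("id") != "blosc":
        raise ValueError(f"expected a blosc config, got id={config.get('id')!r}")

    shuffle = config["shuffle"]
    if shuffle not in _SHUFFLE_NAMES:
        # AUTOSHUFFLE (-1) depends on the item size of the data, so this
        # sketch leaves it to the caller to pick an explicit mode.
        raise ValueError(f"unsupported shuffle value: {shuffle!r}")

    return {
        "cname": config["cname"],
        "clevel": config["clevel"],
        "shuffle": _SHUFFLE_NAMES[shuffle],
        "blocksize": config["blocksize"],
    }


# A config dict shaped like the output of numcodecs.Blosc(cname="lz4").get_config()
old_config = {"id": "blosc", "cname": "lz4", "clevel": 5, "shuffle": 1, "blocksize": 0}
v3_kwargs = blosc_v2_config_to_v3_kwargs(old_config)
print(v3_kwargs)

# With zarr installed, the result could then be passed on, for example:
#   compressors = [zarr.codecs.BloscCodec(**v3_kwargs)]
```

Working on the config dict rather than the codec object keeps the sketch free of imports from either library, which sidesteps the cyclic dependency question for the purpose of the example.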

normanrz avatar Apr 07 '25 16:04 normanrz

just adding weight to this ticket... [v3.0.6] Expected a BytesBytesCodec. Got <class 'numcodecs.blosc.Blosc'> instead.. Skipping.

looking forward to the unified/aligned implementation, thanks devteam!

fowlerovski avatar Apr 09 '25 23:04 fowlerovski

We really need this to work, because it's preventing people from using this pattern to convert their Zarr v2 data to Zarr v3 via xarray:

ds = xr.open_zarr('store-v2.zarr')
ds.to_zarr('store-v3.zarr')

TomNicholas avatar Apr 11 '25 15:04 TomNicholas

Bumping for priority here; also impacting an upgrade to a production workflow that I'd like to quickly migrate to Zarr v3.

darothen avatar Apr 21 '25 15:04 darothen

ds.drop_encoding().to_zarr("store-v3.zarr") should work, as long as you're ok with default compression

dcherian avatar Apr 21 '25 16:04 dcherian

I can confirm that the defaults all work just fine when writing a new Dataset created in-memory to Zarr using the latest mainline releases of zarr-python and numcodecs.

I'm still looking for clarity on defining custom encoding/compressors. The workflow I'm migrating to Zarr v3 previously had some fine-tuning done to create a compression scheme that balanced output size and runtime. Setting this up the original way, e.g. instantiating a numcodecs.Blosc as in @TomNicholas's original top post, continues to produce the error message in this comment.

darothen avatar Apr 21 '25 20:04 darothen

Yes, upcasting is certainly possible. Whether to do that here in zarr or in numcodecs invokes the usual cyclic dependency issue. My gut feeling would be that a to_zarr3 function in numcodecs.zarr3 would be better placed, though.

I made a draft for this in zarr-developers/numcodecs#741. Feel free to leave a comment.

brokkoli71 avatar Apr 22 '25 13:04 brokkoli71

Hi all, I see that the TypeError fix is currently sitting in a pull request. How close is that PR to being merged? I'm in the process of converting several USGS datasets to IceChunk (which requires Zarr v3) and have run into the same TypeError. Thanks!

CC: @aufdenkampe

kieranbartels avatar Jun 10 '25 20:06 kieranbartels

This bug appeared when @kieranbartels tried to save a slice of the NCAR/USGS CONUS404 dataset using zarr-python v3. As a result, this issue will be a blocker to having USGS Water Mission Area update their data systems to the latest, along with:

  • https://github.com/zarr-developers/zarr-python/pull/2774

Fortunately, @maxrjones & @d-v-b are working on a fix for that.

We would very much appreciate prioritizing a fix for this issue, so we can move forward with Zarr 3.

aufdenkampe avatar Jun 12 '25 15:06 aufdenkampe

For everyone blocked by this issue, please be aware there is a simple workaround: update your code to specify your codecs using the Zarr-3 style, as described in the docs, rather than using the old-style codec API.

As @dcherian noted above, when opening Zarr data, Xarray automatically stores each array's encoding details in the .encoding namespace and by default uses these settings when writing new data. So when opening V2 data and then writing V3 data, Xarray is passing along the old V2-style codec specification, leading to the TypeError you are seeing.

To change this behavior you can either:

  • edit the encoding directly, e.g. ds.foo.encoding['compressors'] = [zarr.codecs.ZstdCodec()]
  • specify the encoding during to_zarr(..., encoding={"foo": {"compressors": [zarr.codecs.ZstdCodec()]}})
  • drop the encoding, e.g. ds.drop_encoding(), and use the default encoding

I realize these workarounds are not as convenient as having Zarr automatically translate the codec settings from V2-style to V3-style. (We are working on that.) However, no one needs to be blocked from using Zarr 3 or Icechunk because of this issue.
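For datasets with many variables, the second option (passing encoding to to_zarr) could be built programmatically. The following is a minimal sketch, not an xarray or Zarr API: build_v3_encoding is a hypothetical helper, and the placeholder string stands in for a real Zarr-3-style codec object such as zarr.codecs.ZstdCodec(). Which variable names need an entry (data variables only, or coordinates as well) depends on what encoding each one carries, so treat the name list as an assumption to check against your own data.

```python
# Hypothetical helper for illustration only; not part of xarray or zarr-python.

def build_v3_encoding(variable_names, compressors):
    """Return an encoding dict assigning the same compressors to every named variable."""
    # Each variable gets its own list so later per-variable edits do not
    # affect the others.
    return {name: {"compressors": list(compressors)} for name in variable_names}


# Placeholder object standing in for a codec such as zarr.codecs.ZstdCodec().
codec = "ZstdCodec-placeholder"
encoding = build_v3_encoding(["foo", "bar"], [codec])
print(encoding)

# With xarray and zarr installed, usage might look like:
#   encoding = build_v3_encoding(list(ds.data_vars), [zarr.codecs.ZstdCodec()])
#   ds.to_zarr("store-v3.zarr", encoding=encoding)
```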

rabernat avatar Jun 12 '25 15:06 rabernat

@rabernat, thank you for providing a clear summary of the three workarounds to this issue!

aufdenkampe avatar Jun 13 '25 17:06 aufdenkampe