TypeError when passing old numcodecs to zarr v3
Zarr version
v3.0.6
Numcodecs version
v0.16.0
Python Version
3.13
Operating System
mac
Installation
uv
Description
Passing an old-style numcodecs codec to zarr raises a TypeError, even though this scenario could be detected and the codec upcast to its zarr-v3-compatible equivalent instead.
This has been reported by a lot of xarray users (https://github.com/pydata/xarray/issues/10032) as well as here https://github.com/zarr-developers/zarr-python/issues/2710#issuecomment-2600974549.
Traceback (most recent call last):
File "/Users/tom/Documents/Work/Code/experimentation/bugs/blosc/pure_zarr_mve.py", line 25, in <module>
za = zarr.create_array(
store,
...<4 lines>...
compressors=compressors,
)
File "/Users/tom/.cache/uv/environments-v2/pure-zarr-mve-2145b34a8fc90dca/lib/python3.13/site-packages/zarr/api/synchronous.py", line 879, in create_array
sync(
~~~~^
zarr.core.array.create_array(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...<19 lines>...
)
^
)
^
File "/Users/tom/.cache/uv/environments-v2/pure-zarr-mve-2145b34a8fc90dca/lib/python3.13/site-packages/zarr/core/sync.py", line 163, in sync
raise return_result
File "/Users/tom/.cache/uv/environments-v2/pure-zarr-mve-2145b34a8fc90dca/lib/python3.13/site-packages/zarr/core/sync.py", line 119, in _runner
return await coro
^^^^^^^^^^
File "/Users/tom/.cache/uv/environments-v2/pure-zarr-mve-2145b34a8fc90dca/lib/python3.13/site-packages/zarr/core/array.py", line 4146, in create_array
result = await init_array(
^^^^^^^^^^^^^^^^^
...<16 lines>...
)
^
File "/Users/tom/.cache/uv/environments-v2/pure-zarr-mve-2145b34a8fc90dca/lib/python3.13/site-packages/zarr/core/array.py", line 3961, in init_array
array_array, array_bytes, bytes_bytes = _parse_chunk_encoding_v3(
~~~~~~~~~~~~~~~~~~~~~~~~^
compressors=compressors,
^^^^^^^^^^^^^^^^^^^^^^^^
...<2 lines>...
dtype=dtype_parsed,
^^^^^^^^^^^^^^^^^^^
)
^
File "/Users/tom/.cache/uv/environments-v2/pure-zarr-mve-2145b34a8fc90dca/lib/python3.13/site-packages/zarr/core/array.py", line 4330, in _parse_chunk_encoding_v3
out_bytes_bytes = tuple(_parse_bytes_bytes_codec(c) for c in maybe_bytes_bytes)
File "/Users/tom/.cache/uv/environments-v2/pure-zarr-mve-2145b34a8fc90dca/lib/python3.13/site-packages/zarr/core/array.py", line 4330, in <genexpr>
out_bytes_bytes = tuple(_parse_bytes_bytes_codec(c) for c in maybe_bytes_bytes)
~~~~~~~~~~~~~~~~~~~~~~~~^^^
File "/Users/tom/.cache/uv/environments-v2/pure-zarr-mve-2145b34a8fc90dca/lib/python3.13/site-packages/zarr/registry.py", line 184, in _parse_bytes_bytes_codec
raise TypeError(f"Expected a BytesBytesCodec. Got {type(data)} instead.")
TypeError: Expected a BytesBytesCodec. Got <class 'numcodecs.blosc.Blosc'> instead.
Steps to reproduce
# /// script
# requires-python = ">=3.13"
# dependencies = [
# "numpy",
# "zarr>=3",
# ]
# ///
import numpy as np
import zarr
import numcodecs
print(zarr.__version__)
print(numcodecs.__version__)
store = "/tmp/foo.zarr"
shape = (1024 * 1024 * 1024,)
chunks = (1024 * 1024 * 16,)
dtype = np.float64
fill_value = np.nan
# cname = "blosclz"
cname = "lz4"
# Use the variable above so that switching cname actually changes the codec.
compressors = [numcodecs.Blosc(cname=cname)]
za = zarr.create_array(
store,
shape=shape,
chunks=chunks,
dtype=dtype,
fill_value=fill_value,
compressors=compressors,
)
Additional output
No response
@normanrz - you know this part of the code best. Do you think it's reasonable for us to cast vanilla numcodecs codecs to zarr3 codecs? Seems like we have everything we need to make the right decisions here.
Note that if I change the compressors line to this then it works
compressors = [zarr.codecs.BloscCodec(cname="zstd", clevel=3, shuffle="shuffle")]
@normanrz - you know this part of the code best. Do you think it's reasonable for us to cast vanilla numcodecs codecs to zarr3 codecs? Seems like we have everything we need to make the right decisions here.
Yes, upcasting is certainly possible. Whether to do that here in zarr or in numcodecs invokes the usual cyclic dependency issue. My gut feeling would be that a to_zarr3 function in numcodecs.zarr3 would be better placed, though.
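For illustration, here is a minimal sketch of what such an upcast could look like for Blosc. It is not the numcodecs or zarr API: `blosc_config_to_v3_kwargs` is a hypothetical helper name. It assumes that `numcodecs.Blosc(...).get_config()` returns a dict with the keys `id`, `cname`, `clevel`, `shuffle` and `blocksize`, and that `zarr.codecs.BloscCodec` accepts the shuffle mode as one of the strings `"noshuffle"`, `"shuffle"` or `"bitshuffle"`.

```python
from typing import Any

# numcodecs.Blosc stores the shuffle mode as an integer constant
# (NOSHUFFLE=0, SHUFFLE=1, BITSHUFFLE=2); the v3 codec takes a name.
_SHUFFLE_NAMES = {0: "noshuffle", 1: "shuffle", 2: "bitshuffle"}


def blosc_config_to_v3_kwargs(config: dict[str, Any]) -> dict[str, Any]:
    """Translate a numcodecs Blosc config dict into BloscCodec keyword arguments.

    Hypothetical helper for illustration only; not part of zarr or numcodecs.
    """
    if config.get("id") != "blosc":
        raise TypeError(f"Expected a Blosc config. Got {config.get('id')!r} instead.")

    kwargs: dict[str, Any] = {
        "cname": config["cname"],
        "clevel": config["clevel"],
        "blocksize": config.get("blocksize", 0),
    }

    shuffle = config.get("shuffle", 1)
    if shuffle in _SHUFFLE_NAMES:
        kwargs["shuffle"] = _SHUFFLE_NAMES[shuffle]
    # AUTOSHUFFLE (-1) has no fixed name, so shuffle is left unset here and
    # the v3 codec is expected to pick a mode based on the array dtype.
    return kwargs
```

With both packages installed, `zarr.codecs.BloscCodec(**blosc_config_to_v3_kwargs(numcodecs.Blosc(cname="lz4").get_config()))` should then produce a codec that `zarr.create_array` accepts in `compressors`. Other codecs would each need their own mapping.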
just adding weight to this ticket... [v3.0.6]
Expected a BytesBytesCodec. Got <class 'numcodecs.blosc.Blosc'> instead.. Skipping.
looking forward to the unified/aligned implementation, thanks devteam!
We really need this to work, because it's preventing people from using this pattern to convert their zarr v2 data to zarr v3 via xarray:
ds = xr.open_zarr('store-v2.zarr')
ds.to_zarr('store-v3.zarr')
Bumping for priority here; also impacting an upgrade to a production workflow that I'd like to quickly migrate to Zarr v3.
ds.drop_encoding().to_zarr("store-v3.zarr") should work, as long as you're ok with default compression
I can confirm that the defaults all work just fine when writing a new Dataset created in-memory to Zarr using the latest mainline releases of zarr-python and numcodecs.
I'm still looking for clarity on defining custom encoding/compressors. The workflow I'm migrating to Zarr v3 previously had some fine-tuning done to create a compression scheme that balanced output size and runtime. Setting this up the original way - e.g. instantiating a numcodecs.Blosc as in @TomNicholas's original top post - continues to produce the error message in this comment.
Yes, upcasting is certainly possible. Whether to do that here in zarr or in numcodecs invokes the usual cyclic dependency issue. My gut feeling would be that a to_zarr3 function in numcodecs.zarr3 would be better placed, though.
I made a draft for this in zarr-developers/numcodecs#741. feel free to leave a comment.
Hi all, I see that the TypeError fix is currently sitting in a pull request. How close is that PR to being merged? I'm in the process of converting several USGS datasets to IceChunk (which requires Zarr v3) and have run into the same TypeError. Thanks!
CC: @aufdenkampe
This bug appeared when @kieranbartels tried to save a slice of the NCAR/USGS CONUS404 dataset using zarr-python v3. As a result, this issue will be a blocker to having USGS Water Mission Area update their data systems to the latest, along with:
- https://github.com/zarr-developers/zarr-python/pull/2774
Fortunately, @maxrjones & @d-v-b are working on a fix for that.
We would very much appreciate prioritizing a fix for this issue, so we can move forward with Zarr 3.
For everyone blocked by this issue, please be aware there is a simple workaround: update your code to specify your codecs using the Zarr-3 style, as described in the docs, rather than using the old-style codec API.
As @dcherian noted above, when opening Zarr data, Xarray automatically stores each array's encoding details in the .encoding namespace and by default uses these settings when writing new data. So when opening V2 data and then writing V3 data, Xarray is passing along the old V2-style codec specification, leading to the TypeError you are seeing.
To change this behavior you can either:
- edit the encoding directly, e.g. ds.foo.encoding['compressors'] = [zarr.codecs.ZstdCodec()]
- specify the encoding during to_zarr(..., encoding={"foo": {"compressors": [zarr.codecs.ZstdCodec()]}})
- drop the encoding, e.g. ds.drop_encoding() and use the default encoding
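As a rough sketch, the second and third options can be combined into one migration helper. This assumes xarray and zarr-python 3 are installed and that `to_zarr` writes Zarr v3 by default in that environment. `build_v3_encoding` and `copy_v2_store_to_v3` are hypothetical names used only for illustration, and `ZstdCodec()` with default settings stands in for whatever compression you actually want.

```python
from typing import Any, Iterable


def build_v3_encoding(var_names: Iterable[str], compressor: Any) -> dict[str, dict[str, list[Any]]]:
    """Build a to_zarr `encoding` mapping that gives every variable the same compressor."""
    return {name: {"compressors": [compressor]} for name in var_names}


def copy_v2_store_to_v3(src: str, dst: str) -> None:
    """Open a Zarr v2 store with xarray and rewrite it with v3-style codecs."""
    # Imported lazily so build_v3_encoding stays usable without these packages.
    import xarray as xr
    import zarr

    ds = xr.open_zarr(src)
    # Discard the old-style codec objects that xarray carried over in .encoding.
    ds = ds.drop_encoding()
    # State the v3 codec explicitly for each data variable; coordinates
    # fall back to the default encoding.
    encoding = build_v3_encoding(ds.data_vars, zarr.codecs.ZstdCodec())
    ds.to_zarr(dst, encoding=encoding)
```

Calling `copy_v2_store_to_v3("store-v2.zarr", "store-v3.zarr")` would then replace the failing open_zarr / to_zarr pattern, at the cost of not preserving the original compression settings.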
I realize these workarounds are not as convenient as having Zarr automatically translate the codec settings from V2-style to V3-style. (We are working on that.) However, no one needs to be blocked from using Zarr 3 or Icechunk because of this issue.
@rabernat, thank you for providing a clear summary of the three workarounds to this issue!