zarr-python icon indicating copy to clipboard operation
zarr-python copied to clipboard

Add v2 and v3 metadata support to codecs

Open d-v-b opened this issue 4 months ago • 12 comments

This PR will give each codec a v2 and v3 JSON de/serialization routines.

depends on #3318

d-v-b avatar Aug 04 '25 14:08 d-v-b

Codecov Report

:x: Patch coverage is 31.55985% with 1132 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 57.40%. Comparing base (b3e9aed) to head (ff7a1a5).

Files with missing lines Patch % Lines
src/zarr/codecs/_v2.py 4.05% 71 Missing :warning:
src/zarr/core/codec_pipeline.py 37.64% 53 Missing :warning:
src/zarr/codecs/numcodecs/fixed_scale_offset.py 33.33% 40 Missing :warning:
src/zarr/codecs/numcodecs/zfpy.py 0.00% 40 Missing :warning:
src/zarr/codecs/blosc.py 33.33% 38 Missing :warning:
src/zarr/codecs/numcodecs/pcodec.py 0.00% 38 Missing :warning:
src/zarr/codecs/numcodecs/delta.py 38.33% 37 Missing :warning:
src/zarr/codecs/numcodecs/crc32c.py 0.00% 36 Missing :warning:
src/zarr/codecs/numcodecs/astype.py 36.36% 35 Missing :warning:
src/zarr/codecs/vlen_utf8.py 35.84% 34 Missing :warning:
... and 35 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3332      +/-   ##
==========================================
- Coverage   61.77%   57.40%   -4.37%     
==========================================
  Files          85      106      +21     
  Lines       10165    11391    +1226     
==========================================
+ Hits         6279     6539     +260     
- Misses       3886     4852     +966     
Files with missing lines Coverage Δ
src/zarr/abc/store.py 38.29% <ø> (ø)
src/zarr/core/config.py 38.88% <ø> (ø)
src/zarr/core/indexing.py 69.35% <ø> (-0.12%) :arrow_down:
src/zarr/metadata/migrate_v3.py 97.43% <100.00%> (-0.92%) :arrow_down:
src/zarr/core/dtype/__init__.py 30.00% <0.00%> (ø)
src/zarr/core/metadata/v3.py 58.36% <83.33%> (+0.36%) :arrow_up:
src/zarr/errors.py 9.37% <0.00%> (-0.31%) :arrow_down:
src/zarr/abc/numcodec.py 8.33% <0.00%> (-7.06%) :arrow_down:
src/zarr/api/asynchronous.py 72.20% <0.00%> (ø)
src/zarr/registry.py 64.17% <57.14%> (+0.78%) :arrow_up:
... and 39 more
:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Sep 02 '25 21:09 codecov[bot]

this is now in a phase where I would really appreciate eyes from @zarr-developers/python-core-devs.

The goal of this PR is twofold:

  1. To use the exact same codec classes to create Zarr V2 and Zarr V3 arrays. This will resolve a major problem for v2 -> v3 conversion. Today people have to switch out their codec classes, even when going from numcodecs.Gzip to zarr.codecs.gzip. This PR handles that translation internally.

  2. to gracefully handle numcodecs codecs, which includes codecs like numcodecs.gzip which is basically unchanged in zarr v3, to numcodecs.blosc which received breaking changes in zarr v3, all the way to codecs defined in imagecodecs like the jpeg codec, which isn't even standardized yet. This PR handles all three of those cases when creating zarr v2 or v3 arrays (because of goal 1).

In this PR, when a user shows up with compressor={"id": "gzip", "level": 1} or compressor=numcodecs.Gzip(level=1), we will resolve the compressor to an instance of zarr.codecs.GZipCodec. When a user shows up with a completely unknown codec that adheres to the numcodecs API, we will wrap the codec in a special wrapper class and make a best-effort attempt to create a valid codec pipeline around that codec.

This PR also adds typeddict classes for the v2 and v3 form of each codec, which was laborious but IMO worth it for type safety.

If you have time, please look this over and / or test this on your v2 -> v3 workloads. That would be extremely helpful. I think these changes are on the same scale as the data type changes, so this requires a lot of finesse and potentially follow-up PRs.

d-v-b avatar Sep 24 '25 13:09 d-v-b

I think the last thing I need to do here is write a test that ensures compatibility between these changes and older versions of zarr python 3.x.

d-v-b avatar Sep 29 '25 15:09 d-v-b

I am concerned that the additions to the codec API in this PR will be disruptive to people who implemented custom Zarr V3 codecs, e.g.., anyone who defined a class that inherited from in zarr.abc.codec. That argues for a totally new codec API, which is not a prospect I take lightly, but I think it's the best way to avoid breaking existing workflows.

d-v-b avatar Sep 30 '25 15:09 d-v-b

I am concerned that the additions to the codec API in this PR will be disruptive to people who implemented custom Zarr V3 codecs, e.g.., anyone who defined a class that inherited from in zarr.abc.codec. That argues for a totally new codec API, which is not a prospect I take lightly, but I think it's the best way to avoid breaking existing workflows.

yeah currently in virtual-tiff I get lots of failures (119 failed, 317 passed) with most due to KeyError: 'configuration' and some due to TypeError: Invalid JSON: {'name': 'numcodecs.zlib', 'configuration': {}}.

I think this would probably be an issue for other parsers too, such as gribberish and Sean's new HRRRarser.

The VirtualiZarr tests also fail with errors such as: TypeError: Zlib.__init__() got an unexpected keyword argument 'name'.

maxrjones avatar Oct 03 '25 19:10 maxrjones

I am concerned that the additions to the codec API in this PR will be disruptive to people who implemented custom Zarr V3 codecs, e.g.., anyone who defined a class that inherited from in zarr.abc.codec. That argues for a totally new codec API, which is not a prospect I take lightly, but I think it's the best way to avoid breaking existing workflows.

yeah currently in virtual-tiff I get lots of failures (119 failed, 317 passed) with most due to KeyError: 'configuration' and some due to TypeError: Invalid JSON: {'name': 'numcodecs.zlib', 'configuration': {}}.

I think this would probably be an issue for other parsers too, such as gribberish and Sean's new HRRRarser.

The VirtualiZarr tests also fail with errors such as: TypeError: Zlib.__init__() got an unexpected keyword argument 'name'.

This is super useful feedback. I'll add virtual-tiff as a dev dependency while I work out how to make these changes non-breaking.

d-v-b avatar Oct 03 '25 20:10 d-v-b

yeah currently in virtual-tiff I get lots of failures (119 failed, 317 passed) with most due to KeyError: 'configuration' and some due to TypeError: Invalid JSON: {'name': 'numcodecs.zlib', 'configuration': {}}.

More context for this:

>>> from numcodecs.zarr3 import Zlib as ZlibV3
>>> from numcodecs import Zlib
>>> Zlib().get_config()
{'id': 'zlib', 'level': 1}
>>> ZlibV3().to_dict()
/Users/d-v-b/.cache/uv/archive-v0/RPIFUeEX8IUCTWZnqf1cL/lib/python3.12/site-packages/numcodecs/zarr3.py:164: UserWarning: Numcodecs codecs are not in the Zarr version 3 specification and may not be supported by other zarr implementations.
  super().__init__(**codec_config)
{'name': 'numcodecs.zlib', 'configuration': {}}
>>> ZlibV3(fake_param=10).to_dict()
{'name': 'numcodecs.zlib', 'configuration': {'fake_param': 10}}

What you see here is a massive flaw in the slapdash design of the codecs in numcodecs.zarr3, which is that __init__ does not inspect the parameters at all! Zlib is configured with a level parameter, which has a default value of 1 for numcodecs.Zlib. But numcodecs.zarr3.Zlib doesn't know anything about the codec it wraps, and so it doesn't generate the default level parameter. This means numcodecs.zarr3.Zlib generates invalid zlib metadata! Great stuff.

d-v-b avatar Oct 03 '25 21:10 d-v-b

Recent changes in this PR:

  • all the codecs in zarr.codecs.numcodecs produce JSON like {name: "astype", "configuration": {"dtype": "int8", "astype": "uint8"}}. Note two things: the name does not have the numcodecs. prefix, and the configuration has been changed to be zarr v3 compliant (for example, using Zarr v3 data type JSON, instead of zarr v2 data type JSON). This is a breaking change to the JSON form produced by these codecs. Old versions of Zarr python will not be able to read this metadata.
  • all the codecs in zarr.codecs.numcodecs consume JSON formatted as described in the previous bullet point, but also the old style of {"name": "numcodecs.astype", "configuration": {"dtype": "|i1", "astype": "|u1"}}. This means they are compatible with old data.

I think this breakage is warranted for a few reasons:

  • we do not want to normalize the "numcodecs" name prefix. It is confusing, and potentially a source of needless divergence in the codec ecosystem. For example, we currently have "numcodecs.blosc", which produces different, incompatible zarr v3 JSON from the regular blosc codec defined in the zarr v3 spec. I find this extremely problematic and worth stopping. This PR effectively makes the zarr.codecs.numcodecs.Blosc codec an alias for zarr.codecs.BloscCodec.
  • we do not want abstraction leakage from zarr v2 into zarr v3. We have several codecs that use data type identifiers as part of their definition, like astype, delta, and fixedscaleoffset. The zarr.codecs.numcodecs versions of these codecs currently use the zarr v2 data type identifiers that numcodecs understands. This is highly problematic, because we have perfectly good zarr v3 data type identifiers that we chose specifically to decouple ourselves from numpy. Disseminating codecs that use the old numpy-style data type identifiers is asking for headaches.

So a basic question for @zarr-developers/python-core-devs: how much do we value preserving the current, problematic JSON serialization of the codecs in zarr.codecs.numcodecs, versus the value of keeping the codec ecosystem simpler?

d-v-b avatar Oct 28 '25 17:10 d-v-b

  • all the codecs in zarr.codecs.numcodecs produce JSON like {name: "astype", "configuration": {"dtype": "int8", "astype": "uint8"}}. Note two things: the name does not have the numcodecs. prefix, and the configuration has been changed to be zarr v3 compliant (for example, using Zarr v3 data type JSON, instead of zarr v2 data type JSON). This is a breaking change to the JSON form produced by these codecs. Old versions of Zarr python will not be able to read this metadata.
  • all the codecs in zarr.codecs.numcodecs consume JSON formatted as described in the previous bullet point, but also the old style of {"name": "numcodecs.astype", "configuration": {"dtype": "|i1", "astype": "|u1"}}. This means they are compatible with old data.

@d-v-b this behavior is consistent with the alias solution discussed in https://github.com/zarr-developers/zarr-extensions/pull/2, right? I support using the more interoperable alias for serialization moving forward, but ideally would like https://github.com/zarr-developers/zarr-extensions/pull/2 to be finalized/merged at the same time.

I have two other questions:

  • How do you handle a codec that doesn't have a configuration object? I couldn't quite tell from https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html#extension-definition whether the configuration key should be omitted or the object should be None or empty.
  • Can the codecs consume/produce v2 style filters/compressors/serializers (e.g., {id: "astype", "dtype": "int8", "astype": "uint8"})?

maxrjones avatar Nov 03 '25 15:11 maxrjones

whether the configuration key should be omitted or the object should be None or empty.

For a codec that takes no configuration at all, {'name': 'foo', 'configuration': {}} or {'name': 'foo'} or 'foo' (plain string) are all synonyms, as per the 3.1 spec.

Can the codecs consume/produce v2 style filters/compressors/serializers (e.g., {id: "astype", "dtype": "int8", "astype": "uint8"})?

In this PR yes, all the codecs that handle dtype stuff in their configuration will take the v2 dtypes metadata, even for a v3 codec. this is necessary for compatibility with existing data. But this PR makes a breaking change by ensuring that the v3 JSON form of such a codec uses the zarr v3 data type metadata.

d-v-b avatar Nov 03 '25 16:11 d-v-b

But this PR makes a breaking change by ensuring that the v3 JSON form of such a codec uses the zarr v3 data type metadata.

I agree with making this breaking change, favoring a simpler ecosystem over bug-for-bug compatibility.

maxrjones avatar Nov 03 '25 16:11 maxrjones

This change would be great! See also the zarr-any-numcodecs package that I created a while ago as a stop-gap solution for this problem

juntyr avatar Dec 08 '25 12:12 juntyr