Add v2 and v3 metadata support to codecs
This PR will give each codec a v2 and v3 JSON de/serialization routines.
depends on #3318
Codecov Report
:x: Patch coverage is 31.55985% with 1132 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 57.40%. Comparing base (b3e9aed) to head (ff7a1a5).
Additional details and impacted files
@@ Coverage Diff @@
## main #3332 +/- ##
==========================================
- Coverage 61.77% 57.40% -4.37%
==========================================
Files 85 106 +21
Lines 10165 11391 +1226
==========================================
+ Hits 6279 6539 +260
- Misses 3886 4852 +966
| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/zarr/abc/store.py | 38.29% <ø> (ø) |
|
| src/zarr/core/config.py | 38.88% <ø> (ø) |
|
| src/zarr/core/indexing.py | 69.35% <ø> (-0.12%) |
:arrow_down: |
| src/zarr/metadata/migrate_v3.py | 97.43% <100.00%> (-0.92%) |
:arrow_down: |
| src/zarr/core/dtype/__init__.py | 30.00% <0.00%> (ø) |
|
| src/zarr/core/metadata/v3.py | 58.36% <83.33%> (+0.36%) |
:arrow_up: |
| src/zarr/errors.py | 9.37% <0.00%> (-0.31%) |
:arrow_down: |
| src/zarr/abc/numcodec.py | 8.33% <0.00%> (-7.06%) |
:arrow_down: |
| src/zarr/api/asynchronous.py | 72.20% <0.00%> (ø) |
|
| src/zarr/registry.py | 64.17% <57.14%> (+0.78%) |
:arrow_up: |
| ... and 39 more |
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
this is now in a phase where I would really appreciate eyes from @zarr-developers/python-core-devs.
The goal of this PR is twofold:
-
To use the exact same codec classes to create Zarr V2 and Zarr V3 arrays. This will resolve a major problem for v2 -> v3 conversion. Today people have to switch out their codec classes, even when going from
numcodecs.Gziptozarr.codecs.gzip. This PR handles that translation internally. -
to gracefully handle numcodecs codecs, which includes codecs like
numcodecs.gzipwhich is basically unchanged in zarr v3, tonumcodecs.bloscwhich received breaking changes in zarr v3, all the way to codecs defined inimagecodecslike the jpeg codec, which isn't even standardized yet. This PR handles all three of those cases when creating zarr v2 or v3 arrays (because of goal 1).
In this PR, when a user shows up with compressor={"id": "gzip", "level": 1} or compressor=numcodecs.Gzip(level=1), we will resolve the compressor to an instance of zarr.codecs.GZipCodec. When a user shows up with a completely unknown codec that adheres to the numcodecs API, we will wrap the codec in a special wrapper class and make a best-effort attempt to create a valid codec pipeline around that codec.
This PR also adds typeddict classes for the v2 and v3 form of each codec, which was laborious but IMO worth it for type safety.
If you have time, please look this over and / or test this on your v2 -> v3 workloads. That would be extremely helpful. I think these changes are on the same scale as the data type changes, so this requires a lot of finesse and potentially follow-up PRs.
I think the last thing I need to do here is write a test that ensures compatibility between these changes and older versions of zarr python 3.x.
I am concerned that the additions to the codec API in this PR will be disruptive to people who implemented custom Zarr V3 codecs, e.g.., anyone who defined a class that inherited from in zarr.abc.codec. That argues for a totally new codec API, which is not a prospect I take lightly, but I think it's the best way to avoid breaking existing workflows.
I am concerned that the additions to the codec API in this PR will be disruptive to people who implemented custom Zarr V3 codecs, e.g.., anyone who defined a class that inherited from in
zarr.abc.codec. That argues for a totally new codec API, which is not a prospect I take lightly, but I think it's the best way to avoid breaking existing workflows.
yeah currently in virtual-tiff I get lots of failures (119 failed, 317 passed) with most due to KeyError: 'configuration' and some due to TypeError: Invalid JSON: {'name': 'numcodecs.zlib', 'configuration': {}}.
I think this would probably be an issue for other parsers too, such as gribberish and Sean's new HRRRarser.
The VirtualiZarr tests also fail with errors such as: TypeError: Zlib.__init__() got an unexpected keyword argument 'name'.
I am concerned that the additions to the codec API in this PR will be disruptive to people who implemented custom Zarr V3 codecs, e.g.., anyone who defined a class that inherited from in
zarr.abc.codec. That argues for a totally new codec API, which is not a prospect I take lightly, but I think it's the best way to avoid breaking existing workflows.yeah currently in virtual-tiff I get lots of failures (119 failed, 317 passed) with most due to
KeyError: 'configuration'and some due toTypeError: Invalid JSON: {'name': 'numcodecs.zlib', 'configuration': {}}.I think this would probably be an issue for other parsers too, such as gribberish and Sean's new HRRRarser.
The VirtualiZarr tests also fail with errors such as:
TypeError: Zlib.__init__() got an unexpected keyword argument 'name'.
This is super useful feedback. I'll add virtual-tiff as a dev dependency while I work out how to make these changes non-breaking.
yeah currently in virtual-tiff I get lots of failures (119 failed, 317 passed) with most due to
KeyError: 'configuration'and some due toTypeError: Invalid JSON: {'name': 'numcodecs.zlib', 'configuration': {}}.
More context for this:
>>> from numcodecs.zarr3 import Zlib as ZlibV3
>>> from numcodecs import Zlib
>>> Zlib().get_config()
{'id': 'zlib', 'level': 1}
>>> ZlibV3().to_dict()
/Users/d-v-b/.cache/uv/archive-v0/RPIFUeEX8IUCTWZnqf1cL/lib/python3.12/site-packages/numcodecs/zarr3.py:164: UserWarning: Numcodecs codecs are not in the Zarr version 3 specification and may not be supported by other zarr implementations.
super().__init__(**codec_config)
{'name': 'numcodecs.zlib', 'configuration': {}}
>>> ZlibV3(fake_param=10).to_dict()
{'name': 'numcodecs.zlib', 'configuration': {'fake_param': 10}}
What you see here is a massive flaw in the slapdash design of the codecs in numcodecs.zarr3, which is that __init__ does not inspect the parameters at all! Zlib is configured with a level parameter, which has a default value of 1 for numcodecs.Zlib. But numcodecs.zarr3.Zlib doesn't know anything about the codec it wraps, and so it doesn't generate the default level parameter. This means numcodecs.zarr3.Zlib generates invalid zlib metadata! Great stuff.
Recent changes in this PR:
- all the codecs in
zarr.codecs.numcodecsproduce JSON like{name: "astype", "configuration": {"dtype": "int8", "astype": "uint8"}}. Note two things: the name does not have thenumcodecs.prefix, and the configuration has been changed to be zarr v3 compliant (for example, using Zarr v3 data type JSON, instead of zarr v2 data type JSON). This is a breaking change to the JSON form produced by these codecs. Old versions of Zarr python will not be able to read this metadata. - all the codecs in
zarr.codecs.numcodecsconsume JSON formatted as described in the previous bullet point, but also the old style of{"name": "numcodecs.astype", "configuration": {"dtype": "|i1", "astype": "|u1"}}. This means they are compatible with old data.
I think this breakage is warranted for a few reasons:
- we do not want to normalize the "numcodecs" name prefix. It is confusing, and potentially a source of needless divergence in the codec ecosystem. For example, we currently have
"numcodecs.blosc", which produces different, incompatible zarr v3 JSON from the regular blosc codec defined in the zarr v3 spec. I find this extremely problematic and worth stopping. This PR effectively makes thezarr.codecs.numcodecs.Blosccodec an alias forzarr.codecs.BloscCodec. - we do not want abstraction leakage from zarr v2 into zarr v3. We have several codecs that use data type identifiers as part of their definition, like
astype,delta, andfixedscaleoffset. Thezarr.codecs.numcodecsversions of these codecs currently use the zarr v2 data type identifiers that numcodecs understands. This is highly problematic, because we have perfectly good zarr v3 data type identifiers that we chose specifically to decouple ourselves from numpy. Disseminating codecs that use the old numpy-style data type identifiers is asking for headaches.
So a basic question for @zarr-developers/python-core-devs: how much do we value preserving the current, problematic JSON serialization of the codecs in zarr.codecs.numcodecs, versus the value of keeping the codec ecosystem simpler?
- all the codecs in zarr.codecs.numcodecs produce JSON like {name: "astype", "configuration": {"dtype": "int8", "astype": "uint8"}}. Note two things: the name does not have the numcodecs. prefix, and the configuration has been changed to be zarr v3 compliant (for example, using Zarr v3 data type JSON, instead of zarr v2 data type JSON). This is a breaking change to the JSON form produced by these codecs. Old versions of Zarr python will not be able to read this metadata.
- all the codecs in zarr.codecs.numcodecs consume JSON formatted as described in the previous bullet point, but also the old style of {"name": "numcodecs.astype", "configuration": {"dtype": "|i1", "astype": "|u1"}}. This means they are compatible with old data.
@d-v-b this behavior is consistent with the alias solution discussed in https://github.com/zarr-developers/zarr-extensions/pull/2, right? I support using the more interoperable alias for serialization moving forward, but ideally would like https://github.com/zarr-developers/zarr-extensions/pull/2 to be finalized/merged at the same time.
I have two other questions:
- How do you handle a codec that doesn't have a configuration object? I couldn't quite tell from https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html#extension-definition whether the configuration key should be omitted or the object should be None or empty.
- Can the codecs consume/produce v2 style filters/compressors/serializers (e.g.,
{id: "astype", "dtype": "int8", "astype": "uint8"})?
whether the configuration key should be omitted or the object should be None or empty.
For a codec that takes no configuration at all, {'name': 'foo', 'configuration': {}} or {'name': 'foo'} or 'foo' (plain string) are all synonyms, as per the 3.1 spec.
Can the codecs consume/produce v2 style filters/compressors/serializers (e.g., {id: "astype", "dtype": "int8", "astype": "uint8"})?
In this PR yes, all the codecs that handle dtype stuff in their configuration will take the v2 dtypes metadata, even for a v3 codec. this is necessary for compatibility with existing data. But this PR makes a breaking change by ensuring that the v3 JSON form of such a codec uses the zarr v3 data type metadata.
But this PR makes a breaking change by ensuring that the v3 JSON form of such a codec uses the zarr v3 data type metadata.
I agree with making this breaking change, favoring a simpler ecosystem over bug-for-bug compatibility.
This change would be great! See also the zarr-any-numcodecs package that I created a while ago as a stop-gap solution for this problem