Refactor CodecPipeline for flexibility

Open • TomAugspurger opened this issue 8 months ago • 1 comment

Zarr version

v3

Numcodecs version

na

Python Version

na

Operating System

na

Installation

na

Description

Currently, the CodecPipeline interface works by passing around Iterable[tuple[...]], with a different tuple shape for each method. For example, see decode: https://github.com/zarr-developers/zarr-python/blob/5ff3fbe5fe1488310301e9d2ae56a9880d1ddfb2/src/zarr/abc/codec.py#L115. The argument types per method are:

  • decode: Iterable[tuple[CodecOutput | None, ArraySpec]]
  • encode: Iterable[tuple[CodecInput | None, ArraySpec]]
  • read: Iterable[tuple[ByteGetter, ArraySpec, SelectorTuple, SelectorTuple, bool]]
  • write: Iterable[tuple[ByteSetter, ArraySpec, SelectorTuple, SelectorTuple, bool]]

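At the moment, we have no way to evolve the interface in a backwards-compatible way: adding an element to any of these tuples breaks every implementation that unpacks them by position. https://github.com/zarr-developers/zarr-python/discussions/2845 noted an accidental API break.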
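To make the compatibility problem concrete, here is a minimal sketch using stand-in values rather than real zarr-python objects. The function `read_batch`, the element names, and the sixth "extra" element are all hypothetical; the only point is that positional unpacking of a fixed-size tuple fails as soon as an element is added.

```python
# Stand-in values only; in zarr-python the entries would be a ByteGetter,
# an ArraySpec, two SelectorTuples, and a bool.

def read_batch(batch_info):
    """Hypothetical pipeline implementation written against 5-tuples."""
    results = []
    # Positional unpacking hard-codes the tuple length.
    for byte_getter, array_spec, chunk_sel, out_sel, is_complete in batch_info:
        results.append((byte_getter, is_complete))
    return results


current = [("getter", "spec", (slice(0, 2),), (slice(0, 2),), True)]
print(read_batch(current))

# A hypothetical future caller that appends a sixth element.
extended = [("getter", "spec", (slice(0, 2),), (slice(0, 2),), True, "extra")]
try:
    read_batch(extended)
except ValueError as exc:
    print(f"existing implementation breaks: {exc}")
```

Any third-party pipeline that unpacks the tuples this way would hit the same ValueError.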

One option for gracefully evolving the spec here, which I might need for https://github.com/zarr-developers/zarr-python/issues/2904, is to replace the tuples with dataclasses. We can safely add new optional fields to the dataclass without breaking backwards compatibility.

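We can define __len__ and __iter__ on the dataclasses and pin their return values to the current tuple layout, so that existing code which unpacks the items by position keeps working even after new fields are added.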

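from dataclasses import dataclass

# CodecOutput and ArraySpec are the existing zarr-python types.
@dataclass(frozen=True, eq=True)
class DecodeChunksAndSpecs:
    codec_output: CodecOutput | None
    array_spec: ArraySpec

    # Pinned to the current two-element tuple layout, even if fields are added later.
    def __len__(self) -> int:
        return 2

    def __iter__(self):
        yield self.codec_output
        yield self.array_spec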

We could also warn when the fields are accessed through iteration or by position, to encourage pipeline implementations to migrate to attribute access.
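A rough sketch of how the warning and a later optional field might fit together is below. It is only an illustration under stated assumptions: the field types are replaced with Any so the block runs without zarr-python, the `extra` field and the `_warn` helper are hypothetical, and DeprecationWarning is just one possible choice of warning category.

```python
import warnings
from dataclasses import dataclass
from typing import Any, Iterator

# Fields exposed through the legacy tuple-style protocol, in tuple order.
_LEGACY_FIELDS = ("codec_output", "array_spec")


@dataclass(frozen=True)
class DecodeChunksAndSpecs:
    codec_output: Any  # stand-in for CodecOutput | None
    array_spec: Any    # stand-in for ArraySpec
    extra: Any = None  # hypothetical new optional field, not in the legacy tuple

    def _warn(self) -> None:
        # stacklevel=3 points at the code that iterates or indexes the object.
        warnings.warn(
            "Tuple-style access is deprecated; use attribute access instead.",
            DeprecationWarning,
            stacklevel=3,
        )

    def __len__(self) -> int:
        return len(_LEGACY_FIELDS)

    def __iter__(self) -> Iterator[Any]:
        self._warn()
        return iter(tuple(getattr(self, name) for name in _LEGACY_FIELDS))

    def __getitem__(self, index: int) -> Any:
        self._warn()
        return tuple(getattr(self, name) for name in _LEGACY_FIELDS)[index]


item = DecodeChunksAndSpecs("bytes", "spec", extra="new")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    codec_output, array_spec = item  # legacy unpacking still yields two values

print(codec_output, array_spec, item.extra)
print([str(w.message) for w in caught])
```

Existing implementations keep unpacking two values and only see a warning, while new code can read `item.extra` by name.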

Steps to reproduce

na

Additional output

No response

TomAugspurger, May 09 '25 13:05

Another issue is that CodecPipeline.evolve_from_array_spec is currently never called. In zarrs-python we need the ArrayMetadata and ArrayConfig to properly support a broader range of Zarr V2 arrays and configurations. It would also be very helpful if the array store could be passed to the CodecPipeline constructor.

Right now it looks like zarrs-python is the only public user of CodecPipeline. IMHO you should just break this API in zarr-python 3.1.

cc: @ilan-gold

LDeakin, May 31 '25 02:05