
Draft: jigsaw codecs: allow codecs to specify the buffers they work on

Open keewis opened this issue 2 months ago • 1 comments

For my work on the sparse codec (and after discussing with @d-v-b and @jhamman at the zarr summit), I've noticed that it should be possible to have the codecs declare their input and output buffer types. The codec pipeline can then verify that the codecs form a chain of buffer types (kind of like jigsaw puzzle pieces), and infer the codec pipeline's buffer prototype as the input of the first array-to-array codec and the output of the last bytes-to-bytes codec.
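A minimal sketch of the jigsaw check, to make the idea concrete. Every name here (`CodecSpec`, `NDBuffer`, `BytesBuffer`, `infer_pipeline_types`) is hypothetical and is not the zarr-python API. It also simplifies by taking the pipeline's types from the first and last codec in the list, whatever their kind.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for buffer types; not the zarr-python classes.
class NDBuffer:
    """Placeholder for an n-dimensional array buffer."""


class BytesBuffer:
    """Placeholder for a flat bytes buffer."""


@dataclass(frozen=True)
class CodecSpec:
    """A codec that declares the buffer types it consumes and produces."""

    name: str
    input_type: type
    output_type: type


def infer_pipeline_types(codecs):
    """Check that adjacent codecs fit together and return (input, output) types."""
    if not codecs:
        raise ValueError("a codec pipeline needs at least one codec")
    for prev, nxt in zip(codecs, codecs[1:]):
        # The output of one codec must be acceptable as input to the next.
        if not issubclass(prev.output_type, nxt.input_type):
            raise TypeError(
                f"{prev.name} produces {prev.output_type.__name__}, "
                f"but {nxt.name} expects {nxt.input_type.__name__}"
            )
    return codecs[0].input_type, codecs[-1].output_type


pipeline = [
    CodecSpec("transpose", NDBuffer, NDBuffer),      # array-to-array
    CodecSpec("bytes", NDBuffer, BytesBuffer),       # array-to-bytes
    CodecSpec("gzip", BytesBuffer, BytesBuffer),     # bytes-to-bytes
]
pipeline_input, pipeline_output = infer_pipeline_types(pipeline)
```

With this shape, a pipeline whose pieces do not fit (say, a bytes-to-bytes codec placed before the array-to-bytes codec) fails at construction time with a `TypeError` instead of at encode time.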

TODO:

  • [ ] Add unit tests and/or doctests in docstrings
  • [ ] Add docstrings and API docs for any new/modified user-facing classes and functions
  • [ ] New/modified features documented in docs/user-guide/*.md
  • [ ] Changes documented as a new file in changes/
  • [ ] GitHub Actions have all passed
  • [ ] Test coverage is 100% (Codecov passes)

keewis avatar Oct 16 '25 22:10 keewis

It just dawned on me that we can potentially split up the sparse codec (which is an array-to-bytes codec) into two parts: an array-to-array codec that extracts the metadata and component arrays of the sparse array and creates a specialized "multi-array buffer" for sparse arrays, and a generalized array-to-bytes codec that takes the "multi-array buffer" and packs it into bytes. This means that the metadata we extracted has to live in the array-to-array codec's configuration.

Then, should we want a similar procedure for a different array type (e.g. masked arrays or geoarrow-encoded geometry arrays), we can just create a specialized pair of an array-to-array codec and a "multi-array buffer" type, and reuse the "multi-array to bytes" codec.
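A rough sketch of that split, using only the standard library. All names (`MultiArrayBuffer`, `SparseToMultiArray`, `MultiArrayToBytes`) and the byte layout are assumptions made for illustration, not zarr-python API or a proposed format. For brevity the input is a dense 1-D list rather than a real sparse array, and the payload uses native byte order.

```python
import json
import struct
from array import array
from dataclasses import dataclass


@dataclass
class MultiArrayBuffer:
    """Hypothetical buffer holding several named component arrays."""

    components: dict


@dataclass
class SparseToMultiArray:
    """Hypothetical array-to-array codec: values -> COO-style components.

    The extracted metadata (length, fill value) lives in the codec config.
    """

    length: int
    fill_value: float = 0.0

    def encode(self, dense):
        coords = array("q", (i for i, v in enumerate(dense) if v != self.fill_value))
        data = array("d", (dense[i] for i in coords))
        return MultiArrayBuffer({"coords": coords, "data": data})

    def decode(self, buf):
        dense = [self.fill_value] * self.length
        for i, v in zip(buf.components["coords"], buf.components["data"]):
            dense[i] = v
        return dense


class MultiArrayToBytes:
    """Hypothetical generic array-to-bytes codec for any MultiArrayBuffer."""

    def encode(self, buf):
        # Header: JSON list of [name, typecode, element count] per component.
        entries = [[n, a.typecode, len(a)] for n, a in buf.components.items()]
        header = json.dumps(entries).encode()
        payload = b"".join(a.tobytes() for a in buf.components.values())
        return struct.pack("<I", len(header)) + header + payload

    def decode(self, raw):
        (header_len,) = struct.unpack_from("<I", raw, 0)
        offset = 4 + header_len
        components = {}
        for name, typecode, count in json.loads(raw[4:offset]):
            arr = array(typecode)
            nbytes = count * arr.itemsize
            arr.frombytes(raw[offset:offset + nbytes])
            components[name] = arr
            offset += nbytes
        return MultiArrayBuffer(components)


sparse_codec = SparseToMultiArray(length=6)
packer = MultiArrayToBytes()
original = [0.0, 1.5, 0.0, 0.0, 2.5, 0.0]
raw = packer.encode(sparse_codec.encode(original))
roundtrip = sparse_codec.decode(packer.decode(raw))
```

In this sketch `MultiArrayToBytes` knows nothing about sparsity, so a masked-array codec would only need its own array-to-array half that emits, for example, `mask` and `data` components.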

keewis avatar Oct 16 '25 22:10 keewis