
Chunked compression indices

juntyr opened this issue 6 months ago • 6 comments

I'm working on a compressor where chunked encoding / decoding is possible, but during encoding we need to

  1. know the data indices as part of the entire dataset (essentially not just in the chunk but with the chunk's offset)
  2. access some data in nearby chunks (only during encode)

From my current understanding of the V3 Codec interface, this doesn't seem to be supported, but I'd be very happy to be wrong. Perhaps the encoding API would also need to be provided with the chunk slices, similar to the partial encoding API. Though I'm unsure how reading from nearby chunks could be supported, unless the encode-multiple API would always include all chunks.

Is there a way to support such a compressor in Zarr?

Thank you for your help!

juntyr avatar Jun 09 '25 05:06 juntyr

Related idea https://github.com/zarr-developers/zarr-specs/issues/346

LDeakin avatar Jun 09 '25 05:06 LDeakin

know the data indices as part of the entire dataset (essentially not just in the chunk but with the chunk's offset)

This is not possible today but I think some small changes to our array / codec API could get us closer.

The abstract routine for encoding a single chunk has this signature:

https://github.com/zarr-developers/zarr-python/blob/6193fd9b3bd23f0aa9676b489e62f224e4325b3e/src/zarr/abc/codec.py#L131-L134

if we added information about a chunk's location in the chunk grid to chunk_spec, then it would be possible for codecs to use this information when encoding.
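As a rough illustration of what that could look like (a sketch only — `ChunkSpecWithLocation`, `grid_position`, and `global_offset` are hypothetical names, not part of zarr-python's `ArraySpec`):

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class ChunkSpecWithLocation:
    shape: tuple[int, ...]          # shape of this chunk
    grid_position: tuple[int, ...]  # index of the chunk in the chunk grid

    @property
    def global_offset(self) -> tuple[int, ...]:
        # offset of the chunk's first element within the full array
        return tuple(p * s for p, s in zip(self.grid_position, self.shape))


def encode_with_offset(chunk: np.ndarray, spec: ChunkSpecWithLocation) -> np.ndarray:
    # Toy "codec": subtract each element's sum of global coordinates,
    # just to show the offset information reaching the encode step.
    coords = np.indices(spec.shape)
    for dim, off in enumerate(spec.global_offset):
        coords[dim] += off
    return chunk - coords.sum(axis=0)


spec = ChunkSpecWithLocation(shape=(2, 2), grid_position=(1, 0))
print(spec.global_offset)  # (2, 0)
```

The key point is just that the grid position travels alongside the chunk shape, so a codec can recover each element's position in the full array without any other change to the encode signature.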

But I think there are other complications with your idea -- making the encoding step depend on reading adjacent chunks makes the process of encoding the entire array order-dependent. If I understand this correctly, the result of writing a chunk depends on whether adjacent chunks have been written or are empty. To make writing an entire array deterministic, you need chunks to be written in a specific order, and we don't have abstractions in zarr-python right now to easily support this.

d-v-b avatar Jun 09 '25 06:06 d-v-b

know the data indices as part of the entire dataset (essentially not just in the chunk but with the chunk's offset)

This is not possible today but I think some small changes to our array / codec API could get us closer.

The abstract routine for encoding a single chunk has this signature:

zarr-python/src/zarr/abc/codec.py, lines 131 to 134 (at 6193fd9):

```python
async def _encode_single(
    self, chunk_data: CodecInput, chunk_spec: ArraySpec
) -> CodecOutput | None:
    raise NotImplementedError
```

if we added information about a chunk's location in the chunk grid to chunk_spec, then it would be possible for codecs to use this information when encoding.

That would be a great help! What would the process be for making this change?

But I think there are other complications with your idea -- making the encoding step depend on reading adjacent chunks makes the process of encoding the entire array order-dependent. If I understand this correctly, the result of writing a chunk depends on whether adjacent chunks have been written or are empty. To make writing an entire array deterministic, you need chunks to be written in a specific order, and we don't have abstractions in zarr-python right now to easily support this.

The encoding would only depend on reading the original (array) data from adjacent chunks, i.e. it does not depend on the order of writing to disk. To give some context, my compressor needs to read a stencil of adjacent array elements during encoding. So instead of just receiving the CodecInput, it would also need a way to read the CodecInput of adjacent chunks.
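To make the stencil idea concrete, here is a minimal sketch under assumed names (`read_window` and `encode_with_stencil` are purely illustrative; nothing like this exists in the codec API today). The pipeline would hand the codec a window of the original array extending one element past the chunk on each side:

```python
import numpy as np


def read_window(
    data: np.ndarray, start: tuple[int, ...], shape: tuple[int, ...], halo: int
) -> np.ndarray:
    # Extract the chunk plus a halo of neighbouring elements, zero-padding
    # at the array edges (an "all-zero stencil" fallback at the boundary).
    padded = np.pad(data, halo)
    slices = tuple(slice(s, s + n + 2 * halo) for s, n in zip(start, shape))
    return padded[slices]


def encode_with_stencil(window: np.ndarray) -> np.ndarray:
    # Toy stencil codec (assumes a one-element halo): store each element
    # minus the mean of its four neighbours. The output covers only the
    # chunk interior, not the halo.
    up, down = window[:-2, 1:-1], window[2:, 1:-1]
    left, right = window[1:-1, :-2], window[1:-1, 2:]
    centre = window[1:-1, 1:-1]
    return centre - (up + down + left + right) / 4


data = np.arange(16.0).reshape(4, 4)
window = read_window(data, start=(0, 0), shape=(2, 2), halo=1)
print(encode_with_stencil(window).shape)  # (2, 2)
```

Since the window is read from the original array data, the encoded output for a chunk is fully determined by the array contents, independent of write order.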

juntyr avatar Jun 09 '25 06:06 juntyr

That would be a great help! What would the process be for making this change?

There would need to be a codec or some other routine that could actually use this information. Right now, I don't think any of the codecs defined in the library would use the offset, but it's possible that there are other locations where adding offset information to the chunk spec would simplify things, for example in our indexing routines.

The encoding would only depend on reading the original (array) data from adjacent chunks, i.e. it does not depend on the order of writing to disk. To give some context, my compressor needs to read a stencil of adjacent array elements during encoding. So instead of just receiving the CodecInput, it would also need a way to read the CodecInput of adjacent chunks.

Ok this makes much more sense than what I had imagined. For this to work, zarr-python would need to know that this codec consumes chunks that are larger than the chunk size of the array reported in metadata. I'm not sure how best to encode this information, but it would be interesting to see some concrete ideas for it.

d-v-b avatar Jun 09 '25 07:06 d-v-b

That would probably be an even better approach. A codec could report that it needs additional stencil elements in different dimensions. Codecs that come before in the pipeline could then either perform duplicate work on the stencils (probably easier) or coordinate to provide the stencil values. The codecs would still be expected to only produce encoded data for the area excluding the stencil. All existing codecs would assume an all-zero stencil, but other codecs could ask for a larger one.
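One possible shape for that declaration, sketched with hypothetical names (`StencilRequirement` and the `stencil` attribute are assumptions, not zarr-python API):

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class StencilRequirement:
    # extra elements required on each side of the chunk, per dimension
    before: tuple[int, ...]
    after: tuple[int, ...]


class StencilCodec:
    # The codec advertises the halo it needs; codecs earlier in the
    # pipeline (or the pipeline itself) would be responsible for
    # supplying a window enlarged by this amount.
    stencil = StencilRequirement(before=(1, 1), after=(1, 1))

    def encode(self, window: np.ndarray) -> np.ndarray:
        # Consume the chunk plus stencil, but emit encoded data covering
        # only the chunk interior (identity here, for illustration).
        b, a = self.stencil.before, self.stencil.after
        interior = tuple(
            slice(bi, window.shape[d] - ai)
            for d, (bi, ai) in enumerate(zip(b, a))
        )
        return window[interior]


codec = StencilCodec()
print(codec.encode(np.zeros((4, 4))).shape)  # (2, 2)
```

A codec with no `stencil` attribute (or an all-zero one) would behave exactly as today, so the extension stays backwards-compatible.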

juntyr avatar Jun 09 '25 07:06 juntyr

In my use case, the compressor needs to analyse relationships between neighbouring elements. If we don't use a stencil, there will be artefacts at the chunk boundaries. Once I can share the details of my exact use case (still unpublished research), I'll reach out again with the specifics. Then it would probably be easier to see if we can come up with a way to support it.

In any case, providing info on the chunk offsets would be useful in general (not least to warn about boundary artefacts), so perhaps I could help implement this in the meantime?

juntyr avatar Jun 09 '25 08:06 juntyr