v3 stores: Implement efficient get/set_partial_values

Open jstriebel opened this issue 2 years ago • 2 comments

In #1096, get/set_partial_values methods were introduced to Zarr v3 stores. The provided implementation is a viable fallback for stores that cannot read and write partial objects. Other stores, however, should implement optimized methods, for example fsspec-based stores (using read_block). It might also be useful for stores to indicate whether they have fast partial read/write methods, so that strategies such as partial decompression can be selected automatically.
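To make the idea concrete, here is a minimal sketch of the two kinds of store described above. It is not the zarr-python implementation: the class names, the `supports_efficient_partial_reads` flag, and `choose_strategy` are hypothetical, and the sketch assumes each request is a `(key, (start, length))` pair with a non-negative `start`, where `length` may be `None` to read to the end and missing keys yield `None`.

```python
import os
import tempfile


class DictStore:
    """In-memory store: falls back to loading the full value and slicing it."""

    supports_efficient_partial_reads = False  # hypothetical capability flag

    def __init__(self):
        self._data = {}

    def __setitem__(self, key, value):
        self._data[key] = bytes(value)

    def get_partial_values(self, key_ranges):
        results = []
        for key, (start, length) in key_ranges:
            value = self._data.get(key)
            if value is None:
                results.append(None)  # missing key
                continue
            end = None if length is None else start + length
            results.append(value[start:end])
        return results


class FileStore:
    """File-backed store: seeks to the offset and reads only the requested bytes."""

    supports_efficient_partial_reads = True  # hypothetical capability flag

    def __init__(self, root):
        self._root = root

    def __setitem__(self, key, value):
        # Flat keys only, to keep the sketch short.
        with open(os.path.join(self._root, key), "wb") as f:
            f.write(bytes(value))

    def get_partial_values(self, key_ranges):
        results = []
        for key, (start, length) in key_ranges:
            path = os.path.join(self._root, key)
            if not os.path.exists(path):
                results.append(None)  # missing key
                continue
            with open(path, "rb") as f:
                f.seek(start)
                results.append(f.read() if length is None else f.read(length))
        return results


def choose_strategy(store):
    """Pick a read strategy from the (hypothetical) capability flag."""
    fast = getattr(store, "supports_efficient_partial_reads", False)
    return "partial" if fast else "full"


payload = bytes(range(16))
requests = [("chunk0", (4, 3)), ("chunk0", (12, None)), ("missing", (0, 1))]
for store in (DictStore(), FileStore(tempfile.mkdtemp())):
    store["chunk0"] = payload
    print(type(store).__name__, choose_strategy(store), store.get_partial_values(requests))
```

Both stores return the same bytes for the same requests; only the amount of data touched differs, which is what the flag is meant to advertise to the caller.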

As a follow-up, the new get/set_partial_values methods could be used for the actual partial decompression in the PartialReadBuffer, instead of the current store-specific implementation.

Follow-up to #1096

jstriebel avatar Aug 03 '22 14:08 jstriebel

I believe that the most important use case for this is actually uncompressed arrays! That's a much simpler code path and needs no partial-reader (it also happens to be the only one important to me for now).

How are you proposing that get_partial_buffer should be called? At the moment in (v2) Array._get_selection we iterate over the selections for each chunk, so we have the information right before handing off to the store.

martindurant avatar Aug 03 '22 14:08 martindurant

I believe that the most important use case for this is actually uncompressed arrays! That's a much simpler code path and needs no partial-reader (it also happens to be the only one important to me for now).

Indeed, that's a great use-case!

How are you proposing that get_partial_buffer should be called? At the moment in (v2) Array._get_selection we iterate over the selections for each chunk, so we have the information right before handing off to the store.

I'll try to dump my thoughts on this:

I guess there are at least two ways:

  • Simply store the partial data as-is and pass it on,
  • use something like the PartialReadBuffer, or extend it, so that the partial data has an interface similar to that of the whole chunk.
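As a rough illustration of the second option, the sketch below wraps a store key in an object that can be sliced like the full chunk bytes but only fetches the requested ranges. `LazyChunk` and `TinyStore` are hypothetical names invented for this example, and the `get_partial_values` signature (a list of `(key, (start, length))` pairs) is an assumption; this is not how the existing PartialReadBuffer is written.

```python
import struct


class TinyStore:
    """Hypothetical in-memory store exposing get_partial_values."""

    def __init__(self, data):
        self._data = dict(data)

    def get_partial_values(self, key_ranges):
        return [self._data[key][start:start + length]
                for key, (start, length) in key_ranges]


class LazyChunk:
    """Bytes-like view over one stored chunk that loads ranges on demand."""

    def __init__(self, store, key):
        self._store = store
        self._key = key
        self._cache = {}   # (start, length) -> bytes already fetched
        self.requests = 0  # number of store round trips, for illustration

    def __getitem__(self, sl):
        # Only plain slices with explicit bounds are supported in this sketch.
        if (not isinstance(sl, slice) or sl.step not in (None, 1)
                or sl.start is None or sl.stop is None):
            raise TypeError("only contiguous slices with explicit bounds are supported")
        request = (sl.start, sl.stop - sl.start)
        if request not in self._cache:
            (data,) = self._store.get_partial_values([(self._key, request)])
            self._cache[request] = data
            self.requests += 1
        return self._cache[request]


# Made-up chunk layout: a 4-byte little-endian length, then the payload, then other data.
raw = struct.pack("<I", 3) + b"abc" + b"defgh"
chunk = LazyChunk(TinyStore({"c0": raw}), "c0")

(n,) = struct.unpack("<I", chunk[0:4])  # first read: the small header
first_block = chunk[4:4 + n]            # second read: only the bytes that are needed
print(first_block, chunk.requests)
```

A decoder written against such an interface does not need to know whether the bytes came from a full read or from several partial reads.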

To solve this more holistically, the compressor (or a dummy for uncompressed arrays) should be able to tell whether it can decode partial data, and have some interface for "demanding" data. In the uncompressed use-case, the requested array indices can be translated directly to chunk offsets, but in the blosc case, or other cases with an index, the decoder might need to read data in several passes (e.g. first getting some index, then getting the actual data based on that index). For such cases, the PartialReadBuffer is a nice abstraction that allows reloading data in several passes, depending on the decoder. If the pattern is always to maybe get some data upfront for a chunk, after which the decoder can translate indices to offsets, this might also be a viable option.
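For the uncompressed case, the translation from array indices to chunk offsets can be sketched as follows. This assumes a C-ordered (row-major) chunk with a fixed item size and a selection made of step-1 slices; `byte_ranges` is a hypothetical helper, not an existing zarr function. It returns `(start, length)` pairs and merges ranges that are adjacent on disk.

```python
import itertools


def byte_ranges(chunk_shape, itemsize, selection):
    """Translate a tuple of step-1 slices into (start, length) byte ranges.

    Assumes an uncompressed, C-ordered chunk with fixed-size items.
    """
    # Resolve each slice against its dimension to get concrete (lo, hi) bounds.
    bounds = [s.indices(n)[:2] for s, n in zip(selection, chunk_shape)]

    # Element strides for C order: the last dimension varies fastest.
    strides = [1] * len(chunk_shape)
    for i in range(len(chunk_shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * chunk_shape[i + 1]

    *outer, (lo, hi) = bounds
    run = (hi - lo) * itemsize  # bytes in one contiguous run along the last axis
    if run <= 0:
        return []

    ranges = []
    for idx in itertools.product(*(range(a, b) for a, b in outer)):
        offset = sum(i * s for i, s in zip(idx, strides)) + lo
        start = offset * itemsize
        if ranges and ranges[-1][0] + ranges[-1][1] == start:
            # Adjacent to the previous run: extend it instead of adding a new range.
            ranges[-1] = (ranges[-1][0], ranges[-1][1] + run)
        else:
            ranges.append((start, run))
    return ranges


# A (4, 5) chunk of 2-byte items: a 2x2 window needs one range per row,
# while two full rows collapse into a single range.
print(byte_ranges((4, 5), 2, (slice(1, 3), slice(2, 4))))
print(byte_ranges((4, 5), 2, (slice(1, 3), slice(None))))
```

The resulting pairs are in the shape a store's partial-read method could consume directly; a compressed codec with an internal index would need an extra pass before it can produce such ranges.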

PS: First, we still need to implement efficient get/set_partial_values for stores where this is possible, to gain anything from it.

jstriebel avatar Aug 04 '22 21:08 jstriebel