
Add easy way to iterate through every chunk of an array

Open dstansby opened this issue 1 year ago • 12 comments

I am finding myself wanting to apply an operation (e.g., thresholding) on every chunk in an array. It would therefore be nice if there was an easy way to iterate through each chunk of the array.

Something like:

for i, chunk in enumerate(zarr_arr.all_chunks):
    zarr_arr.all_chunks[i] = np.clip(chunk, lower, upper)

This is sort of similar to block indexing, but iterating over Array.blocks only iterates over the first axis of blocks, and gives you back all the chunks along the other axes.
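For illustration only, here is a minimal sketch of how chunk-by-chunk iteration over all axes could be emulated today using nothing but the array's shape and chunk shape. The helper name iter_chunk_slices is hypothetical and not part of zarr; it assumes a regular chunk grid.

```python
import itertools
import math


def iter_chunk_slices(shape, chunks):
    """Yield one tuple of slices per chunk, covering every axis of the grid."""
    # Number of chunks along each axis (the last chunk may be partial).
    grid = [math.ceil(s / c) for s, c in zip(shape, chunks)]
    for idx in itertools.product(*(range(n) for n in grid)):
        yield tuple(
            slice(i * c, min((i + 1) * c, s))
            for i, c, s in zip(idx, chunks, shape)
        )


# Hypothetical usage with an array-like that exposes .shape and .chunks:
# for sel in iter_chunk_slices(arr.shape, arr.chunks):
#     arr[sel] = np.clip(arr[sel], lower, upper)
```

Each yielded selection is aligned to exactly one chunk, so a read-modify-write through it touches only that chunk.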

dstansby avatar Oct 30 '24 17:10 dstansby

something like this? https://github.com/zarr-developers/zarr-python/blob/4c3081c9b4678d1de438e0d2ee8e0fa368a61f17/src/zarr/core/array.py#L893?

d-v-b avatar Oct 30 '24 17:10 d-v-b

Hard to tell just looking at the code, but from the docstring I think that sounds about right 😄

dstansby avatar Oct 30 '24 17:10 dstansby

Can we remake .blocks.__iter__ to be more like np.nditer instead?

dcherian avatar Oct 30 '24 18:10 dcherian

I think that would be quite a big breaking API change, so probably worth creating a freshly named property or method?

dstansby avatar Oct 30 '24 18:10 dstansby

i'm not a big fan of .blocks to be honest, i don't think it's a very intuitive API for something as simple as "access data chunk by chunk". i'm not saying we should remove it, but we should take the 2 -> 3 transition as an opportunity to think about whether there's a better way to do things

d-v-b avatar Oct 31 '24 11:10 d-v-b

I will add a different but overlapping use-case that I have run into a few times.

We want to get random subsets of the data perfectly aligned to chunks. At the moment, we either have to use low-level APIs from zarr to get full blocks in parallel, or create an integer index that looks something like 0, 1, ..., chunk_size, chunk_size * N, (chunk_size * N) + 1, ..., (chunk_size * (N+1)) - 1, ... along the relevant axis, but this somehow feels wrong and/or inefficient.

I could see a public API like async get_chunk(key) that could then also be used by an iterator, something like async iter(order).

I'd be happy to contribute if there were interest. zarrs has this functionality and it's quite handy: https://docs.rs/zarrs/latest/zarrs/array/struct.Array.html#method.retrieve_chunk
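As a rough sketch of the chunk-aligned random subset idea, the selection can be described as a short list of slices (one per picked chunk) instead of one large integer index. The function name and parameters below are hypothetical and only illustrate the arithmetic along a single axis.

```python
import random


def chunk_aligned_slices(axis_len, chunk_size, n_chunks, seed=None):
    """Pick n_chunks random chunks along one axis and return them as slices."""
    rng = random.Random(seed)
    # Total number of chunks along the axis (ceiling division).
    total = -(-axis_len // chunk_size)
    picked = sorted(rng.sample(range(total), n_chunks))
    # Clamp the stop so a partial final chunk does not run past the axis.
    return [
        slice(i * chunk_size, min((i + 1) * chunk_size, axis_len))
        for i in picked
    ]
```

Each slice could then be read independently, which is where a per-chunk getter would fit.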

ilan-gold avatar May 22 '25 13:05 ilan-gold

@ilan-gold I think we already have these as private array methods:

would it be helpful if these were made public API?

d-v-b avatar May 22 '25 13:05 d-v-b

oh wow @d-v-b let me look at these. I'll pass these on to the people who are currently working on this. I will report back our experience and we can go from there.

ilan-gold avatar May 22 '25 13:05 ilan-gold

we are missing something like get_chunk(array, (0,0,0)) -> array-like

d-v-b avatar May 22 '25 13:05 d-v-b

Right @d-v-b that is what I'm looking for, to be able to take advantage of pipeline-level parallelism over a selection of different chunks without having to dive into the async API. At the moment we create BlockIndexer objects for every chunk individually and then gather the results.

At the moment we have something like:

for i, idx in enumerate(chunk_idxs):
    chunk_slice = self.chunks_slices[idx]
    array_idx = self.array_idxs[idx]
    array = self.arrays[array_idx]
    indexer = BasicIndexer(
        chunk_slice,
        shape=array.metadata.shape,
        chunk_grid=array.metadata.chunk_grid,
    )
    # this can also be a gpu buffer
    buffer = prototype.nd_buffer.from_numpy_array(out[local_slices[i]])
    task = array._async_array._get_selection(
        indexer, prototype=prototype, out=buffer
    )
    tasks.append(task)

which, modulo using public APIs (i.e., opening an async Array from the start), is about what I'd expect to need in order to achieve this result without creating a giant integer array to do the indexing
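The general shape of that pattern (schedule one read per chunk, gather the results, and hide the event loop behind a synchronous wrapper) can be sketched as follows. Here get_chunk is a stand-in written for this example, not an existing zarr function, and the "store" is just a dict keyed by chunk coordinates.

```python
import asyncio


async def get_chunk(store, coords):
    """Stand-in for a hypothetical per-chunk read; NOT a zarr API."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return store[coords]


async def get_chunks(store, coords_list):
    # Create every read up front, then wait for all of them together.
    tasks = [get_chunk(store, c) for c in coords_list]
    return await asyncio.gather(*tasks)


def get_chunks_sync(store, coords_list):
    """Synchronous wrapper so callers need not touch the async API."""
    return asyncio.run(get_chunks(store, coords_list))
```

asyncio.gather returns results in the order the awaitables were passed, so the output lines up with coords_list.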

ilan-gold avatar May 22 '25 14:05 ilan-gold

Saving this Zulip comment from @d-v-b so it's linked here:

we now have methods defined on the Array class for iterating over chunk-shaped regions, shard-shaped regions, and arbitrary-shaped regular regions. these are private right now but they could be made public if they prove useful

I think making these public would solve this issue

dstansby avatar Sep 26 '25 15:09 dstansby

yes I think so. The simplest thing would be to turn these methods into stand-alone functions that take an Array / AsyncArray as the first argument, then re-use those functions for the methods
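A minimal sketch of that layout, assuming a regular chunk grid: the logic lives in a module-level function and the method only delegates to it. ToyArray and iter_chunk_coords are illustrative names, not the actual zarr classes or private methods.

```python
import itertools


def iter_chunk_coords(array):
    """Stand-alone function: yield the grid coordinates of every chunk."""
    # Ceiling division gives the number of chunks along each axis.
    grid = [-(-s // c) for s, c in zip(array.shape, array.chunks)]
    yield from itertools.product(*(range(n) for n in grid))


class ToyArray:
    """Minimal stand-in for Array / AsyncArray; only shape and chunks matter."""

    def __init__(self, shape, chunks):
        self.shape = shape
        self.chunks = chunks

    def iter_chunk_coords(self):
        # The method is a thin wrapper around the module-level function.
        return iter_chunk_coords(self)
```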

d-v-b avatar Sep 26 '25 15:09 d-v-b