Add easy way to iterate through every chunk of an array
I am finding myself wanting to apply an operation (e.g., thresholding) on every chunk in an array. It would therefore be nice if there was an easy way to iterate through each chunk of the array.
Something like:
```python
for i, chunk in enumerate(zarr_arr.all_chunks):
    zarr_arr.all_chunks[i] = np.clip(chunk, min_val, max_val)
```
This is sort of similar to block indexing, but iterating over Array.blocks only iterates over the first axis of blocks, and gives you back all the chunks along the other axes.
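For concreteness, here is a minimal pure-Python sketch of what iterating over the full chunk grid entails: walk every grid index with `itertools.product`, not just the first axis. The `iter_chunk_slices` name and its arguments are illustrative, not existing zarr API.

```python
import itertools
import math

def iter_chunk_slices(shape, chunks):
    """Yield (grid_index, slices) for every chunk-shaped region of an array.

    Illustrative sketch of what an ``all_chunks`` iterator would need to do:
    cover the whole chunk grid, clipping the trailing (ragged) chunks.
    """
    grid = [math.ceil(s / c) for s, c in zip(shape, chunks)]
    for idx in itertools.product(*(range(g) for g in grid)):
        slices = tuple(
            slice(i * c, min((i + 1) * c, s))
            for i, c, s in zip(idx, chunks, shape)
        )
        yield idx, slices

# A 5x4 array with 2x3 chunks has a 3x2 chunk grid -> 6 chunks total.
regions = list(iter_chunk_slices((5, 4), (2, 3)))
```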
something like this? https://github.com/zarr-developers/zarr-python/blob/4c3081c9b4678d1de438e0d2ee8e0fa368a61f17/src/zarr/core/array.py#L893?
Hard to tell just looking at the code, but from the docstring I think that sounds about right 😄
Can we remake .blocks.__iter__ to be more like np.nditer instead?
I think that would be quite a big breaking API change, so probably worth creating a freshly named property or method?
i'm not a big fan of .blocks to be honest, i don't think it's a very intuitive API for something as simple as "access data chunk by chunk". i'm not saying we should remove it, but we should take the 2 -> 3 transition as an opportunity to think about whether there's a better way to do things
I will add a different but overlapping use-case that I have run into a few times.
We want to get random subsets of the data perfectly aligned to chunks. At the moment we either have to use low-level zarr APIs to fetch full blocks in parallel, or build an integer index along the relevant axis that looks something like `0, 1, ..., chunk_size - 1, chunk_size * N, (chunk_size * N) + 1, ..., (chunk_size * (N + 1)) - 1, ...`, but this somehow feels wrong and/or inefficient.
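To illustrate the use-case, here is a hedged sketch that picks random chunks along one axis and returns plain slices instead of materializing the big integer index. Everything here (the `chunk_aligned_slices` helper and its signature) is hypothetical, not zarr API.

```python
import random

def chunk_aligned_slices(length, chunk_size, n_chunks, seed=0):
    """Pick ``n_chunks`` distinct random chunks along one axis and return
    their slices, rather than building a giant integer index array.

    The final chunk is clipped at ``length`` in case it is ragged.
    """
    n_total = -(-length // chunk_size)  # ceiling division
    rng = random.Random(seed)
    picked = sorted(rng.sample(range(n_total), n_chunks))
    return [
        slice(i * chunk_size, min((i + 1) * chunk_size, length))
        for i in picked
    ]

# Three random chunk-aligned slices out of a length-100 axis with chunk size 10.
slices = chunk_aligned_slices(length=100, chunk_size=10, n_chunks=3)
```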
I could see a public API like `async get_chunk(key)` that could then also be used by an iterator, something like `async iter(order)`.
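A toy sketch of what that API shape could look like, using a fake in-memory array standing in for a zarr `Array`. Nothing here is existing zarr API; `get_chunk` and `iter_chunks` are the proposed names.

```python
import asyncio

class ToyChunkedArray:
    """In-memory stand-in for a chunked array, keyed by grid index."""

    def __init__(self, chunks):
        self._chunks = chunks  # dict mapping grid index -> chunk data

    async def get_chunk(self, key):
        # In a real implementation this would fetch and decode one chunk.
        return self._chunks[key]

    async def iter_chunks(self, order):
        # An async iterator built on top of get_chunk, visiting chunks
        # in the caller-supplied order.
        for key in order:
            yield key, await self.get_chunk(key)

async def main():
    arr = ToyChunkedArray({(0, 0): [1, 2], (0, 1): [3, 4]})
    out = []
    async for key, chunk in arr.iter_chunks(order=[(0, 0), (0, 1)]):
        out.append((key, chunk))
    return out

result = asyncio.run(main())
```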
I'd be happy to contribute if there were interest. zarrs has this functionality and it's quite handy: https://docs.rs/zarrs/latest/zarrs/array/struct.Array.html#method.retrieve_chunk
@ilan-gold I think we already have these as private array methods:
would it be helpful if these were made public API?
oh wow @d-v-b let me look at these. I'll pass these on to the people who are currently working on this. I will report back our experience and we can go from there.
we are missing something like get_chunk(array, (0,0,0)) -> array-like
Right @d-v-b that is what I'm looking for, to be able to take advantage of pipeline-level parallelism over a selection of different chunks without having to dive into the async API. At the moment we create BlockIndexer objects for every chunk individually and then gather the results.
At the moment we have something like:
```python
for i, idx in enumerate(chunk_idxs):
    chunk_slice = self.chunks_slices[idx]
    array_idx = self.array_idxs[idx]
    array = self.arrays[array_idx]
    indexer = BasicIndexer(
        chunk_slice,
        shape=array.metadata.shape,
        chunk_grid=array.metadata.chunk_grid,
    )
    # this can also be a gpu buffer
    buffer = prototype.nd_buffer.from_numpy_array(out[local_slices[i]])
    task = array._async_array._get_selection(
        indexer, prototype=prototype, out=buffer
    )
    tasks.append(task)
```
which, modulo using public APIs (i.e., opening an async Array from the start), is about what I'd expect to have to write to achieve this result without creating a giant integer array to do the indexing.
Saving this Zulip comment from @d-v-b so it's linked here:
> we now have methods defined on the Array class for iterating over chunk-shaped regions, shard-shaped regions, and arbitrary-shaped regular regions. these are private right now but they could be made public if they prove useful
I think making these public would solve this issue
yes I think so. The simplest thing would be to turn these methods into stand-alone functions that take an Array / AsyncArray as the first argument, then re-use those functions for the methods
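A rough sketch of that refactor pattern: move the iteration logic into a stand-alone function that takes the array as its first argument, then have the method delegate to it. The names and the minimal `Array` stand-in below are illustrative, not zarr's actual classes or private methods.

```python
import itertools
import math

def iter_chunk_regions(array):
    """Stand-alone function: yield the grid index of every chunk of ``array``.

    Only ``shape`` and ``chunks`` attributes are needed, so the same
    function could serve both a sync Array and an AsyncArray wrapper.
    """
    grid = [math.ceil(s / c) for s, c in zip(array.shape, array.chunks)]
    yield from itertools.product(*(range(g) for g in grid))

class Array:
    """Minimal stand-in; the real class would keep its existing method
    and simply delegate to the free function."""

    def __init__(self, shape, chunks):
        self.shape, self.chunks = shape, chunks

    def _iter_chunk_regions(self):
        return iter_chunk_regions(self)

arr = Array(shape=(4, 4), chunks=(2, 2))
keys = list(arr._iter_chunk_regions())
```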