zarr-python
Iteration over Zarr arrays is significantly slower in v3
Zarr version
3.0.0b2
Numcodecs version
0.14.0
Python Version
3.11
Operating System
Mac
Installation
pip
Description
Iterating over elements or slices of a Zarr array z using iter(z) is much slower in v3 than v2.
In v2, Array implements __iter__, which caches chunks, whereas in v3 Array does not implement __iter__, so iter(z) falls back to calling z[0], z[1], etc., which means that every element or slice has to load its chunk again.
This seems like a regression, but perhaps there was a reason for not implementing __iter__ on Array? The thing is that code will still work in v3, but it will be a lot slower (possibly by orders of magnitude), and it's quite hard to track down what is happening.
(There's an interesting Array API issue about iterating over arrays which may be relevant here: https://github.com/data-apis/array-api/issues/818.)
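For illustration, the slowdown comes from Python's legacy fallback protocol rather than anything zarr-specific: when a class defines __getitem__ but not __iter__, iter() just calls obj[0], obj[1], ... until an IndexError is raised. A minimal sketch in plain Python (not zarr code):
class GetItemOnly:
    """Defines __getitem__ but not __iter__, like Array in zarr-python v3."""

    def __init__(self, data):
        self._data = data

    def __getitem__(self, idx):
        # For a zarr array, every one of these calls re-reads and
        # re-decodes whichever chunk(s) contain index idx.
        print(f"__getitem__({idx})")
        return self._data[idx]

# list() uses the fallback: __getitem__(0), (1), (2), then (3) -> IndexError
print(list(GetItemOnly([10, 20, 30])))  # [10, 20, 30]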
Steps to reproduce
See https://github.com/sgkit-dev/bio2zarr/pull/288#issuecomment-2514346824
Additional output
No response
I think we just haven't implemented __iter__ in v3; I don't recall any explicit opposition to it.
I do think this kind of iteration is intrinsically inefficient for most zarr arrays, because it doesn't take any account of the chunk structure. Even if __iter__ caches a single chunk at a time, it might fetch + decompress the same chunk multiple times. To minimize IO, __iter__ should cache all the chunks that intersect with the spatial index, but this is not safe in general for large arrays.
The most efficient iteration would be to iterate over chunks, then iterate over each chunk. This of course is not possible if we choose to emulate numpy-style iteration.
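Something like the following sketch, for concreteness (chunk_order_iter is just an illustrative name, and it assumes the blocks indexer accepts a tuple of chunk-grid indices):
import itertools

def chunk_order_iter(z):
    # Visit every chunk exactly once and yield its elements.
    # Element order follows the chunk layout, not numpy's row-major order.
    for grid_idx in itertools.product(*(range(n) for n in z.cdata_shape)):
        block = z.blocks[grid_idx]  # load + decode this chunk exactly once
        yield from block.flat

Each chunk is fetched and decompressed once regardless of how the array is chunked, but the elements come out in chunk order, which is why this can't stand in for numpy-style __iter__.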
I do think this kind of iteration is intrinsically inefficient for most zarr arrays, because it doesn't take any account of the chunk structure. Even if __iter__ caches a single chunk at a time, it might fetch + decompress the same chunk multiple times.
I don't follow this - if it iterates over the chunks in the first dimension, then that's about as good as you can do, given the definition of the operation? Something like
def iter(z):
    for chunk in range(z.cdata_shape[0]):
        A = z.blocks[chunk]
        for a in A:
            yield a
(Just FYI, I added the iter functionality to zarr 2 and find it a useful operation)
Correct me if I'm misunderstanding something -- suppose we have an array a with shape (8, 7) and chunks sized (8, 1). a has 7 chunks in total. a[0] will return an array of length 7 composed of the 0th element of each of those 7 chunks, and similarly for a[1] etc. If only 1 chunk is cached at a time during iteration, then each chunk must be read + decompressed every time a[idx] is invoked. Of course, if a had chunks sized (1, 7), then caching each chunk during iteration would be quite efficient.
The basic problem is that the IO-optimal way of iterating over zarr arrays requires knowledge of how they are chunked, but the __iter__ behavior we would implement is fundamentally a numpy convention that assumes fetching data from memory is cheap, and so it is completely ignorant of the chunk layout. That's fine for numpy, because numpy arrays are in memory and people don't usually need to worry much about the memory layout; the same does not hold for zarr.
(edit: I had transposed my efficient and inefficient chunks)
to illustrate with some examples from zarr==2.18:
this requires reading every chunk per iteration:
>>> [x for x in zarr.zeros((3,2), chunks=(3,1))]
[array([0., 0.]), array([0., 0.]), array([0., 0.])]
this requires reading 1 chunk per iteration:
>>> [x for x in zarr.zeros((3,2), chunks=(1,2))]
[array([0., 0.]), array([0., 0.]), array([0., 0.])]
For isotropic chunk sizes, some caching is better than none, but this kind of iteration will still be inefficient compared to a chunk-aware iteration API.
Sorry, I don't get it (I must be missing something basic here!)
def first_dim_iter(z):
    for chunk in range(z.cdata_shape[0]):
        A = z.blocks[chunk]
        for a in A:
            yield a
Surely the numpy array A here is the cached result of each chunk, in turn, of the first dimension of z. It makes no difference how the subsequent dimensions are chunked, right? And, given that you must keep the first dimension cached here to have anything like reasonable performance, then this is basically optimal (assuming you want to iterate over every "row" in the first dimension).
So this is chunk aware, right?
Sorry, I don't get it (I must be missing something basic here!)
def first_dim_iter(z):
    for chunk in range(z.cdata_shape[0]):
        A = z.blocks[chunk]
        for a in A:
            yield a
Surely the numpy array A here is the cached result of each chunk, in turn, of the first dimension of z. It makes no difference how the subsequent dimensions are chunked, right? And, given that you must keep the first dimension cached here to have anything like reasonable performance, then this is basically optimal (assuming you want to iterate over every "row" in the first dimension). So this is chunk aware, right?
That function will potentially load an entire array into memory (see the first case), which we do not want:
# /// script
# requires-python = ">=3.11"
# dependencies = [
# "zarr==2.18",
# ]
# ///
import zarr
import numpy as np
def first_dim_iter(z):
    for chunk in range(z.cdata_shape[0]):
        A = z.blocks[chunk]
        print('A has shape ', A.shape)
        for a in A:
            yield a

shape = (8, 7)
chunks = ((8, 1), (1, 7), (3, 3))
for _chunks in chunks:
    print('chunks', _chunks)
    z = zarr.zeros(shape=shape, chunks=_chunks)
    z[:] = np.arange(np.prod(shape)).reshape(*shape)
    gen = first_dim_iter(z)
    print(next(gen))
    print(next(gen))
produces
chunks (8, 1)
A has shape (8, 7)
[0. 1. 2. 3. 4. 5. 6.]
[ 7. 8. 9. 10. 11. 12. 13.]
chunks (1, 7)
A has shape (1, 7)
[0. 1. 2. 3. 4. 5. 6.]
A has shape (1, 7)
[ 7. 8. 9. 10. 11. 12. 13.]
chunks (3, 3)
A has shape (3, 7)
[0. 1. 2. 3. 4. 5. 6.]
[ 7. 8. 9. 10. 11. 12. 13.]
It looks like the implementation of Array.__iter__ in v2 had the same behavior -- in the worst case of misalignment between the chunks and the iteration, the entire array might be loaded into memory.
Should we add this to v3 and treat it like a "user beware" situation, where people should know how the array is chunked before they iterate over it? Personally, I would be unpleasantly surprised if I tried to iterate over a chunked array and incidentally fetched all the chunks (but I work with TB-sized data). I would generally assume that a "big data first" library would keep surprises like that to a minimum.
More broadly, I'm not a fan of copying numpy APIs like __iter__ that implicitly rely on arrays existing in memory, since Zarr arrays are very much not implicitly in memory.
That function will potentially load an entire array into memory (see the first case), which we do not want:
Sorry I'm sure I'm being thick here, but I'm still not getting it. You could only load the entire array into memory if there was only one chunk in the first dimension, right? Having a single chunk in the first dimension of a large array seems quite an odd situation to me? I'm used to working with arrays with a large first dimension, with many chunks, and this pattern works well.
I work with TB-scale data as well, and I entirely agree that unpleasant surprises like loading an entire array into memory behind the scenes shouldn't happen. But surprises like
for a in z:
    # do something with a
decompressing each chunk chunksize times are also not entirely pleasant.
Having a single chunk in the first dimension of a large array seems quite an odd situation to me?
Let me assure you that this is extremely common. Zarr users exercise every possible allowed permutation of chunk shapes to satisfy different data access patterns. There is nothing special about the first dimension in any way.
Thanks @rabernat, that's why I'm confused then!
But in this one chunk case, isn't the whole array getting loaded into memory anyway each time we access a single element, during iteration?
But in this one chunk case, isn't the whole array getting loaded into memory anyway each time we access a single element, during iteration?
In the worst case, the entire array is loaded into memory just once (on the first call to next() on the iterator), and on each subsequent iteration that in-memory copy of the array is iterated over. In the best case, each chunk is loaded just once, exactly when it's needed.
I think the following example illustrates it. I'm using a subclass of MemoryStore that prints whenever __getitem__ is invoked to indicate where IO would be happening:
# /// script
# requires-python = ">=3.10"
# dependencies = [
# "zarr==2.18",
# ]
# ///
import zarr
import numpy as np
class LoudStore(zarr.MemoryStore):
    def __getitem__(self, key):
        print(f"reading {key}")
        return super().__getitem__(key)

shape = (8, 7)
chunks = ((shape[0], 1), (1, shape[1]))
for _chunks in chunks:
    print(f'\nchunks: {_chunks}')
    z = zarr.zeros(shape=shape, chunks=_chunks, store=LoudStore())
    z[:] = np.arange(z.size).reshape(shape)
    gen = z.islice()
    for it in range(2):
        print(f'iter {it}:')
        print(next(gen))
which produces the following output:
chunks: (8, 1)
reading .zarray
iter 0:
reading 0.0
reading 0.1
reading 0.2
reading 0.3
reading 0.4
reading 0.5
reading 0.6
[0. 1. 2. 3. 4. 5. 6.]
iter 1:
[ 7. 8. 9. 10. 11. 12. 13.]
chunks: (1, 7)
reading .zarray
iter 0:
reading 0.0
[0. 1. 2. 3. 4. 5. 6.]
iter 1:
reading 1.0
[ 7. 8. 9. 10. 11. 12. 13.]
@d-v-b in v3 chunks are not cached though, so it's much slower, which is why I opened this issue. This is the worst of both worlds, as iteration works but is terribly inefficient.
I think there's a case for implementing __iter__ for 1-D arrays since that is unambiguous, but raising an exception for higher-dimensional arrays (see https://github.com/data-apis/array-api/pull/821). There could perhaps be some utility methods for iterating over Zarr arrays, like Jerome's first_dim_iter or similar.
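As a rough sketch of what that could look like (iter_1d is just an illustrative name, reusing the cdata_shape/blocks accessors from the snippets above; not a concrete API proposal):
def iter_1d(z):
    # Element iteration is unambiguous for 1-D arrays: walk the chunk
    # grid so each chunk is read and decoded exactly once.
    if z.ndim != 1:
        raise TypeError(
            f"iteration is only supported for 1-D arrays, got {z.ndim} dimensions"
        )
    for chunk_idx in range(z.cdata_shape[0]):
        chunk = z.blocks[chunk_idx]
        yield from chunk

Higher-dimensional arrays would then have to go through an explicit, chunk-aware utility rather than plain iteration.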
If you care about the order in which elements come out, this iteration is only cheap if there is a single chunk along the last dimension (for order='C') or the first dimension (for order='F').
Anything else will require some handling of potentially large slices and complicated chunk caches whose behaviour depends on the chunking of the array.
(I like this image from the dask docs that illustrates the problem).
So perhaps we bring it back but just add the chunksize checks to avoid potentially large memory usage.
It is always possible to just iterate through blocks, flattening each one and yielding its elements, as long as you do not care about order. Dask just added this recently and it can be quite useful.
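To make the chunksize-check idea concrete, one possible shape it could take (a sketch only: checked_iter and max_step_bytes are placeholders, not a proposed API) is to estimate how much data one numpy-style iteration step has to materialize, and refuse to iterate if it exceeds a limit:
import math

def checked_iter(z, max_step_bytes=128 * 2**20):
    # One numpy-style step yields z[i], which touches every chunk that
    # intersects row i; caching that whole chunk row needs
    # chunks[0] * prod(shape[1:]) elements.
    step_bytes = z.chunks[0] * math.prod(z.shape[1:]) * z.dtype.itemsize
    if step_bytes > max_step_bytes:
        raise ValueError(
            f"iterating would cache ~{step_bytes} bytes per chunk row "
            f"(shape={z.shape}, chunks={z.chunks}); iterate over blocks instead"
        )
    for chunk_idx in range(z.cdata_shape[0]):
        block = z.blocks[chunk_idx]  # one chunk row, loaded once
        yield from block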
So perhaps we bring it back but just add the chunksize checks to avoid potentially large memory usage.
That would be good.