Performance regression in V3
Zarr version
3.0.0
Numcodecs version
0.14.1
Python Version
3.13
Operating System
Linux
Installation
Using uv
Description
This simple workload, which writes out roughly one billion consecutive float64 values in 64 separate chunks,
# Using zarr==2.18.2
import numpy as np
import zarr
from zarr._storage.v3 import DirectoryStoreV3
store = DirectoryStoreV3("/tmp/foo.zarr")
arr = zarr.array(np.arange(1024 * 1024 * 1024, dtype=np.float64), chunks=(1024 * 1024 * 16,))
zarr.save_array(store, arr, zarr_version=3, path="/")
runs in about 5 s on my machine on version 2.18.
The equivalent workload on version 3 takes over a minute:
# Using zarr==3.0
import numpy as np
import zarr
import zarr.codecs
from zarr.storage import LocalStore
store = LocalStore("/tmp/bar.zarr")
compressors = zarr.codecs.BloscCodec(cname='lz4', shuffle=zarr.codecs.BloscShuffle.bitshuffle)
za = zarr.create_array(
    store,
    shape=(1024 * 1024 * 1024,),
    chunks=(1024 * 1024 * 16,),
    dtype=np.float64,
    compressors=compressors,
)
arr = np.arange(1024 * 1024 * 1024, dtype=np.float64)
za[:] = arr
Steps to reproduce
See above
Additional output
No response
Ouch, sorry the performance is so much worse in 3.0. I am pretty sure we can do better; hopefully this is an easy fix. In case you or anyone else wants to dig into this, I would profile this function
FWIW, I tried and failed to reproduce this regression locally. In fact, V3 was faster.
2.18.2
arr = zarr.array(np.arange(1024 * 1024 * 1024, dtype=np.float64), chunks=(1024 * 1024 * 16,))
# -> 8.0s
zarr.save_array(store, arr, zarr_version=3, path="/")
# -> 5.3s
3.0.1.dev10+g45146ca0
arr = np.arange(1024 * 1024 * 1024, dtype=np.float64)
# -> 1.2s
za[:] = arr
# -> 9.9s
I wonder what could be the difference between environments. Perhaps the regression is hardware-dependent. I'm on a MacBook with a fast SSD.
Hmm, I don't know what your first 8 s measurement was. It should not take that long to allocate some chunk buffers in memory. I also have a MacBook Pro M3 Max, so I will rerun these and report back. The initial set of measurements I took was on an AWS EC2 m6i instance.
UPDATE:
My latencies also look much better on macOS, although Zarr V3 still measures slower.
Zarr 2.18.4
❯ ZARR_V3_EXPERIMENTAL_API=1 ipython
Python 3.13.1 (main, Jan 5 2025, 06:22:40) [Clang 19.1.6 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.31.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import numpy as np
In [2]: import zarr
In [3]: from zarr._storage.v3 import DirectoryStoreV3
In [4]: %time arr = zarr.array(np.arange(1024 * 1024 * 1024, dtype=np.float64), chunks=(1024 * 1024 * 16,))
CPU times: user 3.29 s, sys: 525 ms, total: 3.82 s
Wall time: 1.25 s
In [5]: store = DirectoryStoreV3("/tmp/foo.zarr")
In [6]: %time zarr.save_array(store, arr, zarr_version=3, path="/")
CPU times: user 7.83 s, sys: 543 ms, total: 8.37 s
Wall time: 1.25 s
Zarr 3.0.0
In [5]: store = LocalStore("/tmp/bar.zarr")
In [6]: compressors = zarr.codecs.BloscCodec(cname="lz4", shuffle=zarr.codecs.BloscShuffle.bitshuffle)
In [8]: za = zarr.create_array(store, shape=(1024 * 1024 * 1024,), chunks=(1024 * 1024 * 16,), dtype=np.float64, compressors=compressors)
In [9]: arr = np.arange(1024 * 1024 * 1024, dtype=np.float64)
In [10]: %time za[:] = arr
CPU times: user 6.75 s, sys: 1.38 s, total: 8.13 s
Wall time: 3.69 s
I'll keep digging...
My current hypothesis is that the benchmarking I've been running on the EC2 instance (r7i.2xlarge) is memory-bandwidth constrained (AWS is very hand-wavy about memory bandwidth throttling). Even just allocating large arrays, e.g. np.ones(1024 ** 3), takes a few seconds (whereas on bare metal it would be in the hundreds of milliseconds).
However, because https://github.com/zarr-developers/zarr-python/blob/main/src/zarr/core/codec_pipeline.py#L396 is run in a coroutine on the event loop (even though there is no I/O), it effectively crowds out other tasks from running. For a 128 MiB chunk, calling np.array_equal takes about 150 ms (again probably due to the lack of memory bandwidth). Across the 64 chunks, that adds up to roughly 10 s of serial work on the event loop (64 × ~150 ms).
import numpy as np

a1 = np.arange(1024 ** 2 * 16)
a2 = np.ones(1024 ** 2 * 16)
# this 128 MiB comparison takes ~150 ms on my r7i.2xlarge EC2 instance and ~50 ms on my M3 Max
np.array_equal(a1, a2)
If I short-circuit chunk_array.all_equal by setting it to False, then the rest of the pipeline finishes very quickly and I can write out the full array in ~3s, which brings me to parity with the performance I can get on Zarr 2.18.
The former implementation zarr.util.all_equal appears to compare the chunk to a scalar
return np.all(value == array)
which under the hood does an efficient SIMD comparison (it takes only ~15 ms on the same machine), whereas the version 3 implementation is much more expensive.
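For illustration, here is a rough micro-benchmark of the two styles of check on a single 128 MiB chunk. The fill value and the equal_nan handling below are assumptions for the sketch, not zarr's exact code path:

```python
import timeit

import numpy as np

chunk = np.arange(1024 * 1024 * 16, dtype=np.float64)  # one 128 MiB chunk
fill_value = 0.0

# v2-style check: broadcast the scalar and reduce in a single pass over the chunk
t_scalar = timeit.timeit(lambda: np.all(chunk == fill_value), number=10) / 10

# v3-style check (approximated): compare against a materialized fill array with
# NaN handling, which adds extra passes and temporaries over the same 128 MiB
fill = np.broadcast_to(fill_value, chunk.shape)
t_array = timeit.timeit(lambda: np.array_equal(chunk, fill, equal_nan=True), number=10) / 10

print(f"np.all(chunk == fill_value):         {t_scalar * 1e3:.1f} ms")
print(f"np.array_equal(..., equal_nan=True): {t_array * 1e3:.1f} ms")
```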
Hi, do the maintainers have any thoughts on this? I am happy to take a look if there is a change that you are willing to accept, but I'd first need some context on why the decision was made to change the implementation of all_equal.
@y4n9squared - thanks very much for your work on this. We are definitely interested in fixing this performance regression and appreciate your input.
@d-v-b is the one who wrote that code, so let's get his input on it. From my perspective, I am very open to changing the way this is implemented back to the way it used to be.
Hi @d-v-b , just so we have something concrete to discuss, I opened a draft PR with the changes that I made which showed improvement.
@y4n9squared thanks for the report, we are definitely happy to fix performance regressions here.
one question about the examples you shared upthread: In zarr-python 2.18, the default value for write_empty_chunks is True, but in 3.x it's False.
Because of this change in default behavior, benchmarking across zarr-python versions without explicitly setting write_empty_chunks might give you misleading results.
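For a like-for-like benchmark, the flag can be pinned explicitly on both versions. A minimal sketch, assuming the 3.x config key is spelled array.write_empty_chunks and that create_array accepts a config dict (worth double-checking against the docs):

```python
import numpy as np
import zarr

# zarr-python 2.x: the flag is a keyword argument at array creation time, e.g.
#   arr = zarr.array(data, chunks=(1024 * 1024 * 16,), write_empty_chunks=True)

# zarr-python 3.x: set it globally via the runtime config ...
zarr.config.set({"array.write_empty_chunks": True})

# ... or per array via the config argument (assumed spelling)
za = zarr.create_array(
    "/tmp/baz.zarr",
    shape=(1024,),
    chunks=(256,),
    dtype=np.float64,
    config={"write_empty_chunks": True},
)
```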
For the broader question of benchmarking in the context of throttled or low memory bandwidth, can you recommend any tools for artificially reducing memory bandwidth? I'd like to be able to benchmark these things locally.
Contrary to what git blame suggests, I didn't write that implementation of all_equal -- it appeared in https://github.com/zarr-developers/zarr-python/pull/1826. In any case, we should definitely go with an implementation that has the best all-around performance.
@d-v-b Thanks for the callout. I'll rerun the comparison so that it's apples-to-apples. Along these lines, has there been a change to the compressor behavior as well?
IIUC, in Zarr 2.x, if one did not override the compressor, the default was Blosc/LZ4 with shuffle, where the shuffle mode (bit vs. byte) depended on the dtype; for float64, it would use bitshuffle.
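One quick way to confirm what 2.x actually picks by default is to create a small array under zarr==2.18.x and inspect its compressor (the exact settings printed may differ by install):

```python
# Run under zarr==2.18.x
import numpy as np
import zarr

arr = zarr.array(np.arange(16, dtype=np.float64), chunks=(8,))
print(arr.compressor)  # e.g. Blosc(cname='lz4', clevel=5, shuffle=..., blocksize=0)
```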
When I compare the compressed chunk sizes with this 3.x code:
import numpy as np
import zarr
import zarr.codecs

store = "/tmp/foo.zarr"
shape = (1024 * 1024 * 1024,)
chunks = (1024 * 1024 * 16,)
dtype = np.float64
fill_value = np.nan
# cname = "blosclz"
cname = "lz4"
compressors = zarr.codecs.BloscCodec(cname=cname, shuffle=zarr.codecs.BloscShuffle.bitshuffle)
za = zarr.create_array(
    store,
    shape=shape,
    chunks=chunks,
    dtype=dtype,
    fill_value=fill_value,
    compressors=compressors,
)
arr = np.arange(1024 * 1024 * 1024, dtype=dtype).reshape(shape)
za[:] = arr
I see notable differences in the sizes. Curiously, if I switch the cname to blosclz instead, I get chunk sizes that are more similar to Zarr 2.x.
What is the correct compressor config for an accurate 2.x comparison?
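For reference, a generic way to compare the on-disk footprint of the two stores (plain filesystem walking, not a zarr API; the paths are the ones used in this thread):

```python
from pathlib import Path

def store_size(path: str) -> int:
    """Total size in bytes of all files under a directory store."""
    return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file())

for path in ("/tmp/foo.zarr", "/tmp/bar.zarr"):
    print(path, round(store_size(path) / 2**20, 1), "MiB")
```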
For the broader question of benchmarking in the context of throttled or low memory bandwidth, can you recommend any tools for artificially reducing memory bandwidth? I'd like to be able to benchmark these things locally.
I'm not entirely sure. Aside from the general notion that EC2 instances are VMs and small instance types like the ones I've been using (m6i.2xlarge, r7i.2xlarge) are more susceptible to "noisy neighbors", I haven't been able to pinpoint why I see the slowdowns that I do. Some of the normal tools I'd use to inspect these things (e.g. Linux perf) are handicapped in VMs. I am going to look into this a bit more and report back.
Thanks for iterating on this, this is exactly what we need to dial in the performance of the library.
For zarr v2 data, all of the zarr-python v2 codecs are available in zarr-python 3.0. You should be able to pass exactly the same numcodecs instance as compressors to create_array, which should make for a fair comparison. But maybe to keep things simple for a benchmark, no compression at all might make sense.
If I pass numcodecs codecs directly to create_array,
compressors = [numcodecs.Blosc(cname="lz4")]
I get this traceback:
File "/home/yang.yang/workspaces/zarr-python/src/zarr/core/array.py", line 3926, in create_array
array_array, array_bytes, bytes_bytes = _parse_chunk_encoding_v3(
~~~~~~~~~~~~~~~~~~~~~~~~^
compressors=compressors,
^^^^^^^^^^^^^^^^^^^^^^^^
...<2 lines>...
dtype=dtype_parsed,
^^^^^^^^^^^^^^^^^^^
)
^
File "/home/yang.yang/workspaces/zarr-python/src/zarr/core/array.py", line 4127, in _parse_chunk_encoding_v3
out_bytes_bytes = tuple(_parse_bytes_bytes_codec(c) for c in maybe_bytes_bytes)
File "/home/yang.yang/workspaces/zarr-python/src/zarr/core/array.py", line 4127, in <genexpr>
out_bytes_bytes = tuple(_parse_bytes_bytes_codec(c) for c in maybe_bytes_bytes)
~~~~~~~~~~~~~~~~~~~~~~~~^^^
File "/home/yang.yang/workspaces/zarr-python/src/zarr/registry.py", line 184, in _parse_bytes_bytes_codec
raise TypeError(f"Expected a BytesBytesCodec. Got {type(data)} instead.")
TypeError: Expected a BytesBytesCodec. Got <class 'numcodecs.blosc.Blosc'> instead.
ah, sorry for the confusion, I thought you were making zarr v2 data (i.e., you were calling create_array with zarr_format=2). As you learned, create_array with zarr_format=3 won't take the v2 numcodecs codecs (and we should make a better error message for this).
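For concreteness, a sketch of the v2 path being described here, assuming create_array accepts a numcodecs compressor when zarr_format=2 (the store path and sizes are illustrative):

```python
import numcodecs
import numpy as np
import zarr

za = zarr.create_array(
    "/tmp/v2-test.zarr",
    shape=(1024 * 1024 * 1024,),
    chunks=(1024 * 1024 * 16,),
    dtype=np.float64,
    zarr_format=2,
    compressors=numcodecs.Blosc(cname="lz4", shuffle=numcodecs.Blosc.BITSHUFFLE),
)
```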
Do you know what causes the discrepancy in the compression results, then? I expected that, even though the Python syntax has changed, the actual compression behavior and results would be the same.
Hi @d-v-b, could you comment on my compressor question above? In summary, my expectation was that using
compressor = numcodecs.Blosc(cname="lz4", shuffle=numcodecs.Blosc.BITSHUFFLE)
in 2.x would yield the same behavior as
compressors = zarr.codecs.BloscCodec(cname="lz4", shuffle=zarr.codecs.BloscShuffle.bitshuffle)
in 3.x, but this appears not to be the case. The compression ratio is actually off by quite a bit.
that sounds like a bug, but maybe it should be spun out into its own issue with a reproducer?
I created #2766 to track this issue separately.
Hi Zarr devs, with some of the recent bug fixes landed, I wanted to chase down some additional performance observations on our Zarr 2 -> 3 migration.
We have an Xarray data array (name/coords redacted)
<xarray.DataArray 'foo' (x: 108, y: 4932, z: 366)> Size: 2GB
dask.array<open_dataset-foo, shape=(108, 4932, 366), dtype=float64, chunksize=(54, 4932, 61), chunktype=numpy.ndarray>
which I have written using Zarr 2.18 (zarr_version=3 + ZARR_V3_EXPERIMENTAL_API=1) and Zarr 3.0.8 (zarr_format=3). Both use the default compressor settings, so the former is Blosc/LZ4 and the latter is Zstandard.
As advertised, the unchunked xr.DataArray.load() is significantly faster in Zarr 3 (presumably due to being able to fetch multiple chunks concurrently). The raw numbers were something like 900ms --> 600ms on my machine.
Somewhat surprisingly, I observed that the chunked .load() (using the Dask threaded scheduler and Zarr-aligned chunks) is much slower with a single worker; raw numbers were around 1s --> 1.5s. Also surprising to me, increasing the worker count did not reduce the Zarr 3 latency below the unchunked load time, while in Zarr 2 it scaled roughly linearly up to a point (~4 workers for me).
|        | no dask | 1 dask worker | 4 dask workers |
|--------|---------|---------------|----------------|
| Zarr 2 | ~900 ms | ~1 s          | ~600 ms        |
| Zarr 3 | ~600 ms | ~1.5 s        | ~900 ms        |
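A sketch of how the worker-count comparison above might be reproduced; the open arguments mirror the reproducer shared later in this thread, while the store path and worker counts are illustrative:

```python
import time

import dask
import xarray as xr

for n_workers in (1, 4):
    with dask.config.set(scheduler="threads", num_workers=n_workers):
        da = xr.open_dataarray(
            "/tmp/bar.zarr", engine="zarr", consolidated=False, chunks={}, zarr_format=3
        )
        t0 = time.perf_counter()
        da.load()
        print(f"{n_workers} dask worker(s): {time.perf_counter() - t0:.2f}s")
```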
(To rule out Blosc/LZ4 vs. Zstd as the cause, I ran the same set of experiments with the Zarr 3 data saved using Blosc/LZ4, and the outcome was roughly the same.)
I am happy to provide a reproducer, but first wanted to ask if any of these observations are surprising to you/things you have experienced.
a reproducer would be great, and actually we have heard similar reports about performance regression in zarr-python 3 -- see https://github.com/cgohlke/tifffile/issues/297#issuecomment-2905785157
Here are two Zarr groups containing a single array with the dimensions/chunking I described above:
- Zarr 2.18: foo.tar.gz
- Zarr 3.0.8: bar.tar.gz
Also attached are two flame graphs measuring this workload:
import dask.config
# import numcodecs
import xarray as xr

dask.config.set({"scheduler": "synchronous"})
# numcodecs.blosc.use_threads = False

for _ in range(10):
    da = xr.load_dataarray("/tmp/bar.zarr", engine="zarr", consolidated=False, chunks={}, zarr_format=3)
The Blosc codec in numcodecs auto-enables multithreaded decompression only when it runs on the main thread, so I ran with threads disabled for a "fair" comparison. Otherwise, running Dask with the "synchronous" scheduler is faster than running it with "threading" and a single worker, which might surprise folks. And in Zarr 3, decompression would occur on the asyncio_0 thread (?) and therefore also not have Blosc threading enabled.
In Zarr 3, is the decompression occurring inside as_numpy_array_wrapper? Does the Zstd decompression release the GIL? I'm wondering if, as a byproduct of Zarr offloading Store coroutines onto its own event loop, it causes CPU underutilization when used with Dask worker threads. All of the Dask worker threads would be waiting on a sync(), and if there is CPU-bound work that could otherwise run in parallel, multiple Dask workers may end up being a wash.
The flame graphs do not directly confirm this theory, but they look consistent with it.
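A toy illustration of that hypothesis (not zarr's actual code): when several worker threads submit CPU-bound coroutines to one shared event loop and block on the results, the work is serialized on the loop thread, so adding workers does not help:

```python
import asyncio
import threading
import time

import numpy as np

# one shared event loop running on its own thread, standing in for zarr's internal loop
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

async def cpu_bound_chunk_work():
    # stands in for per-chunk CPU work (e.g. fill-value checks or decompression)
    a = np.arange(1024 * 1024 * 16, dtype=np.float64)
    return bool(np.all(a == 0.0))

def worker():
    # each "dask worker" blocks on the shared loop, much like a sync() call
    fut = asyncio.run_coroutine_threadsafe(cpu_bound_chunk_work(), loop)
    fut.result()

for n in (1, 4):
    threads = [threading.Thread(target=worker) for _ in range(n)]
    t0 = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # elapsed time scales with n because the coroutines run one at a time on the loop
    print(f"{n} worker thread(s): {time.perf_counter() - t0:.2f}s")
```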