Array indexing with Zarr 3 is noticeably slower than with Zarr 2
- Zarr version: 3.1.3
- Numcodecs version: 0.15.1
- Python Version: 3.12.9
- Operating System: Linux
- Installation: uv pip install
Description
Timing comparisons of various indexing operations between Zarr 2 and Zarr 3 show that version 3 is consistently slower than version 2. I have attached the code to run the comparisons, along with some results. All tests run in memory and the compressors are disabled.
You can see, for example, that accessing data[::step] is much slower (from 0.5 s to 3.9 s). This may be due to the overhead of switching between synchronous and asynchronous code, but that is just a guess.
Steps to reproduce
```python
import zarr
import numpy as np
import timeit
import inspect

version = int(zarr.__version__.split(".")[0])
values = np.ones(shape=(60632, 4, 1, 10))
root = zarr.group()
if version < 3:
    data = root.create_dataset("data", data=values, shape=values.shape, compressor=None)
else:
    data = root.create_array("data", data=values, compressors=None)

start = 15
end = data.shape[0] - 15
step = data.shape[0] // 10

def set_values():
    data[:] = 2

tests = [
    lambda: data[0:10, :, 0],
    lambda: data[:, 0:3, 0],
    lambda: data[0:10, 0:3, 0],
    lambda: data[:, :, :],
    lambda: data[0],
    lambda: data[0, :],
    lambda: data[0, 0, :],
    lambda: data[0, 0, 0, :],
    lambda: data[start:end:step],
    lambda: data[start:end],
    lambda: data[start:],
    lambda: data[:end],
    lambda: data[::step],
    lambda: set_values(),
]

for i, t in enumerate(tests):
    src = inspect.getsourcelines(t)[0][0].strip().replace("lambda:", "").strip(",")
    elapsed = timeit.timeit(t, number=1000)
    print(f"zarr{version}: {src:22}: {elapsed:10.4f} seconds")
```
Additional output
Running the test above with Zarr 2.18.7:
```
zarr2: data[0:10, :, 0]     :     0.1546 seconds
zarr2: data[:, 0:3, 0]      :     4.1608 seconds
zarr2: data[0:10, 0:3, 0]   :     0.1267 seconds
zarr2: data[:, :, :]        :     6.2516 seconds
zarr2: data[0]              :     0.1478 seconds
zarr2: data[0, :]           :     0.1492 seconds
zarr2: data[0, 0, :]        :     0.0602 seconds
zarr2: data[0, 0, 0, :]     :     0.0676 seconds
zarr2: data[start:end:step] :     0.5197 seconds
zarr2: data[start:end]      :     6.2125 seconds
zarr2: data[start:]         :     6.2291 seconds
zarr2: data[:end]           :     6.2185 seconds
zarr2: data[::step]         :     0.5151 seconds
zarr2: set_values()         :     3.2621 seconds
```
and with Zarr 3.1.3:
```
zarr3: data[0:10, :, 0]     :     1.1350 seconds
zarr3: data[:, 0:3, 0]      :     7.1892 seconds
zarr3: data[0:10, 0:3, 0]   :     0.9352 seconds
zarr3: data[:, :, :]        :    10.4327 seconds
zarr3: data[0]              :     1.1225 seconds
zarr3: data[0, :]           :     1.1331 seconds
zarr3: data[0, 0, :]        :     0.5101 seconds
zarr3: data[0, 0, 0, :]     :     0.5125 seconds
zarr3: data[start:end:step] :     3.8887 seconds
zarr3: data[start:end]      :    10.4090 seconds
zarr3: data[start:]         :    10.4113 seconds
zarr3: data[:end]           :    10.4148 seconds
zarr3: data[::step]         :     3.9100 seconds
zarr3: set_values()         :    17.0404 seconds
```
Thanks for this report! Our array indexing codebase is long overdue for performance tuning, and I think your examples demonstrate this decisively. Many of the Zarr developers are currently busy at the Zarr summit, and we are working on features that might touch on some of this code. Hopefully we can find some fixes soon.
Thanks a lot for the useful benchmark @b8raoult!
It's worth pointing out that the example here is highly artificial and exacerbates known issues in Zarr Python 3. Specifically:
- It uses no compression. This triggers a zero-copy optimization in Zarr 2 that is not available in Zarr 3 (see https://github.com/zarr-developers/zarr-python/issues/2904)
- It uses an in-memory store
With the following changes:
```python
root = zarr.group(store="local.zarr")  # use a local directory store
if version < 3:
    data = root.create_dataset("data", data=values, shape=values.shape)  # use default compression
else:
    data = root.create_array("data", data=values)
```
the results are [mostly] inverted:
Zarr 2.17.1
```
zarr2: data[0:10, :, 0]     :     5.5634 seconds
zarr2: data[:, 0:3, 0]      :    20.2953 seconds
zarr2: data[0:10, 0:3, 0]   :     4.1979 seconds
zarr2: data[:, :, :]        :    28.9471 seconds
zarr2: data[0]              :     5.5672 seconds
zarr2: data[0, :]           :     5.5954 seconds
zarr2: data[0, 0, :]        :     1.4001 seconds
zarr2: data[0, 0, 0, :]     :     1.5567 seconds
zarr2: data[start:end:step] :    25.4365 seconds
zarr2: data[start:end]      :    29.5792 seconds
zarr2: data[start:]         :    30.7365 seconds
zarr2: data[:end]           :    30.3044 seconds
zarr2: data[::step]         :    22.9712 seconds
zarr2: set_values()         :    22.1773 seconds
```
Zarr 3.1.3
```
zarr3: data[0:10, :, 0]     :     2.7159 seconds
zarr3: data[:, 0:3, 0]      :    13.5270 seconds
zarr3: data[0:10, 0:3, 0]   :     4.8881 seconds
zarr3: data[:, :, :]        :    17.2104 seconds
zarr3: data[0]              :     2.8128 seconds
zarr3: data[0, :]           :     2.9664 seconds
zarr3: data[0, 0, :]        :     1.1589 seconds
zarr3: data[0, 0, 0, :]     :     1.1362 seconds
zarr3: data[start:end:step] :     9.9953 seconds
zarr3: data[start:end]      :    14.6972 seconds
zarr3: data[start:]         :    16.0432 seconds
zarr3: data[:end]           :    15.9281 seconds
zarr3: data[::step]         :    10.1128 seconds
zarr3: set_values()         :    29.4887 seconds
```
This more realistic benchmark shows the clear performance improvements that Zarr Python 3 delivers for typical use cases.
We should still definitely continue to work to improve the indexing in Zarr Python.
Thanks for the info. It is indeed an artificial example, which we use to test indexing in anemoi-datasets. Anemoi's indexing virtually combines several Zarr datasets (join, subset, etc.) and internally calls Zarr's indexing. The aim of these tests is to check our code, not Zarr's. I simply noticed that our tests took longer to complete with version 3, so I investigated. As you mentioned @rabernat, real examples will use compression and a file or bucket store.
You may close this issue if you wish.
These are absolutely real issues we should fix! I just wanted to share the counter-example so people don't freak out too much when they see this issue. 😁
> Anemoi's indexing virtually combine several Zarr (join, subset, etc.) and will internally call Zarr's indexing
This is interesting to us independently of this issue. Xarray has some support for such virtual combinations, and we have considered breaking this out into a standalone package (see https://github.com/pydata/xarray/issues/5081). Would love to learn more about your use case.
Our use case is to allow researchers to combine data from several datasets for training:
- Sub-setting (in time).
- Combining (in time, variables, ensemble and grids).
- Selecting variables.
- Selecting grid points.
Examples:
- Train on 80% of the dataset, validate on the other 20%.
- Create a "stretch-grid" dataset by combining a global model and a high-res local area model (see cutout).
- Replace one or more variables (e.g. SST) in a dataset by one from another dataset.
- Apply grid thinning for rapid model prototyping.
- Create an ensemble from several datasets.
In a nutshell:
- We create several Zarr datasets with what we think is best for training a forecast.
- Researchers run experiments, and mix and match various sources.
- We create new Zarr datasets based on their feedback.
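The "mix and match" step above can be illustrated with a toy sketch of a virtual join along the date axis: a thin wrapper that routes indexing to the underlying arrays without copying. The class name and API here are made up for illustration; anemoi-datasets' real implementation differs.

```python
class VirtualJoin:
    """Present several date-indexed arrays as one, without copying."""

    def __init__(self, *sources):
        self.sources = sources
        self.offsets = []  # cumulative start index of each source
        total = 0
        for s in sources:
            self.offsets.append(total)
            total += len(s)
        self.length = total

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        # Integer indexing only, for brevity; slices would dispatch similarly.
        if i < 0:
            i += self.length
        # Find the source whose range contains i, scanning from the last one.
        for start, s in zip(reversed(self.offsets), reversed(self.sources)):
            if i >= start:
                return s[i - start]
        raise IndexError(i)

# Two "datasets" of 3 and 2 dates each, joined in time:
joined = VirtualJoin(["d0", "d1", "d2"], ["d3", "d4"])
print(len(joined), joined[3], joined[-1])  # 5 d3 d4
```

Each `__getitem__` on the joined view translates into exactly one lookup on one source, which is why the cost of the underlying library's indexing path matters so much here.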
At present, a typical Zarr is 0.5 TB (65k dates, 100 variables, 1 member, 40k grid points), corresponding to 45 years of ERA5 6-hourly, at N320 resolution. We have other datasets with more dates (e.g. 45 years hourly), more variables (e.g. model levels) or more grid points (up to O2560). The largest one so far is around 80 TB. Compression is between 30% and 50%. Chunking is one chunk per date when possible (not for large grids, because of the 2 GB limit of some codecs, which will be lifted when using Zarr 3, if I understand correctly).
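The 2 GB codec limit mentioned above can be checked with simple arithmetic (a sketch; float32 storage is assumed, and the large grid's point count is illustrative, not the real O2560 size):

```python
ITEMSIZE = 4  # assuming float32 storage

def chunk_bytes(n_vars: int, n_members: int, n_grid: int) -> int:
    """Size of a one-date chunk: (1 date, variables, members, grid points)."""
    return 1 * n_vars * n_members * n_grid * ITEMSIZE

# N320 case from the text: 100 variables, 1 member, ~40k grid points.
n320 = chunk_bytes(100, 1, 40_000)
print(n320 / 1e6, "MB")  # 16.0 MB -- comfortably under 2 GB

# A much larger grid: one-date chunks blow past the 2 GB codec limit,
# forcing chunking along another dimension in Zarr 2.
large = chunk_bytes(100, 1, 26_000_000)
print(large / 1e9, "GB")  # 10.4 GB
```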
Xarray is great at these sorts of selection / merging / pre-processing operations. Is there a reason you can't use or at least build on Xarray here? This blog post is a little old but explains how many people are using Xarray and Zarr in AI/ML training workflows: https://earthmover.io/blog/cloud-native-dataloader
We use Zarr directly. Also our datasets are not organised according to the CF conventions. We want all variables for one date in the same chunk, not one array per variable, i.e. one single I/O for the whole state of the atmosphere for a given time. The idea is to have a chunk representing exactly the tensor that we provide to the model during training. You can see these datasets as pre-processed data, a la TFRecord. We do use Xarray to create the datasets though.
> We want all variables for one date in the same chunk, not one array per variable... The idea is to have a chunk representing exactly the tensor that we provide to the model during training.
That is a smart approach. That's also the approach we took when optimizing AIFS inference in the cloud. However, that's certainly not mutually exclusive with using Xarray. Xarray absolutely does not require that data conform to CF conventions.
To use Xarray in this way, you would use something like `variable` as a dimension name, e.g.
```python
import xarray as xr
import numpy as np

ds = xr.DataArray(
    np.ones((6, 180, 360)),
    dims=["variable", "lat", "lon"],
    coords={"variable": ["a", "b", "c", "d", "e", "f"]},
    name="all_data",
).to_dataset()
```
```
<xarray.Dataset>
Dimensions:   (variable: 6, lat: 180, lon: 360)
Coordinates:
  * variable  (variable) <U1 'a' 'b' 'c' 'd' 'e' 'f'
Dimensions without coordinates: lat, lon
Data variables:
    all_data  (variable, lat, lon) float64 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0
```
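With variables on a plain dimension like this, the "selecting variables" use case listed earlier reduces to label-based indexing. A minimal sketch (the variable names here are made up):

```python
import numpy as np
import xarray as xr

# Toy dataset: 4 "variables" stacked along an ordinary dimension.
ds = xr.DataArray(
    np.arange(4 * 2 * 3, dtype="float64").reshape(4, 2, 3),
    dims=["variable", "lat", "lon"],
    coords={"variable": ["t2m", "sst", "u10", "v10"]},
    name="all_data",
).to_dataset()

# Select a subset of variables by label; the result is still one array,
# so a training batch remains a single tensor.
subset = ds.all_data.sel(variable=["t2m", "u10"])
print(dict(subset.sizes))
```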
Here's a more realistic example for AIFS initial conditions: https://github.com/earth-mover/aifs-demo/blob/c5eecf90de25f036f1528a87843f2cd0fca94e0b/main.py#L91-L102. You can also browse this dataset online here: https://app.earthmover.io/earthmover-public/aifs-initial-conditions (free but login required).
Within the current Python AI / weather community, Xarray is where many of the requirements you listed above are currently implemented. I'd encourage you to give Xarray a closer look, as it may be more sustainable for you to build on Xarray rather than rebuilding these capabilities in a new package.