Performance roadmap
We want to be able to create, combine, serialize, and use manifests that point to very large numbers of files. The largest Zarr stores we already see have O(1e6) chunks per array, and tens or hundreds of arrays (e.g. the National Water Model dataset).
To virtualize this many references at once, there are multiple places that could become performance bottlenecks:
1. **Reference generation**: `open_virtual_dataset` should create `ManifestArray` objects from URLs with minimal memory overhead. It is unlikely that the current implementation we are using (i.e. kerchunk's `SingleHdf5ToZarr` to generate references for a netCDF file) is as efficient as possible, because it first creates a python dict rather than a more efficient in-memory representation (see the first sketch after this list).
2. **In-memory representation**: It's crucial that our in-memory representation of the manifest be as memory-efficient as possible. What's the smallest that the in-memory representation of e.g. a single chunk manifest containing a million references could be? I think it's only ~100MB, if we use a numpy array (or 3 numpy arrays) and avoid object dtypes. We should be able to test this easily ourselves by creating a manifest with some dummy references data. See #33.
3. **Combining**: If we can fit all the `ManifestArray` objects we need into memory on one worker, this part is easy. Again, using numpy arrays is good because `np.concatenate`/`np.stack` should be memory-efficient (see the second sketch after this list). If the combined manifest is too big to fit into a single worker's memory, then we might want to use dask to create a task graph, with the root tasks generating the references, concatenation happening via a tree of tasks (not technically a reduction, because the output is as big as the input), then writing out chunkwise to some form of storage. Overall the result would be similar to how kerchunk uses `auto_dask`.
4. **Serialization**: Writing out the manifest to JSON as text would create an on-disk representation that is larger than necessary. Writing out to a compressed / chunked format such as parquet could ameliorate this. This could be kerchunk's parquet format, or we could use a compressed format in the Zarr manifest storage transformer ZEP (see discussion in https://github.com/zarr-developers/zarr-specs/issues/287).
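To make the first point concrete, here is a rough sketch (not VirtualiZarr's actual code) of converting a kerchunk-style reference dict for one array into three flat numpy arrays. The `refs` layout, the variable name `uo`, and `chunk_grid_shape` are all made up for illustration:

```python
import numpy as np

# Hypothetical kerchunk-style references for one array with a 2x2 chunk grid
refs = {
    "uo/0.0": ["s3://bucket/file.nc", 0, 100],
    "uo/0.1": ["s3://bucket/file.nc", 100, 100],
    "uo/1.0": ["s3://bucket/file.nc", 200, 100],
    "uo/1.1": ["s3://bucket/file.nc", 300, 100],
}
chunk_grid_shape = (2, 2)  # would normally come from the array's .zarray metadata

# Pre-allocate the three field arrays rather than keeping a dict of python objects
paths = np.empty(chunk_grid_shape, dtype=np.dtypes.StringDType())
offsets = np.empty(chunk_grid_shape, dtype="int32")
lengths = np.empty(chunk_grid_shape, dtype="int32")

for key, (path, offset, length) in refs.items():
    # e.g. "uo/0.1" -> chunk grid index (0, 1)
    idx = tuple(int(i) for i in key.split("/")[-1].split("."))
    paths[idx] = path
    offsets[idx] = offset
    lengths[idx] = length
```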
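And for the combining step, concatenating two in-memory manifests along a chunk-grid axis reduces to concatenating their field arrays. Again a sketch with dummy data, not the actual API:

```python
import numpy as np

# Two dummy paths arrays standing in for the same field of two ManifestArrays
paths_a = np.array([["a.nc", "b.nc"]], dtype=np.dtypes.StringDType())
paths_b = np.array([["c.nc", "d.nc"]], dtype=np.dtypes.StringDType())

# Combining manifests along a chunk-grid axis is just np.concatenate
# on each field array (paths shown; offsets and lengths work the same way)
combined_paths = np.concatenate([paths_a, paths_b], axis=0)
assert combined_paths.shape == (2, 2)
```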
We should try to work out the performance of each of these steps; they are separate, so we can investigate them individually.
cc @sharkinsspatial @norlandrhagen @keewis
To demonstrate my point about (2): it looks like a million references can be stored in memory using numpy in only ~24MB.
```python
In [1]: import numpy as np

# Using numpy 2.0 for the variable-length string dtype
In [2]: np.__version__
Out[2]: '2.0.0rc1'

# The number of chunks in this zarr array
In [3]: N_ENTRIES = 1000000

In [4]: SHAPE = (100, 100, 100)

# Taken from Julius' example in issue #93
In [5]: url_prefix = 's3://esgf-world/CMIP6/CMIP/CCCma/CanESM5/historical/r10i1p1f1/Omon/uo/gn/v20190429/uo_Omon_CanESM5_historical_r10i1p1f1_gn_185001-'

# Notice these strings will have slightly different lengths
In [6]: paths = [f"{url_prefix}{i}.nc" for i in range(N_ENTRIES)]

# Here's where we need numpy 2
In [7]: paths_arr = np.array(paths, dtype=np.dtypes.StringDType).reshape(SHAPE)

# These are the dimensions of our chunk grid
In [8]: paths_arr.shape
Out[8]: (100, 100, 100)

In [9]: paths_arr.nbytes
Out[9]: 16000000

In [10]: offsets_arr = np.arange(N_ENTRIES, dtype=np.dtype('int32')).reshape(SHAPE)

In [11]: offsets_arr.nbytes
Out[11]: 4000000

In [12]: lengths_arr = np.repeat(100, repeats=N_ENTRIES).astype('int32').reshape(SHAPE)

In [13]: lengths_arr.nbytes
Out[13]: 4000000

In [14]: (paths_arr.nbytes + offsets_arr.nbytes + lengths_arr.nbytes) / 1e6
Out[14]: 24.0
```
i.e. only 24MB to store a million references (for one `ManifestArray`) in memory.
This is all we need for (2), I think. It would be nice to put all 3 fields into one structured array instead of 3 separate numpy arrays, but apparently that isn't possible yet (see Jeremy's comment https://github.com/zarr-developers/zarr-specs/issues/287#issuecomment-2096636240). But that's just an implementation detail, as this would all be hidden within the `ManifestArray` class anyway. So it looks like to complete (2) I just need to finish #39, but using 3 numpy arrays instead of one structured array.
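For illustration, here is one hypothetical way the three arrays could be hidden behind a single object. The class name, attributes, and methods below are all made up for this sketch; they are not the actual implementation being built in #39:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ChunkManifestSketch:  # hypothetical, for illustration only
    paths: np.ndarray    # StringDType, shape = chunk grid shape
    offsets: np.ndarray  # int32, same shape
    lengths: np.ndarray  # int32, same shape

    @property
    def chunk_grid_shape(self) -> tuple[int, ...]:
        return self.paths.shape

    def __getitem__(self, idx) -> tuple[str, int, int]:
        # Look up the byte range for one chunk of the chunk grid
        return (
            str(self.paths[idx]),
            int(self.offsets[idx]),
            int(self.lengths[idx]),
        )

    @property
    def nbytes(self) -> int:
        return self.paths.nbytes + self.offsets.nbytes + self.lengths.nbytes
```

Whether it's one structured array or three plain arrays underneath, callers would only see the wrapper object, so the choice can change later without affecting the public API.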
24MB per array means that even a really big store with 100 variables, each with a million chunks, still only takes up 2.4GB in memory - i.e. your xarray "virtual" dataset would be ~2.4GB to represent the entire store. That fits comfortably within a single worker's memory, so I don't think we need dask to perform the concatenation (i.e. we shouldn't need dask for (3)?).
The next thing we should look at is how much space these references would take up on-disk as JSON / parquet / special zarr arrays.
See Ryan's comment (https://github.com/TomNicholas/VirtualiZarr/issues/33#issuecomment-1998639128) suggesting specific compressor codecs to use for storing references in zarr arrays.
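As a very rough way to get an order-of-magnitude feel for the JSON vs compressed difference, one could compare the raw JSON size of a dummy manifest against a compressed version. The snippet below uses zlib from the standard library purely as a stand-in for whatever codec is eventually chosen, with made-up keys and paths; the real sizes for kerchunk's parquet format or a zarr-based manifest would differ:

```python
import json
import zlib

N_ENTRIES = 1_000_000
url_prefix = "s3://bucket/some/long/dataset/prefix/uo_Omon_dummy_"

# Dummy kerchunk-style references: {"key": [path, offset, length], ...}
refs = {f"uo/{i}": [f"{url_prefix}{i}.nc", 0, 100] for i in range(N_ENTRIES)}
as_json = json.dumps(refs).encode()

print(f"JSON:            {len(as_json) / 1e6:.0f} MB")
print(f"zlib-compressed: {len(zlib.compress(as_json, 6)) / 1e6:.0f} MB")
```

Because the paths are highly repetitive, a general-purpose compressor should shrink them dramatically; purpose-built codecs like the ones Ryan suggests should do even better.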
> even a really big store with 100 variables, each with a million chunks
I tried this out, see https://github.com/earth-mover/icechunk/issues/401
This write-up of trying VirtualiZarr + Icechunk on a large dataset (IMERG) is also illuminating (see https://github.com/earth-mover/icechunk-nasa/pull/1 for context)
tl;dr: icechunk needs to reduce the size of / split up its manifests