VirtualiZarr for incrementally-populated arrays
This issue discusses what VirtualiZarr would need to enable incrementally-populated zarr arrays.
Background: Incrementally populated arrays
In the original issue (https://github.com/zarr-developers/zarr-specs/issues/300), I described the pattern of populating Zarr arrays incrementally and on-demand, chunk-by-chunk.
I then described the data structures we use to track and reason about initialized chunks, with a view to a potential future open-source solution.
A typical question the user might ask is "in this area, defined by physical coordinates or a geometry, what chunks were initialized?".
There's a significant overlap between the responsibilities of MetaArrays and VirtualiZarr.
MetaArray example implementation
In DahnJ/metaarrays, I have provided an example implementation of constructing metaarrays from Icechunk's `chunk_coordinates`.
See walkthrough.ipynb for a walkthrough of the API.
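For concreteness, here is a minimal sketch of the core construction, assuming `chunk_coordinates` yields tuples of chunk indices for a given array (the helper and shapes below are illustrative, not the repo's actual API):

```python
import numpy as np

def build_chunk_mask(chunk_coords, grid_shape):
    """Mark initialized chunks on a boolean array over the chunk grid."""
    mask = np.zeros(grid_shape, dtype=bool)
    for idx in chunk_coords:
        mask[idx] = True
    return mask

# A 100x100 array with 10x10 chunks has a 10x10 chunk grid.
mask = build_chunk_mask([(0, 0), (3, 7)], grid_shape=(10, 10))
```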
Integration with VirtualiZarr
The following is a discussion of features of VirtualiZarr that would be needed to support metaarray-like functionality.
Read Icechunk manifests
VirtualiZarr needs to be able to read existing Icechunk manifests. The issue https://github.com/earth-mover/icechunk/issues/104 tracks this.
There's also a question around the performance of reading all the information required to build a ManifestArray, especially compared to simply reading the initialized indexes through `chunk_coordinates`.
Subsetting in the label space
There is a discussion (https://github.com/zarr-developers/VirtualiZarr/issues/51) and an implementation (https://github.com/zarr-developers/VirtualiZarr/pull/499) of slicing along chunk boundaries.
I still haven't looked into this in detail. How VirtualiZarr and MetaArrays represent the indexes is arguably the greatest difference:
- VirtualiZarr indexes individual pixels in label space
- MetaArrays index chunks in label space
It's a question of whether the limitation of slicing only along chunk boundaries can be lifted. The user should be able to simply ask for all chunks intersecting the query's area, and ideally also be able to provide a polygon as the query.
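As a sketch of what lifting that limitation could look like: along each axis, mapping a pixel-space query to the chunks it intersects is a simple rounding operation. The helper below is hypothetical, not existing VirtualiZarr API.

```python
import math

def chunks_intersecting(pixel_slice: slice, chunk_size: int) -> slice:
    """Return the slice of chunk indices intersecting a pixel-space slice."""
    first = pixel_slice.start // chunk_size
    last = math.ceil(pixel_slice.stop / chunk_size)
    return slice(first, last)

# pixels 35..72 of an axis with chunk size 10 touch chunks 3..7
print(chunks_intersecting(slice(35, 73), chunk_size=10))  # slice(3, 8)
```

A polygon query could then be answered by rasterizing the polygon onto the chunk grid, which is exactly the granularity metaarrays operate at.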
Optimization: Reading a subset
In https://github.com/earth-mover/icechunk/issues/401 it is argued that 100 million chunks should cover everyone's needs.
However, the number of chunks in our datasets frequently goes into the billions (the largest has 600 billion), with at most hundreds of millions of initialized chunks.
This might seem excessive, but incrementally-populated arrays bring a natural push towards small chunks. This is because a chunk is the smallest writable unit; otherwise we would end up with partially-initialized chunks, which would make it difficult to reason about which data has been populated.
Furthermore, for certain use cases, such as visualisation, it is preferable to have small chunks.
A monthly time series of high-resolution multi-band data can thus easily run into billions of chunks.
Optimization: Requesting a subset of the manifest
Reading all manifests when we only need a subset is wasteful. This would have to be a feature of Icechunk and the manifest data format.
This is perhaps not necessary in the case of `chunk_coordinates`, which can fetch tens of millions of initialized chunks in seconds. It might, however, be necessary if we wanted to load datasets with billions of (mostly uninitialized) chunks into a ManifestArray.
Discussion
I would love to see MetaArrays dissolve into existing FOSS tooling.
However, it's not clear that VirtualiZarr and MetaArrays overlap sufficiently. Right now, it seems to me that MetaArrays would have to be a whole new duck-array implementation alongside ManifestArray, including its own reading method using `chunk_coordinates`. Perhaps that's stretching the responsibility of VirtualiZarr too far. I will still look into VirtualiZarr more to get a better idea.
A big question for me is to what extent any of this is useful to others. Perhaps MetaArrays are too tailored to our specific use case and only subsets of the functionality should be open-sourced.
The Mapper class (see walkthrough.ipynb) could be a good example. However, it is not clear to me where it would live. Since it deals with label space, it belongs at the level of Xarray rather than Zarr itself. Or perhaps a translation between pixel and chunk indices could live in Zarr itself, making the implementation of MetaArrays easier whilst potentially being useful for other use cases.
Thanks for these thoughts @DahnJ ! I'm still processing them (and I think I need to read them a few more times) but I want to get clear on the different coordinate spaces we are talking about first.
In VirtualiZarr today we deal with 3 distinct "spaces". Imagine a 100x100 array with chunks of size 10x10.
- (i) Chunk coordinate space - this is the "grid of chunks", whose indices are zarr chunk keys or IC "chunk coordinates". This space has size 10x10.
- (ii) Array element space - this is the actual raw array of values, of size 100x100, whose indices are the ijk you would use numpy array indexing on.
- (iii) Xarray label space - this is whatever space `.sel()` allows you to express your queries in, conventionally using the values of another array of size 100x100 - a "coordinate label" array.
Xarray as a package gives you a way to get from (iii) to (ii) via .sel. Xarray is ignorant of (i) but is capable of wrapping array types that internally store (i) (e.g. dask.Array, ManifestArray). VirtualiZarr provides some very basic ways to get from (ii) to (i), and particularly allows the user to perform concatenation in (ii)-space via an implementation that translates into concatenations in (i)-space.
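As a minimal illustration of the (ii) -> (i) direction: an array element index maps to the chunk holding it by integer division with the chunk shape.

```python
# (ii) array element space -> (i) chunk coordinate space
chunk_shape = (10, 10)
element_index = (42, 7)
chunk_index = tuple(i // c for i, c in zip(element_index, chunk_shape))
# chunk_index == (4, 0)
```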
Q's:
- Do these 3 spaces map onto your table? If so, what is your 4th entry?
- What operations do you want to do, in terms of these spaces?
- Do you just want to make queries in (iii) and get results in (i)? Do you want to invert that?
I guess your 4th entry is "chunk physical coordinates", which seems to be the locations of {centroids / boundaries / ...} of chunks (i) in label space (iii).
From https://earthmover-community.slack.com/archives/C07NQCBSTB7/p1747515390511479:
> So the array of values of chunk positions in label space is effectively a coordinate array for a different set of data (the data of chunk locations, which has nothing to do with actual science data). I'm unclear why you need to jump straight from (i) to (iii) like this. Note how similar this is to xarray's `region` kwarg to `.to_zarr`...
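For reference, a minimal illustration of that `region` pattern (the store path, variable name, and chunk size below are placeholders):

```python
import numpy as np
import xarray as xr

# Lay out the full array once, then write one piece of it into the
# existing Zarr store at a known location in index space.
full = xr.Dataset({"v": ("x", np.zeros(100))})
full.to_zarr("store.zarr", mode="w", encoding={"v": {"chunks": (10,)}})

part = xr.Dataset({"v": ("x", np.ones(10))})
part.to_zarr("store.zarr", mode="r+", region={"x": slice(0, 10)})
```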
@mpiannucci pointed out that the HRRR people do something vaguely similar: https://mesowest.utah.edu/html/hrrr/zarr_documentation/html/python_data_loading.html
First of all, thanks for giving this the time and attention! I'm increasingly realizing that the work we do around incrementally-populated zarr arrays is novel/surprising and that I should do a better job at explaining it.
I have provided a new readme in DahnJ/metaarrays that attempts to succinctly motivate the metaarray data structure.
Answers
> I guess your 4th entry is "chunk physical coordinates", which seems to be the locations of {centroids / boundaries / ...} of chunks (i) in label space (iii).
Your understanding of the four spaces is spot on.
Let's think about this from the perspective of the user of a partially-populated Zarr array. Let's say they want to "read data over Columbia", but they are not sure what data is initialized and want to find out.
The user naturally thinks in physical coordinates ((iii) in your list). We then need to translate this information into the chunk index space (i). Metaarrays are what allow us to go from physical coordinates to the chunk index space.
Why do we need the chunk centroids (iv), then? This is because existing tooling understands elements of xarray DataArrays as rectangular pixels, with the coordinates referring to the pixel centroids. By constructing the metaarray with chunk centroids as coordinates, we get a data structure that conforms to this convention. Then all the tooling just works.
For example, `metaarray.plot()` now correctly plots the chunks in physical space.
We can store the metaarray in a file and open it e.g. in QGIS to visualize the chunks over a map.
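A minimal sketch of that construction (grid size, pixel size, and origin below are illustrative):

```python
import numpy as np
import xarray as xr

chunk_size, n_chunks, pixel_size = 10, 10, 1.0
mask = np.zeros((n_chunks, n_chunks), dtype=bool)
mask[3, 7] = True  # one initialized chunk, for illustration

# Centroid of chunk k in label space, assuming the array origin is at 0.
centroids = (np.arange(n_chunks) * chunk_size + chunk_size / 2) * pixel_size
metaarray = xr.DataArray(
    mask.astype(int),
    coords={"y": centroids, "x": centroids},
    dims=("y", "x"),
)
metaarray.plot()  # chunks render at their physical locations
```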
And finally, using rioxarray, our user's question becomes simply
`metaarray.rio.clip([columbia_polygon], all_touched=True)`
HRRR
> @mpiannucci pointed out that the HRRR people do something vaguely similar: https://mesowest.utah.edu/html/hrrr/zarr_documentation/html/python_data_loading.html
If I understood correctly, the HRRR datasets are many different Zarr arrays that are gridded & chunked identically, to make it easier for the user to extract e.g. a time series at a single geographic location, despite the chunks coming from different Zarr arrays.
This does actually exemplify an important difference in how we use Zarr. I haven't fully explored the data, but hopefully I'm not far off in saying that in our data-modelling approach, we would model the HRRR data as a single massive datacube, where `run_hour`, `level_type`, `var_level` etc. would become individual dimensions.
If I'm correct, then the entire data-loading tutorial & tooling wouldn't be necessary. The user could simply `xr.open_zarr().sel()`. There already exists an amazing system for chunk indexing: Zarr itself.
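In other words (the store path, variable, and dimension names below are made up), something like:

```python
import xarray as xr

ds = xr.open_zarr("hrrr.zarr")  # hypothetical single-datacube layout
ts = ds["temperature"].sel(latitude=40.0, longitude=-105.0, method="nearest")
```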
The more difficult part would then be writing into such a datacube as new data comes in, and discovering which data is available. That is precisely what metaarrays enable.