
Build virtual Zarr store using Xarray's dataset.to_zarr(region="...") model

maxrjones opened this issue 1 year ago · 8 comments

I'm wondering how we can build virtualization pipelines as a map rather than a map-reduce process. The map paradigm would follow the structure below from Earthmover's excellent blog post on serverless datacube pipelines, where the virtual dataset from each worker would be written directly to an Icechunk virtual store rather than being transferred back to the coordination node for a concatenation step. As in their post, the parallelization across workers during virtualization could leverage a serverless approach like lithops or Coiled, while Dask concurrency could speed up virtualization within each worker. I think the main feature needed for this to work is an analogue of xr.Dataset.to_zarr(region="..."), complementary to xr.Dataset.to_zarr(append_dim="...") (documented in https://docs.xarray.dev/en/latest/generated/xarray.Dataset.to_zarr.html) (xref https://github.com/zarr-developers/VirtualiZarr/issues/21, https://github.com/zarr-developers/VirtualiZarr/pull/272).
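To make the "map" idea concrete, here is a minimal sketch (pure bookkeeping, no I/O; `region_for_worker` is a hypothetical helper, not an existing API) of how each worker could derive its target region independently, assuming every input file contributes a known, uniform number of steps along the concatenation dimension:

```python
# Hedged sketch: compute the region one worker would pass to a
# to_zarr(region=...)-style call, assuming each file covers a fixed,
# known number of steps along "time". No coordinator round-trip needed.

def region_for_worker(worker_index: int, steps_per_file: int) -> dict:
    """Region selection for one worker, in the style of
    xr.Dataset.to_zarr(region={...}). Pure index arithmetic."""
    start = worker_index * steps_per_file
    return {"time": slice(start, start + steps_per_file)}

# Each of three workers computes its own region with no communication:
regions = [region_for_worker(i, steps_per_file=24) for i in range(3)]
# regions[1] == {"time": slice(24, 48)}
```

The key point is that the region layout is known a priori, so the only shared step is creating the empty store up front; everything else is embarrassingly parallel.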

Image

I tried to check that this feature request doesn't already exist, but apologies if I missed something.

maxrjones avatar Nov 19 '24 17:11 maxrjones

This is a really good question, and one I have also thought about a bit.

I think a big difference is that this "map" approach assumes more knowledge of the structure of the dataset up front. To use Zarr regions you have to know how big the regions are in advance, so you can create an empty Zarr array that will be filled, whereas in the "map-reduce" approach you are able to concatenate files whose shapes you don't know until you open them. Whether that actually matters much in practice is another question.

One thing to note is that IIUC having each worker write a region of virtual references would mean one icechunk commit per region, whereas transferring everything back to one node means one commit per logical version of the dataset, which seems a little nicer.

I think the main feature needed for this to work is an analogue to xr.Dataset.to_zarr(region="...")

I agree.

TomNicholas avatar Nov 19 '24 17:11 TomNicholas

I really like the idea behind this approach @maxrjones! Being able to skip transferring all the references back to a 'reduce' step seems great. Also wondering how all the commits between workers would work with icechunk.

norlandrhagen avatar Nov 19 '24 17:11 norlandrhagen

This also seems like a good way to protect against a single failed reference killing the entire job.

norlandrhagen avatar Nov 19 '24 17:11 norlandrhagen

I think a big difference is that this "map" approach assumes more knowledge the structure of the dataset up front.

:+1:

IIUC having each worker write a region of virtual references would mean one icechunk commit per region,

It does not, once https://github.com/earth-mover/icechunk/pull/357 is in. Though in practice, that also does map-reduce on changesets which will contain virtual references in this case...

dcherian avatar Nov 19 '24 17:11 dcherian

IIUC having each worker write a region of virtual references would mean one icechunk commit per region,

It does not, once https://github.com/earth-mover/icechunk/pull/357 is in. Though in practice, that also does map-reduce on changesets which will contain virtual references in this case...

Thanks for this information! Do you think it's accurate that map-reduce on changesets would still be more performant than map-reduce on virtual datasets when operating on thousands of files or more, such that this feature would be helpful despite not truly being a map operation? Also, do you anticipate there being other prerequisites in addition to your PR before VirtualiZarr could have something like dataset_to_icechunk(region="...")?

maxrjones avatar Nov 19 '24 22:11 maxrjones

At the zarr summit @jbbarre showed us an interesting use case like this.

His use case is a bunch of netCDF files which do basically[^1] tessellate, but cover only the coastal areas of Antarctica.

This is an example of a use case where you do need to use a priori information about where to write the references into a larger array.

For this we need to implement the region kwarg in .vz.to_icechunk()[^2].

[^1]: @jbbarre the details of the overlap could matter a lot here. It's possible that your whole workload won't work because of the overlap. Feel free to raise a separate issue/discussion about these details of your dataset if you want. But the regularly-gridded + no-overlap case could definitely be supported if we had a region kwarg.

[^2]: In theory we could also implement it in .vz.to_kerchunk(), but that would require in-place updates to kerchunk JSON/Parquet files, which might cause all sorts of consistency and parallel access problems.
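For the regularly-gridded, no-overlap case, the a priori information is just each tile's grid position. A minimal sketch (the `tile_region` helper is illustrative, not a real VirtualiZarr API) of mapping a tile's (row, col) to the region of the larger virtual array it would occupy:

```python
# Hedged sketch, assuming a regular tile grid with no overlap: each worker
# holds one tile and computes the region of the full array it writes into.

def tile_region(row: int, col: int, tile_shape: tuple) -> dict:
    """Region selection for the tile at grid position (row, col),
    given a uniform per-tile shape (ny, nx). Pure index arithmetic."""
    ny, nx = tile_shape
    return {
        "y": slice(row * ny, (row + 1) * ny),
        "x": slice(col * nx, (col + 1) * nx),
    }

# The worker holding tile (2, 5) of 512x512 tiles would write into:
tile_region(2, 5, (512, 512))
# {"y": slice(1024, 1536), "x": slice(2560, 3072)}
```

A hypothetical `.vz.to_icechunk(..., region=...)` call could then accept exactly this kind of mapping.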

TomNicholas avatar Oct 22 '25 20:10 TomNicholas

@TomNicholas, thank you for your recommendations and your interest in this. I'll think about a preprocessing step to manage the overlap. Here is the cube grid for clarity:

Image Image

jbbarre avatar Oct 23 '25 06:10 jbbarre

I've been thinking that a "CropCodec" might work for uniform overlaps.
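For a uniform overlap, the cropping itself is simple index arithmetic. A hedged sketch (illustrative names, not an actual codec implementation) of the trimming a "CropCodec" would apply, keeping each tile's unique interior by trimming half the overlap from every edge that has a neighbour:

```python
# Hedged sketch of uniform-overlap cropping: along one axis, the cells a
# tile uniquely owns are its interior, trimmed by overlap/2 on each side
# that borders another tile. Boundary tiles keep their outer edge.

def interior_slice(index: int, n_tiles: int, tile_len: int, overlap: int) -> slice:
    """Slice, in tile-local coordinates, of the cells this tile
    uniquely contributes along one axis."""
    lo = 0 if index == 0 else overlap // 2
    hi = tile_len if index == n_tiles - 1 else tile_len - overlap // 2
    return slice(lo, hi)

# Middle tile of three, length 100, overlap 10 -> keeps cells 5..95:
interior_slice(1, 3, 100, 10)
# slice(5, 95)
```

Applying this per axis before writing would make the cropped tiles tessellate exactly, reducing the problem back to the no-overlap region case.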

Here is an example of a TIFF dataset with no overlaps if anyone is interested: https://icechunk.io/en/latest/ingestion/glad-ingest/

dcherian avatar Oct 24 '25 13:10 dcherian