open_virtual_mfdataset

Open TomNicholas opened this issue 1 year ago • 2 comments

Should we add a top-level function open_virtual_mfdataset?

First, though, I would like to be more confident about the best way to parallelize reference generation:

https://github.com/zarr-developers/VirtualiZarr/issues/123

https://github.com/zarr-developers/VirtualiZarr/issues/7

TomNicholas avatar Dec 12 '24 20:12 TomNicholas

Looking at the implementation of xr.open_mfdataset, I'm becoming optimistic that we can parallelize reference generation for concatenation along any number of dimensions using just dask.delayed or lithops.map. Basically, we use the same machinery xarray does to organise the open_dataset calls into a 1D list, optionally map over that list in parallel, and return the (small) virtual datasets to the client over the network. The multidimensional concatenation then happens on the client.
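
To make the shape of that idea concrete, here is a rough sketch of the flatten, map, and reassemble pattern. It is not the VirtualiZarr or xarray implementation: `open_stub`, `flatten`, and `open_many` are hypothetical names, plain dicts stand in for virtual datasets, and the standard-library `ThreadPoolExecutor` stands in for dask.delayed or lithops.map.

```python
from concurrent.futures import ThreadPoolExecutor


def open_stub(path):
    """Hypothetical stand-in for opening one file as a virtual dataset.

    Returns only a small piece of metadata, never array data, which is
    why shipping the results back to the client is cheap.
    """
    return {"source": path}


def flatten(nested, position=()):
    """Yield (grid position, path) pairs from an arbitrarily nested list."""
    if isinstance(nested, list):
        for i, item in enumerate(nested):
            yield from flatten(item, position + (i,))
    else:
        yield position, nested


def open_many(nested_paths, max_workers=4):
    """Open every path in parallel and key the results by grid position."""
    # Organise the open calls into a 1D list, remembering where each came from.
    positions, paths = zip(*flatten(nested_paths))
    # Assumption: a thread pool replaces the dask / lithops executor here.
    # Executor.map returns results in the same order as its inputs.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(open_stub, paths))
    # Back on the client: the positions say how to combine along each dimension.
    return dict(zip(positions, results))


grid = [["a0.nc", "a1.nc"], ["b0.nc", "b1.nc"]]
opened = open_many(grid)
```

In this sketch the final step only rebuilds the position-to-dataset mapping. In a real version, that mapping is what the client-side multidimensional concatenation would consume.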

TomNicholas avatar Dec 15 '24 21:12 TomNicholas

An implementation of open_virtual_mfdataset was added in #349, but I won't close this yet because there are still some todos:

  • [x] Test that the lithops / dask parallelization actually works for non-local executors (see https://github.com/zarr-developers/VirtualiZarr/pull/349#discussion_r2019902793)
  • [x] Test that this whole approach actually scales the way I'm hoping it will (see #123)
  • [x] Add function to public API
  • [x] Add documentation of how to use the function in the user guide
  • [ ] Potentially fix the lithops bug upstream so we don't have to dodge it (see https://github.com/zarr-developers/VirtualiZarr/pull/349#discussion_r2013060076)
  • [ ] Potentially upstream parts of the code I wrote in #349 into xarray

TomNicholas avatar Mar 29 '25 18:03 TomNicholas

With #590 I think this can be closed. There are a couple of small known bugs but the function exists, is documented, and works.

TomNicholas avatar Jun 26 '25 02:06 TomNicholas