open_virtual_mfdataset
Should we add a top-level function open_virtual_mfdataset? Before committing to one, though, I would like to be more confident about the best way to parallelize reference generation.
https://github.com/zarr-developers/VirtualiZarr/issues/123
https://github.com/zarr-developers/VirtualiZarr/issues/7
Looking at the implementation of xr.open_mfdataset, I'm becoming optimistic that we can parallelize reference generation for concatenation along any number of dimensions using just dask.delayed or lithops.map. Basically we reuse the same machinery xarray does to organise the open_dataset calls into a 1D list, optionally map over that list in parallel, return the (small) virtual datasets to the client over the network, and then perform the multidimensional concatenation on the client.
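The parallel-open-then-client-side-concat pattern described above can be sketched roughly like this. This is a minimal runnable toy, not VirtualiZarr's actual implementation: `open_one` is a stand-in for `virtualizarr.open_virtual_dataset` (which would return a virtual dataset holding only chunk references), and the paths/dimension names are made up.

```python
import dask
import xarray as xr

def open_one(i):
    # Stand-in for virtualizarr.open_virtual_dataset(path): in real use
    # this would generate references for one file and return a small
    # virtual dataset, cheap to ship back to the client.
    return xr.Dataset({"a": ("time", [i])}, coords={"time": [i]})

# 1. Organise the per-file open calls into a 1D list of delayed tasks
#    (the same shape xr.open_mfdataset uses internally).
tasks = [dask.delayed(open_one)(i) for i in range(4)]

# 2. Map over them in parallel; only the small virtual datasets travel
#    back to the client, never the chunk data itself.
vds_list = dask.compute(*tasks)

# 3. (Multidimensional) concatenation happens on the client.
combined = xr.combine_nested(list(vds_list), concat_dim="time")
print(combined.sizes["time"])  # 4
```

Swapping `dask.compute` for `lithops.FunctionExecutor.map` would follow the same shape: a serverless map over the 1D list, with the concatenation still happening client-side.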
An implementation of open_virtual_mfdataset was added in #349, but I won't close this yet because there are still some todos:
- [x] Test that the lithops / dask parallelization actually works for non-local executors (see https://github.com/zarr-developers/VirtualiZarr/pull/349#discussion_r2019902793)
- [x] Test that this whole approach actually scales the way I'm hoping it will (see #123)
- [x] Add function to public API
- [x] Add documentation of how to use the function in the user guide
- [ ] Potentially fix the lithops bug upstream so we don't have to dodge it (see https://github.com/zarr-developers/VirtualiZarr/pull/349#discussion_r2013060076)
- [ ] Potentially upstream parts of the code I wrote in #349 into xarray
With #590 I think this can be closed. There are a couple of small known bugs, but the function exists, is documented, and works.