arviz
ENH: Enable `xr.open_mfdataset` like functionality
Tell us about it
There are some use cases where a model can be decomposed into many parallel runs -- for instance, breaking up a high-dimensional dataset into univariate datasets, where each dimension is fitted independently.
When fitting these models via MCMC or similar, the resulting workflow can generate many az.InferenceData objects.
In my own use cases, this could be anywhere from hundreds to tens of thousands of dimensions.
If the standard xr.concat were used to combine these models, it could take days to merge them all -- even with dask enabled. There is a workaround: load all of the az.InferenceData objects into memory and rechunk everything into dask, but it is non-trivial.
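A rough sketch of that workaround, using plain xarray Datasets to stand in for the posterior group of each InferenceData (the `feature` dimension name and the chunk size are made up for illustration):

```python
import numpy as np
import xarray as xr

# Each az.InferenceData wraps xarray Datasets, so the per-group workaround
# amounts to: load everything eagerly, concatenate along a new dimension,
# then rechunk into dask so downstream reductions stay out-of-core.
posteriors = [
    xr.Dataset({"beta": (("chain", "draw"), np.random.randn(2, 10))})
    for _ in range(5)  # five stand-in model fits
]
combined = xr.concat(posteriors, dim="feature")  # new leading dimension
combined = combined.chunk({"feature": 2})        # arbitrary chunk size
print(combined.beta.shape)  # (5, 2, 10)
```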
Thoughts on implementation
It could be useful if xr.open_mfdataset-like functionality were available to merge many az.InferenceData objects from many files. Furthermore, xr.open_mfdataset has out-of-the-box support for dask, which helps better leverage parallelism when reading these files into memory.
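For reference, a minimal sketch of the xarray pattern being referred to, writing a few tiny files and reading them back lazily (the `feature` dimension is a placeholder for whatever indexes the per-dimension fits):

```python
import os
import tempfile
import xarray as xr

# Write a few tiny datasets to netCDF, one file per "feature".
tmpdir = tempfile.mkdtemp()
for i in range(3):
    ds = xr.Dataset({"beta": ("draw", [float(i)] * 4)}, coords={"draw": range(4)})
    ds.to_netcdf(os.path.join(tmpdir, f"run_{i}.nc"))

# Open them back as one lazily concatenated, dask-backed dataset; with
# parallel=True the per-file open calls are wrapped in dask.delayed.
combined = xr.open_mfdataset(
    os.path.join(tmpdir, "run_*.nc"),
    combine="nested",
    concat_dim="feature",  # new dimension indexing the source files
    parallel=False,
)
print(dict(combined.sizes))  # {'feature': 3, 'draw': 4}
```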
The actual implementation
If I had to guesstimate what this would look like (based on the xarray source), it would be something like the following:

```python
from typing import List

import dask
import arviz as az

def open_mf_inference_data(inf_paths: List[str], coords: dict,
                           concatenate_name: str, group_kwargs: dict,
                           parallel: bool = True) -> az.InferenceData:
    open_f = dask.delayed(az.from_netcdf) if parallel else az.from_netcdf
    inf_list = [open_f(path, group_kwargs=group_kwargs) for path in inf_paths]
    if parallel:
        # dask.delayed returns lazy objects; materialize before concatenating
        inf_list = list(dask.compute(*inf_list))
    return concatenate_inferences(inf_list, coords, concatenate_name)
```
where concatenate_inferences is already defined here. This implementation works with hundreds of files, but it breaks down with tens of thousands. I have been able to get an implementation working at that scale, but it is quite a hack, requiring subsetting the problem into smaller, manageable chunks.
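The chunked hack could be sketched as a batched reduction; the helper name `concat_batched` and the batch size are made up here, and az.concat stands in for whatever concatenation step is actually used:

```python
import arviz as az

def concat_batched(idatas, batch_size=100):
    """Hypothetical batched concatenation: combine in groups of
    `batch_size`, then combine the partial results, repeating until one
    object remains -- avoids a single enormous concat over tens of
    thousands of InferenceData objects."""
    while len(idatas) > 1:
        idatas = [
            az.concat(idatas[i:i + batch_size], dim="chain")
            for i in range(0, len(idatas), batch_size)
        ]
    return idatas[0]
```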
Relevant posts
- https://github.com/arviz-devs/arviz/pull/1749
- https://github.com/pydata/xarray/discussions/5657
- https://github.com/gibsramen/BIRDMAn/issues/57

Relevant papers that leverage this type of approach
- https://science.sciencemag.org/content/364/6435/89.abstract
- https://www.biorxiv.org/content/10.1101/757096v1
I imagine this will become increasingly common when dealing with high dimensional biological data.
@OriolAbril, do you anticipate that this sort of functionality would be useful to have in arviz?
CC @gibsramen
Hi, just an interesting take. We have an az.concat function, but it is not really meant for anything this big. To consider something like this, I think the new functionality should work against az.concat(dim="chain") (or dim="draw"). I'm not sure whether it should work against groups as well.
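For reference, a toy example of the existing az.concat behavior being referred to (small synthetic data, not the large-scale use case from the issue):

```python
import numpy as np
import arviz as az

# Two small InferenceData objects with the same variables; az.concat with
# dim="chain" stacks them along the chain dimension (chain indices are
# renumbered by default via reset_dim=True).
idata_a = az.from_dict(posterior={"mu": np.random.randn(2, 50)})
idata_b = az.from_dict(posterior={"mu": np.random.randn(2, 50)})
combined = az.concat(idata_a, idata_b, dim="chain")
print(combined.posterior.sizes["chain"])  # 4
```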
I wonder if there is a way to create the final xarray object without creating many intermediate xarray objects along the way.