
ENH: Enable `xr.open_mfdataset`-like functionality


Tell us about it

There are some use cases where a model can be decomposed into many parallel runs -- for instance, breaking a high-dimensional dataset into many univariate datasets, where each dimension is fitted independently.

When fitting these models via MCMC or similar, the result can be many az.InferenceData objects. In my own use cases, this can mean anywhere from hundreds to tens of thousands of dimensions. If the standard xr.concat were used to combine these models, it could take days to merge them all -- even with dask enabled. There is a workaround -- load all of the az.InferenceData objects into memory and rechunk everything with dask -- but it is non-trivial.
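For concreteness, a minimal sketch of that naive approach (the `fits/` directory and the `feature` dimension name are hypothetical):

```python
from glob import glob

import arviz as az
import xarray as xr

# load every per-dimension fit eagerly into memory, then concatenate
# the posterior groups along a new "feature" dimension
paths = sorted(glob("fits/*.nc"))
idatas = [az.from_netcdf(path) for path in paths]
posterior = xr.concat([idata.posterior for idata in idatas], dim="feature")
```

This is the step that scales poorly: every file is read eagerly, and concatenating thousands of in-memory datasets is slow.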

Thoughts on implementation

It would be useful if xr.open_mfdataset-like functionality were available to merge many az.InferenceData objects from many files. Furthermore, xr.open_mfdataset has out-of-the-box support for dask, which helps better leverage parallelism when reading these files into memory.
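For reference, the plain-xarray pattern this proposal mirrors (the glob and concat_dim are illustrative):

```python
import xarray as xr

# lazily open many netCDF files and concatenate them along a new
# dimension; parallel=True wraps each file open in dask.delayed
ds = xr.open_mfdataset(
    "fits/*.nc",
    combine="nested",
    concat_dim="feature",
    parallel=True,
)
```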

The actual implementation

If I had to guesstimate what this would look like (based on the xarray source), it would be something like the following:

```python
from typing import Dict, List

import arviz as az
import dask

def open_mf_inference_data(inf_paths: List[str], coords: Dict, concatenate_name: str, group_kwargs: Dict, parallel: bool = True) -> az.InferenceData:
    if parallel:
        # defer each file read so dask can parallelize the I/O
        open_f = dask.delayed(az.from_netcdf)
    else:
        open_f = az.from_netcdf
    inf_list = [open_f(path, group_kwargs=group_kwargs) for path in inf_paths]
    if parallel:
        # materialize the delayed objects before concatenating
        inf_list = list(dask.compute(*inf_list))
    return concatenate_inferences(inf_list, coords, concatenate_name)
```

where concatenate_inferences is already defined here. This implementation already works with hundreds of files, but it breaks down with tens of thousands of files. I have been able to get an implementation working at that scale, but it is quite a hack, requiring the problem to be subset into smaller, manageable chunks.
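Hypothetical usage of the sketch above (the file layout, coords, and group_kwargs values are made up for illustration):

```python
from glob import glob

# per-feature fits written as fits/feature_0.nc, fits/feature_1.nc, ...
inf_paths = sorted(glob("fits/feature_*.nc"))
combined = open_mf_inference_data(
    inf_paths,
    coords={"feature": list(range(len(inf_paths)))},
    concatenate_name="feature",
    group_kwargs={"posterior": {"chunks": {"draw": 100}}},
    parallel=True,
)
```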

Relevant posts:

https://github.com/arviz-devs/arviz/pull/1749

https://github.com/pydata/xarray/discussions/5657

https://github.com/gibsramen/BIRDMAn/issues/57

Relevant papers that leverage this type of approach:

https://science.sciencemag.org/content/364/6435/89.abstract
https://www.biorxiv.org/content/10.1101/757096v1

I imagine this will become increasingly common when dealing with high-dimensional biological data.

@OriolAbril, do you anticipate that this sort of functionality would be useful to have in arviz?

CC @gibsramen

mortonjt avatar Jul 31 '21 21:07 mortonjt

Hi, this is an interesting take.

We have an az.concat function, but it is not really meant for anything this big. To support something like this, I think the new functionality should work like az.concat(dim="chain") (or dim="draw"). I'm not sure whether it should also work across groups.
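For example, az.concat today combines a few InferenceData objects along an existing dimension (the file names are hypothetical):

```python
import arviz as az

idata1 = az.from_netcdf("run1.nc")
idata2 = az.from_netcdf("run2.nc")

# stack two runs of the same model as additional chains
combined = az.concat(idata1, idata2, dim="chain")
```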

I wonder if there is a way to create the final xarray object without creating many intermediate xarray objects.
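One possibility might be to open each file's groups lazily as dask-backed datasets, so the concatenated result is itself lazy and no intermediate is materialized; a sketch with plain xarray (the paths, group name, and dimension are illustrative):

```python
from glob import glob

import xarray as xr

# hypothetical per-feature .nc files; chunks={} makes each open lazy
# (dask-backed) rather than reading data into memory
paths = sorted(glob("fits/*.nc"))
datasets = [xr.open_dataset(p, group="posterior", chunks={}) for p in paths]

# the result wraps lazy dask arrays; nothing is computed until needed
posterior = xr.concat(datasets, dim="feature")
```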

ahartikainen avatar Aug 17 '21 06:08 ahartikainen