xcdat
xcdat copied to clipboard
[Doc]: xCDAT Practical Parallel Notebook
Describe your documentation update
This issue arises from the existing parallel computing guide that is meant to provide some general guidance on parallel computing with dask/xarray/xcdat.
It would be helpful to create some specific documentation on how to parallelize in the xcdat context. The ideas would be to show more xcdat-oriented practical examples (without the complications of the dask cluster stuff). I imagine it would do things like this:
- Download a dataset that is a few GBs to disk (e.g., a piControl file via wget)
- Show how chunking in time-versus-space affects performance for a given operation (e.g., spatial averaging probably needs time-chunks, but temporal averaging might do fine with space chunks?)
- Walk through how you might decide on a chunk size (e.g., this is a 4 GB dataset with 100 years or 1200 timepoints, so breaking it into decades [120 months], would give me pretty manageable 400 MB chunks to work on with 5 workers).
- Maybe show a
dask.delayed
(or joblib) example, e.g.,- First download a number of netcdfs (e.g., a CMIP historical tas simulation broken into ~12 files)
- Do a glob to get the file list
- Create a function that opens a file, computes the spatial average, returns the spatial average
- run
results = dask.delayed(...)
- Use
xr.concat
to combine the results into one dataset - Compare
dask.delayed
to serial performance
- Talk about some xCDAT-specific parallelization considerations (e.g., the FAQs in the existing parallel notebook)