xcdat [Doc]: xCDAT Practical Parallel Notebook

[Doc]: xCDAT Practical Parallel Notebook

Open pochedls opened this issue 8 months ago • 0 comments

Describe your documentation update

This issue arises from the existing parallel computing guide that is meant to provide some general guidance on parallel computing with dask/xarray/xcdat.

It would be helpful to create some specific documentation on how to parallelize in the xcdat context. The ideas would be to show more xcdat-oriented practical examples (without the complications of the dask cluster stuff). I imagine it would do things like this:

Download a dataset that is a few GBs to disk (e.g., a piControl file via wget)
Show how chunking in time-versus-space affects performance for a given operation (e.g., spatial averaging probably needs time-chunks, but temporal averaging might do fine with space chunks?)
Walk through how you might decide on a chunk size (e.g., this is a 4 GB dataset with 100 years or 1200 timepoints, so breaking it into decades [120 months], would give me pretty manageable 400 MB chunks to work on with 5 workers).
Maybe show a dask.delayed (or joblib) example, e.g.,
- First download a number of netcdfs (e.g., a CMIP historical tas simulation broken into ~12 files)
- Do a glob to get the file list
- Create a function that opens a file, computes the spatial average, returns the spatial average
- run results = dask.delayed(...)
- Use xr.concat to combine the results into one dataset
- Compare dask.delayed to serial performance
Talk about some xCDAT-specific parallelization considerations (e.g., the FAQs in the existing parallel notebook)

Jun 08 '24 18:06 pochedls

xcdat xcdat copied to clipboard

[Doc]: xCDAT Practical Parallel Notebook

Describe your documentation update

xcdat
xcdat copied to clipboard