
[Doc]: xCDAT Practical Parallel Notebook

Open pochedls opened this issue 8 months ago • 0 comments

Describe your documentation update

This issue arises from the existing parallel computing guide, which is meant to provide general guidance on parallel computing with dask/xarray/xcdat.

It would be helpful to create some specific documentation on how to parallelize in the xcdat context. The idea would be to show more xcdat-oriented practical examples (without the complications of the dask cluster stuff). I imagine it would do things like this:

  • Download a dataset that is a few GBs to disk (e.g., a piControl file via wget)
  • Show how chunking in time-versus-space affects performance for a given operation (e.g., spatial averaging probably needs time-chunks, but temporal averaging might do fine with space chunks?)
  • Walk through how you might decide on a chunk size (e.g., this is a 4 GB dataset with 100 years or 1200 timepoints, so breaking it into decades [120 months] would give me pretty manageable 400 MB chunks to work on with 5 workers).
  • Maybe show a dask.delayed (or joblib) example, e.g.,
    • First download a number of netCDF files (e.g., a CMIP historical tas simulation broken into ~12 files)
    • Do a glob to get the file list
    • Create a function that opens a file, computes the spatial average, and returns it
    • Run results = dask.delayed(...)
    • Use xr.concat to combine the results into one dataset
    • Compare dask.delayed to serial performance
  • Talk about some xCDAT-specific parallelization considerations (e.g., the FAQs in the existing parallel notebook)
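
The chunk-size reasoning above can be sketched as simple arithmetic. This is only a rough illustration: `chunk_plan` is a hypothetical helper (not part of xcdat, xarray, or dask), and it assumes decimal units (4 GB = 4e9 bytes), equally sized time steps, and one chunk per worker at a time.

```python
import math


def chunk_plan(dataset_bytes, n_time, time_chunk, n_workers):
    """Estimate chunk count, chunk size, and passes for chunking along time.

    Hypothetical helper for illustration only.
    """
    # Number of chunks along the time axis (the last one may be shorter).
    n_chunks = math.ceil(n_time / time_chunk)
    # Assumes every time step occupies the same number of bytes.
    chunk_bytes = dataset_bytes * time_chunk / n_time
    # If each worker holds one chunk at a time, this many passes are needed.
    rounds = math.ceil(n_chunks / n_workers)
    return n_chunks, chunk_bytes, rounds


# The example from the list: 4 GB, 1200 monthly time points, decade chunks, 5 workers.
n_chunks, chunk_bytes, rounds = chunk_plan(4e9, 1200, 120, 5)
chunk_mb = chunk_bytes / 1e6

print(f"{n_chunks} chunks of about {chunk_mb:.0f} MB, processed in {rounds} passes")
```

The chosen time chunk length could then be passed to xarray as `chunks={"time": 120}` when opening the file. Actual memory use will be higher than the chunk size alone, since intermediate results also take space.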
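
The dask.delayed steps above might look roughly like the following. To keep the sketch runnable without downloading anything, it makes two loud assumptions: `load_segment` is a hypothetical stand-in that builds a small synthetic `tas` dataset in place of opening one file from a glob list, and the spatial average uses plain xarray with cosine-latitude weights rather than xcdat's spatial averaging. A real notebook would swap both for the xcdat calls.

```python
import time

import dask
import numpy as np
import xarray as xr


def load_segment(index, n_time=12):
    """Hypothetical stand-in for opening one netCDF file from a file list."""
    rng = np.random.default_rng(index)  # seeded so reruns give identical data
    lat = np.linspace(-87.5, 87.5, 36)
    lon = np.arange(0.0, 360.0, 10.0)
    # Integer month index, continuing from one segment to the next.
    months = np.arange(index * n_time, (index + 1) * n_time)
    data = 288.0 + rng.standard_normal((n_time, lat.size, lon.size))
    return xr.Dataset(
        {"tas": (("time", "lat", "lon"), data)},
        coords={"time": months, "lat": lat, "lon": lon},
    )


def spatial_average(index):
    """Load one segment and return its area-weighted mean time series."""
    ds = load_segment(index)
    # Cosine-latitude weights: an assumption standing in for xcdat's weighting.
    weights = np.cos(np.deg2rad(ds["lat"]))
    mean = ds["tas"].weighted(weights).mean(dim=("lat", "lon"))
    return mean.to_dataset(name="tas")


n_files = 12

# Parallel: build lazy tasks, then compute them together.
start = time.perf_counter()
tasks = [dask.delayed(spatial_average)(i) for i in range(n_files)]
parallel = xr.concat(list(dask.compute(*tasks)), dim="time")
parallel_seconds = time.perf_counter() - start

# Serial: the same function in a plain loop.
start = time.perf_counter()
serial = xr.concat([spatial_average(i) for i in range(n_files)], dim="time")
serial_seconds = time.perf_counter() - start

print(f"delayed: {parallel_seconds:.3f} s, serial: {serial_seconds:.3f} s")
```

With data this small, the scheduling overhead may outweigh any gain, so the timing comparison only becomes meaningful with realistically sized files.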

pochedls, Jun 08 '24 18:06