
Test/document how recent dask improvements affect particularly sticky calculations

Open • dougiesquire opened this issue 1 year ago • 10 comments

Dask 2022.11.0 includes changes to the way tasks are scheduled that have improved, or made viable, many large geoscience workflows. Some details here and here.
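The headline change is scheduler-side queuing of root tasks: rather than assigning every initial task to workers up front, the scheduler now holds excess tasks back, which keeps worker memory bounded. A minimal sketch of toggling this behaviour for before/after comparisons, assuming dask.distributed >= 2022.11.0 (where queuing became the default via the worker-saturation setting):

```python
import dask
from dask.distributed import Client

# New default in 2022.11.0: queue excess root tasks on the scheduler
# ("worker-saturation": 1.1). Setting it to "inf" restores the old eager
# assignment, which is useful for before/after comparisons.
dask.config.set({"distributed.scheduler.worker-saturation": 1.1})  # or "inf"

client = Client()  # the config must be set before the cluster is created
```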

It could be interesting/helpful to test how this change has impacted any particularly complex/slow/problematic cosima recipes.

dougiesquire avatar Jan 12 '23 01:01 dougiesquire

Good idea, and great news that memory scalability has improved. This has always been the sticking point for me in large calculations, and it would be good to have a recipe providing advice on large calculations.

Dask is now up to v. 2023.1.0. What version are we using in the latest Conda environment?

aekiss avatar Jan 23 '23 01:01 aekiss

analysis3-unstable has 2022.12.1 and analysis has 2022.11.1. I'll probably set up a couple of environments specific to this task: one with the latest dask, and one with a version pre-2022.11.0.

dougiesquire avatar Jan 23 '23 02:01 dougiesquire

I'd love to hear from COSIMA folk whether there are any particular calculations/workflows that have historically been difficult to complete.

dougiesquire avatar Jan 23 '23 02:01 dougiesquire

These notebooks (I think) were flagged as potential candidates:

  • https://github.com/COSIMA/cosima-recipes/blob/master/DocumentedExamples/Binning_transformation_from_depth_to_potential_density.ipynb
  • https://github.com/COSIMA/cosima-recipes/blob/master/DocumentedExamples/Zonally_Averaged_Global_Meridional_Overturning_Circulation.ipynb
  • https://github.com/COSIMA/cosima-recipes/blob/master/DocumentedExamples/Decomposing_kinetic_energy_into_mean_and_transient.ipynb

Some of the above examples use the cosima_cookbook function compute_by_blocks. It may be possible to remove this workaround with the new version of dask.
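For reference, the substitution being proposed is roughly the following; `lazy_result` is a hypothetical stand-in for the large lazy results in those notebooks, and the exact import path for compute_by_blocks is a guess:

```python
import dask.array as da
import xarray as xr

# Hypothetical stand-in for one of the large, lazy results in these notebooks
lazy_result = xr.DataArray(da.random.random((1000, 1000), chunks=(100, 1000)))

# Old workaround: compute chunk-by-chunk to cap memory use
# (exact import path is a guess):
#   from cosima_cookbook import compute_by_blocks
#   result = compute_by_blocks(lazy_result)

# With dask >= 2022.11.0 the improved scheduling may make a plain compute viable:
result = lazy_result.compute()
```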

dougiesquire avatar Jan 23 '23 23:01 dougiesquire

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/cosima-hackathon-v2-0-tuesday-january-24th-2023/307/40

access-hive-bot avatar Jan 23 '23 23:01 access-hive-bot

I chose to focus on the Decomposing_kinetic_energy_into_mean_and_transient example, which heavily uses the compute_by_blocks function. Replacing compute_by_blocks with a regular compute causes dask workers to start spilling to disk, and the notebook dies before completion (using 7 CPUs, 32 GB of memory). This is true even for the very simple TKE calculation.
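For context, the "very simple TKE calculation" is along these lines; this is a sketch only, with the variable names and the v-file pattern assumed rather than taken from the notebook:

```python
import xarray as xr

# Daily 3D velocities; the u-file pattern is from this thread, while the
# v-file pattern and the variable names "u"/"v" are assumptions
u = xr.open_mfdataset("ocean_daily_3d_u_*.nc", parallel=True)["u"]
v = xr.open_mfdataset("ocean_daily_3d_v_*.nc", parallel=True)["v"]

# Transient kinetic energy: kinetic energy of deviations from the time mean
u_anom = u - u.mean("time")
v_anom = v - v.mean("time")
tke = (0.5 * (u_anom**2 + v_anom**2)).mean("time")

tke.compute()  # the step that exhausted worker memory with older dask
```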

It turns out that the grid of the ocean_daily_3d_u_*.nc files used in this notebook changes depending on which output directory you're looking at:

  • files in directories output196-output279 are on a regional domain with yu_ocean: 900, xu_ocean: 3600
  • files in directories output740-output799 are on a global domain with yu_ocean: 2700, xu_ocean: 3600

Currently, in this notebook, the cosima-cookbook tries to concatenate these mismatched grids when loading, which causes big issues for downstream analysis. I'll open a separate issue about this and link it soon (ADDED: link).
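A quick way to see the mismatch (the paths here are illustrative only):

```python
import xarray as xr

# Substitute real paths to one file from each range of output directories
ds_regional = xr.open_dataset("output196/ocean_daily_3d_u.nc")
ds_global = xr.open_dataset("output740/ocean_daily_3d_u.nc")

print(ds_regional.sizes)  # expect yu_ocean: 900,  xu_ocean: 3600 (regional)
print(ds_global.sizes)    # expect yu_ocean: 2700, xu_ocean: 3600 (global)
```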

If we change the notebook to only load the global data and compute the TKE:

  • with dask 2022.10.2: dask workers spill to disk and the notebook dies before completion
  • with dask 2022.11.0: memory is well managed and the calculation completes

So, the scheduling update means that this notebook can now be run without using the compute_by_blocks workaround.
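In glob terms, loading only the global-domain data looks something like the following (the directory layout is assumed; only output740-output799 are matched):

```python
import xarray as xr

# Match only the global-domain outputs, output740-output799
u = xr.open_mfdataset(
    "output7[4-9][0-9]/ocean_daily_3d_u_*.nc", parallel=True
)["u"]
```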

dougiesquire avatar Jan 24 '23 05:01 dougiesquire

Test environments and a stripped-back version of Decomposing_kinetic_energy_into_mean_and_transient.ipynb used for testing are here.

dougiesquire avatar Jan 24 '23 23:01 dougiesquire

@dougiesquire does that mean we can remove compute_by_blocks from all notebooks? Is it only in Decomposing_kinetic_energy_into_mean_and_transient.ipynb? Are there any other changes to notebooks we need to make as a result of these dask improvements?

adele-morrison avatar Apr 16 '24 06:04 adele-morrison

I'm not really sure without testing, sorry. In this particular instance compute_by_blocks was hiding a deeper issue with the data. The new dask scheduler helped, but only once the data issue was resolved. Each case may well be different.

dougiesquire avatar Apr 16 '24 23:04 dougiesquire

Ok, let's work on this further at Hackathon 4.0 then.

Following @dougiesquire's testing above: compute_by_blocks can be removed from the Decomposing_kinetic_energy_into_mean_and_transient notebook.

Perhaps more testing can also be done for other notebooks (though I'm not sure any other notebooks contain compute_by_blocks?).

adele-morrison avatar Apr 16 '24 23:04 adele-morrison