Problem running the 'ice' notebooks with 'dask' or in 'serial' mode
Describe the bug
When running the ice notebooks with dask enabled, errors like the following appear:
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:4147
...
...
KilledWorker: Attempted to run task ..
...
When running in serial mode, the kernel dies.
To Reproduce
When running the ice key_metrics notebooks:
- Hemis_seaice_visual_compare_contour.ipynb
- Hemis_seaice_visual_compare_obs_lens.ipynb
use:
cupid-diagnostics -ice
to reproduce the KilledWorker error.
use:
cupid-diagnostics -ice -s
to reproduce the dead-kernel problem.
See the attached detailed logs.
Make sure you have enough memory for this. If you are doing this on casper, I recommend 120 GB.
Thanks for the hint. It is quite possible that memory is the issue.
I am running this on a Norwegian storage machine (not an HPC system). The node CUPiD runs on has 32 CPUs / 64 GB of memory, with a burst mode of 64 CPUs / 128 GB. We share these resources on this node; we do not submit batch jobs to compute nodes with explicit requests for CPUs and memory.
That is understandable. If you try just examples/key_metrics, is this better? This only runs the one notebook, but I have a feeling that it is the more memory-intensive version. How many years are you running? Reducing that will also help reduce the memory.
Yes, for this test I only run one notebook under examples/key_metrics. The diagnostics use 20 years in total.
I will probably leave this issue for now, as the priority is to get most of the CUPiD component diagnostics working with our model settings. But I may port and run CUPiD on our local HPC, which has larger memory capacity.
@YanchunHe -- I was looking at this with @dabail10 and in your serial log I see
Error when executing task 'Hemis_seaice_visual_compare_obs_lens'. Partially executed notebook available at /projects/NS16000B-datalake/CUPiD-src/examples/key_metrics/computed_notebooks/ice/Hemis_seaice_visual_compare_obs_lens.ipynb
Is there any useful information in that file (/projects/NS16000B-datalake/CUPiD-src/examples/key_metrics/computed_notebooks/ice/Hemis_seaice_visual_compare_obs_lens.ipynb)?
For the serial run, Hemis_seaice_visual_compare_contour.ipynb is reporting the following:
---------------------------------------------------------------------------
Exception encountered at "In [12]":
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[12], line 1
----> 1 client.shutdown()
AttributeError: 'NoneType' object has no attribute 'shutdown'
which is a known issue (see #266 for resolution).
Your dask run is reporting WARNING - Unmanaged memory use is high and Unmanaged memory: 1.45 GiB -- Worker memory limit: 2.00 GiB. If your nodes have 32 CPUs and 64 GB of memory, perhaps you can try reducing the number of workers (and letting some CPUs sit idle) to increase the memory per worker. For example, if you only request 8 CPUs instead of the full 32, then each worker should have 8 GiB of memory to work with.
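As a rough sketch only (using the standard dask.distributed LocalCluster API, not CUPiD's exact setup code), trading workers for per-worker memory looks roughly like this; the worker count and memory figures below are illustrative assumptions for a 64 GB node:

```python
from dask.distributed import Client, LocalCluster

# Illustrative: 8 workers with 8 GiB each on a 64 GB node, instead of many
# workers that each end up with only ~2 GiB. The remaining CPUs sit idle.
cluster = LocalCluster(
    n_workers=8,
    threads_per_worker=1,
    memory_limit="8GiB",  # per-worker limit enforced by the worker nanny
)
client = Client(cluster)
print(client)
```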
Thanks for the reply @mnlevy1981
I was running CUPiD on the login node, not through a batch job on the back end.
Is there a way to specify the number of CPUs for a normal CUPiD job, in some configuration file, or as an optional parameter?
So, on our machines when running CUPiD from the command line, we start an interactive session. For us using PBS we do the following:
qinteractive -l select=1:ncpus=16:mpiprocs=16:mem=120G -A NCGD0039 -l walltime=06:00:00
OK, thanks! The node I am using has 32 CPUs and 64 GB of shared memory. In this case, I don't know how many CPUs/cores and how much memory CUPiD will take by default without explicitly specifying them.
When the notebook is run by CUPiD, it uses a dask LocalCluster object that detects how many cores and how much memory is available. If you run a job on N cores, then I believe dask will use N-2 of them for workers -- one will be used to run the notebook, one will be used as the dask task scheduler and the rest are available for parallelization.
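For reference, here is a minimal sketch of what a bare LocalCluster does when given no arguments, assuming the standard dask.distributed API (CUPiD's actual invocation, including the N-2 bookkeeping described above, may differ):

```python
from dask.distributed import Client, LocalCluster

# With no arguments, LocalCluster inspects the machine and divides the
# detected cores and memory among the workers it starts.
cluster = LocalCluster()
client = Client(cluster)

# Show how many workers were started and the memory limit assigned to each.
for addr, info in client.scheduler_info()["workers"].items():
    print(addr, info["nthreads"], "threads,", info["memory_limit"], "bytes")
```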
Thanks!
I tried setting this in config.yml:
lc_kwargs:
  threads_per_worker: 1
  n_workers: 1
And the computed notebook of Hemis_seaice_visual_compare_obs_lens does show that 1 worker is used:
LocalCluster (364bf0d2)
Dashboard: http://127.0.0.1:8787/status
Workers: 1
Total threads: 1
Total memory: 128.00 GiB
Status: running
Using processes: True
However, the cupid-diagnostics stdout still reports the task being attempted on 4 different workers:
KilledWorker: Attempted to run task ('sum-sum-aggregate-mul-1582fccb4b4c8751fb542539a88aa25c', 0, 0, 9) on 4 different workers, but all those workers died while running it. The last worker that attempt to run the task was tcp://127.0.0.1:44449. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see https://distributed.dask.org/en/stable/killed.html.
I will leave this issue for a while until I have concrete ideas to try 😅