Problem running the 'ice' notebooks with 'dask' or in 'serial' mode
Describe the bug
When running the ice notebooks with dask enabled, errors like the following appear:
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:4147
...
...
KilledWorker: Attempted to run task ..
...
When running in serial mode, the kernel dies.
To Reproduce
When running the ice key_metrics notebooks:
- Hemis_seaice_visual_compare_contour.ipynb
- Hemis_seaice_visual_compare_obs_lens.ipynb
use:
cupid-diagnostics -ice
to reproduce the KilledWorker error.
use:
cupid-diagnostics -ice -s
to reproduce the dead-kernel problem.
See the attached detailed logs.
Make sure you have enough memory for this. If you are doing this on casper, I recommend 120 GB.
Thanks for the hint. It is quite possible that memory is the issue.
I am running this on a Norwegian storage machine (not an HPC system). The node CUPiD runs on has 32 CPUs / 64 GB of memory, with a burst mode of 64 CPUs / 128 GB. We share these resources on this node; we do not submit batch jobs to compute nodes with explicit requests for CPUs and memory.
That is understandable. If you try just examples/key_metrics, is this better? This only runs the one notebook, but I have a feeling that it is the more memory-intensive version. How many years are you running? Reducing that will also help reduce the memory.
Yes, for this test I only run one notebook under examples/key_metrics. The diagnostics use 20 years in total.
I will probably leave this issue for now, as the priority is to get most of the CUPiD component diagnostics working with our model settings. But I may port and run CUPiD on our local HPC, which has larger memory capacity.
@YanchunHe -- I was looking at this with @dabail10 and in your serial log I see
Error when executing task 'Hemis_seaice_visual_compare_obs_lens'. Partially executed notebook available at /projects/NS16000B-datalake/CUPiD-src/examples/key_metrics/computed_notebooks/ice/Hemis_seaice_visual_compare_obs_lens.ipynb
Is there any useful information in that file (/projects/NS16000B-datalake/CUPiD-src/examples/key_metrics/computed_notebooks/ice/Hemis_seaice_visual_compare_obs_lens.ipynb)?
For the serial run, Hemis_seaice_visual_compare_contour.ipynb is reporting the following:
---------------------------------------------------------------------------
Exception encountered at "In [12]":
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[12], line 1
----> 1 client.shutdown()
AttributeError: 'NoneType' object has no attribute 'shutdown'
which is a known issue (see #266 for resolution).
Your dask run is reporting WARNING - Unmanaged memory use is high and Unmanaged memory: 1.45 GiB -- Worker memory limit: 2.00 GiB. If your nodes have 32 CPUs and 64 GB of memory, perhaps you can try reducing the number of workers (and letting some CPUs sit idle) to increase the memory per worker. For example, if you only request 8 CPUs instead of the full 32, then each worker should have 8 GiB of memory to work with.
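As a rough sketch only (using the standard dask.distributed LocalCluster API, not CUPiD's exact setup code), trading workers for per-worker memory looks roughly like this; the worker count and memory figures below are illustrative assumptions for a 64 GB node:

```python
from dask.distributed import Client, LocalCluster

# Illustrative: 8 workers with 8 GiB each on a 64 GB node, instead of many
# workers that each end up with only ~2 GiB. The remaining CPUs sit idle.
cluster = LocalCluster(
    n_workers=8,
    threads_per_worker=1,
    memory_limit="8GiB",  # per-worker limit enforced by the worker nanny
)
client = Client(cluster)
print(client)
```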
Thanks for the reply @mnlevy1981
I was running CUPiD on the login node, not through a batch job on the back end.
Is there a way to specify the number of CPUs for a normal CUPiD job, in some configuration file, or as an optional parameter?
So, on our machines when running CUPiD from the command line, we start an interactive session. For us using PBS we do the following:
qinteractive -l select=1:ncpus=16:mpiprocs=16:mem=120G -A NCGD0039 -l walltime=06:00:00
OK, thanks! The node I am using has 32 CPUs and 64 GB of shared memory. In this case, I don't know how many CPUs/cores and how much memory CUPiD will take by default without explicitly specifying them.
When the notebook is run by CUPiD, it uses a dask LocalCluster object that detects how many cores and how much memory is available. If you run a job on N cores, then I believe dask will use N-2 of them for workers -- one will be used to run the notebook, one will be used as the dask task scheduler and the rest are available for parallelization.
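For reference, here is a minimal sketch of what a bare LocalCluster does when given no arguments, assuming the standard dask.distributed API (CUPiD's actual invocation, including the N-2 bookkeeping described above, may differ):

```python
from dask.distributed import Client, LocalCluster

# With no arguments, LocalCluster inspects the machine and divides the
# detected cores and memory among the workers it starts.
cluster = LocalCluster()
client = Client(cluster)

# Show how many workers were started and the memory limit assigned to each.
for addr, info in client.scheduler_info()["workers"].items():
    print(addr, info["nthreads"], "threads,", info["memory_limit"], "bytes")
```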
Thanks!
I tried setting this in config.yml:
lc_kwargs:
  threads_per_worker: 1
  n_workers: 1
And the computed notebook of Hemis_seaice_visual_compare_obs_lens does show that 1 worker is used:
LocalCluster (364bf0d2)
Dashboard: http://127.0.0.1:8787/status
Workers: 1
Total threads: 1
Total memory: 128.00 GiB
Status: running
Using processes: True
However, the cupid-diagnostics stdout still reports the task being attempted on 4 different workers:
KilledWorker: Attempted to run task ('sum-sum-aggregate-mul-1582fccb4b4c8751fb542539a88aa25c', 0, 0, 9) on 4 different workers, but all those workers died while running it. The last worker that attempt to run the task was tcp://127.0.0.1:44449. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see https://distributed.dask.org/en/stable/killed.html.
I will leave this issue for a while until I have concrete ideas to try 😅