global-workflow

gdaseupd memory issues on Hera

Open DavidHuber-NOAA opened this issue 3 months ago • 6 comments

What is wrong?

Periodically, the gdaseupd job fails on Hera with memory issues when run at C384. The job usually runs successfully on the second iteration. Reported by @wx20jjung @CatherineThomas-NOAA.

What should have happened?

The job should have enough memory to complete successfully.

What machines are impacted?

Hera

Steps to reproduce

Run a C384 experiment with 80 members. Eventually, an eupd job will fail.

Additional information

N/A

Do you have a proposed solution?

@wx20jjung found that changing the runtime layout to 5 PEs per node with 8 threads (instead of 8 PEs per node with 5 threads) and 80 PEs total (instead of 270) resolves the problem. This resulted in much shorter wait times and only about 5 minutes longer run time.
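
For reference, the arithmetic behind the two layouts can be sketched as follows. This is only an illustration: the helper `layout_summary` and its field names are hypothetical and not part of global-workflow, and the even split of node memory across PEs is an assumption about how the job uses memory, not a measurement.

```python
import math


def layout_summary(total_pes, pes_per_node, threads_per_pe):
    """Summarize a PE/thread layout (hypothetical helper, for illustration only)."""
    return {
        # Nodes needed to place all PEs at the requested density.
        "nodes": math.ceil(total_pes / pes_per_node),
        # Cores occupied on each node (PEs per node times threads per PE).
        "cores_per_node": pes_per_node * threads_per_pe,
        # Fraction of a node's memory available to each PE,
        # assuming memory is shared evenly among the PEs on a node.
        "node_memory_share_per_pe": 1 / pes_per_node,
    }


# Layout values taken from the issue text above.
old = layout_summary(total_pes=270, pes_per_node=8, threads_per_pe=5)
new = layout_summary(total_pes=80, pes_per_node=5, threads_per_pe=8)

memory_gain = new["node_memory_share_per_pe"] / old["node_memory_share_per_pe"]

print("old layout:", old)
print("new layout:", new)
print(f"memory per PE, new vs. old: {memory_gain:.2f}x")
```

Under these assumptions both layouts occupy the same number of cores per node, but the proposed layout gives each PE a larger share of node memory and requests fewer nodes overall, which may be why the job waited less in the queue.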

DavidHuber-NOAA, Apr 03 '24 12:04