global-workflow
gdaseupd memory issues on Hera
What is wrong?
The gdaseupd job intermittently fails on Hera with memory issues when run at C384. The job usually runs successfully on the second attempt. Reported by @wx20jjung @CatherineThomas-NOAA.
What should have happened?
The job should have enough memory to complete successfully.
What machines are impacted?
Hera
Steps to reproduce
Run a C384 experiment with 80 members. Eventually, an eupd job will fail.
Additional information
N/A
Do you have a proposed solution?
@wx20jjung found that changing the runtime layout to 5 PEs per node with 8 threads each (instead of 8 PEs per node with 5 threads each) and 80 PEs total (instead of 270) fixes the failures. With the new layout, wait times were much shorter and the run time was only about 5 minutes longer.
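To make the trade-off concrete, here is a rough sketch of the arithmetic behind the two layouts. It is not taken from the workflow configuration: the `Layout` class and its field names are hypothetical, and it assumes each PE is one MPI task and that a node's memory is shared evenly among the tasks placed on it, which is a simplification.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """Hypothetical description of a job layout (names are illustrative only)."""
    total_pes: int       # total MPI tasks (PEs)
    pes_per_node: int    # MPI tasks placed on each node
    threads_per_pe: int  # threads per MPI task

    @property
    def nodes(self) -> int:
        # Whole nodes needed to hold all tasks
        return math.ceil(self.total_pes / self.pes_per_node)

    @property
    def cores_per_node(self) -> int:
        # Cores used on each node: tasks times threads
        return self.pes_per_node * self.threads_per_pe

    @property
    def node_memory_share(self) -> float:
        # Fraction of a node's memory per task, ASSUMING an even split
        return 1.0 / self.pes_per_node


original = Layout(total_pes=270, pes_per_node=8, threads_per_pe=5)
proposed = Layout(total_pes=80, pes_per_node=5, threads_per_pe=8)

for name, layout in (("original", original), ("proposed", proposed)):
    print(
        f"{name}: {layout.nodes} nodes, "
        f"{layout.cores_per_node} cores used per node, "
        f"{layout.node_memory_share:.3f} of node memory per task"
    )

# Relative increase in memory available to each task under the even-split assumption
memory_gain = proposed.node_memory_share / original.node_memory_share
print(f"memory per task grows by a factor of {memory_gain:.2f}")
```

Under these assumptions, both layouts use 40 cores per node, but the proposed layout needs 16 nodes instead of 34 and gives each task 1.6 times as much of a node's memory. The smaller node request may be part of why wait times dropped, though the issue does not state that.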