
GCHP 14.3.1 out of memory when writing checkpoint files

Open YanshunLi-washu opened this issue 9 months ago • 3 comments

Name: Yanshun Li
Institution: Washu

Dear Support Team,

I have recently been running GCHP 14.3.1 on the NASA Pleiades cluster at C360 resolution for a global simulation.

The model ran well with an average throughput of 3.5 when using 504 cores (21 nodes x 24 cores/node).

However, when I increased the number of cores to 1200 (50 nodes x 24 cores/node), the model stopped while writing the first checkpoint file. I have encountered the same issue several times: each time, the program stops right at the step that writes the first checkpoint file. The error message I received by email is listed below:

"Your Pleiades job 19364662.pbspl1.nas.nasa.gov terminated due to one or more nodes running out of memory. Node r515i2n3 ran out of memory and rebooted; others may have run out of memory as well."
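For reference, here is a minimal Python sketch of the arithmetic behind the two configurations described above. It assumes the usual cubed-sphere layout of 6 faces with N x N columns each for a CN grid; the function name `layout` is only illustrative and is not part of GCHP. It only shows how the horizontal grid is divided among cores and does not by itself explain the memory failure.

```python
# Assumption: a cubed-sphere CN grid has 6 faces of N x N horizontal columns.
FACES = 6


def layout(nodes, cores_per_node, n=360):
    """Return (total cores, average grid columns per core) for a CN grid."""
    cores = nodes * cores_per_node
    columns = FACES * n * n
    return cores, columns / cores


# The two configurations from this report: 21 and 50 nodes, 24 cores per node.
for nodes in (21, 50):
    cores, per_core = layout(nodes, 24)
    print(f"{nodes} nodes x 24 cores/node = {cores} cores, "
          f"about {per_core:.0f} columns per core")
```

Going from 504 to 1200 cores reduces the number of columns each core owns, so the failure is unlikely to come from the per-core share of the grid alone; that is only an inference from the arithmetic, not something confirmed by the logs.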

There is no other error output in the log. The last few lines of the log are:

"Character Resource Parameter: GCHPchem_INTERNAL_CHECKPOINT_TYPE:pnc4
Using parallel NetCDF for file: Restarts/gcchem_internal_checkpoint.20211002_1200z.nc4"

The relevant files from the run directory are attached: gchp_debug.zip

As far as I know, my colleagues running GCHP 13.4 and 13.2 did not encounter a similar problem.

Based on the information above, could you kindly take a look at this issue?

Thanks, Yanshun

YanshunLi-washu, May 14 '24 08:05