GCHP
GCHP 14.3.1 out of memory when writing checkpoint files
Name: Yanshun Li
Institution: Washu
Dear Support Team,
I have recently been running GCHP 14.3.1 on the NASA Pleiades cluster at C360 resolution for a global simulation.
The model ran well with an average throughput of 3.5 when using 504 cores (21 nodes x 24 cores/node).
However, when I increased the number of cores to 1200 (50 nodes x 24 cores/node), the model stopped while writing the first checkpoint file. I have encountered the same issue several times; each time, the program stopped right at the line that writes the first checkpoint file. The error message I received by email is below:
"Your Pleiades job 19364662.pbspl1.nas.nasa.gov terminated due to one or more nodes running out of memory. Node r515i2n3 ran out of memory and rebooted; others may have run out of memory as well."
There is no other error output in the log. The last few lines of the log are:
"Character Resource Parameter: GCHPchem_INTERNAL_CHECKPOINT_TYPE:pnc4
Using parallel NetCDF for file: Restarts/gcchem_internal_checkpoint.20211002_1200z.nc4"
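To make the two configurations concrete, here is a minimal sketch of the core-count arithmetic. It assumes only that a C360 cubed-sphere grid has 6 faces of 360 x 360 columns. The helper names are hypothetical and not part of GCHP, and the sketch says nothing about the actual per-node memory use, which I have not measured.

```python
# Illustrative arithmetic only; helper names are hypothetical, not GCHP code.

FACES = 6  # a cubed-sphere grid has six faces


def total_cores(nodes: int, cores_per_node: int) -> int:
    """Total cores requested for the job."""
    return nodes * cores_per_node


def columns_per_core(resolution: int, cores: int) -> float:
    """Average number of grid columns handled by each core at C<resolution>."""
    return FACES * resolution * resolution / cores


if __name__ == "__main__":
    for nodes in (21, 50):  # the working and the failing configuration
        cores = total_cores(nodes, 24)
        avg = columns_per_core(360, cores)
        print(f"{nodes} nodes -> {cores} cores, about {avg:.0f} columns per core")
```

The failing run therefore has fewer grid columns per core than the working one, so the decomposed model state per core should be smaller, not larger.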
The relevant files from the run directory are attached: gchp_debug.zip
As far as I know, my colleague running GCHP 13.4 and 13.2 did not encounter similar issues.
Based on the information above, could you please take a look at this issue?
Thanks, Yanshun