Simulations with fewer than 128 processors on Derecho may die due to lack of memory
Brief summary of bug
With ESCOMP/CTSM#3125, and the versions of ccs_config/cime coming in with ctsm5.3.050, simulations on Derecho that use less than a full node of processors (TOTALPES = TASKS * THREADS) may fail after submission because not enough memory is requested.
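To check whether a case falls in that range, the processor layout can be queried from the case directory (a sketch; TOTALPES and MAX_TASKS_PER_NODE are standard CIME XML variables, and MAX_TASKS_PER_NODE is 128 on Derecho):
./xmlquery TOTALPES,MAX_TASKS_PER_NODE,NTASKS,NTHRDS
If TOTALPES comes back smaller than MAX_TASKS_PER_NODE, the case is in the affected range.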
General bug information
CTSM version you are using: [output of git describe]
Does this bug cause significantly incorrect results in the model's science? No
Configurations affected: Derecho with TOTALPES < 128 (such as mpi-serial cases)
Details of bug
This shows up for single-point cases, but it can also show up for cases using only a few processors on a single node.
I also notice it more for FATES cases than non-FATES ones, since FATES uses more memory:
- non-FATES cases use around 3 GBytes
- FATES cases (short ones) use around 6 GBytes

(NOTE: The OS seems to need around 5 GBytes, so the total memory requested needs to be over the sum of the two. For a short FATES case that is roughly 6 + 5 = 11 GBytes, which is why a request of around 10 GBytes per task can fail while 20 GBytes works.)
Important details of your setup / configuration so we can reproduce the bug
The way to trigger it and the workaround for it are the same thing: either lower the memory per task to make it fail, or increase it so that it doesn't fail.
For example, to increase the memory per task to 20 GBytes so a case works:
./xmlchange MEM_PER_TASK=20
./case.setup --reset
./case.build
./case.submit
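Before resubmitting, it can also help to confirm what will actually be requested (a sketch; xmlquery and preview_run are standard CIME case scripts, though their exact output depends on the cime version coming in with ccs_config):
./xmlquery MEM_PER_TASK,TOTALPES
./preview_run
The PBS directives, including the memory request, should also show up in the generated .case.run (or .case.test) script in the case directory, so they can be double-checked there.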
Important output or errors that show the problem
The case fails after submission. It may fail in a few ways:
- Hangs and dies because wallclock exceeded
Here's such a case, which normally takes around 400 seconds, dying after 1200 seconds of wallclock:
Batch output in $CASEROOT/test.*
_gnu.clm-FatesColdSatPhen.GC.fatessci184api4ctsm535fs_gnu/bld/cesm.exe >> cesm.log.$LID 2>&1
=>> PBS: job killed: walltime 1216 exceeded limit 1200
cesm.log:
dec2435.hsn.de.hpc.ucar.edu: rank 0 died from signal 15
email from PBS (if asked for):
PBS Job Id: 9654800.desched1
Job Name: test.SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm60Fates.derecho_intel.clm-FatesFireLightningPopDens--clm-NEON-FATES-NIWO.GC.fatessci184api4ctsm535fs_int
Aborted by PBS Server
Job exceeded resource walltime
See job standard error file
For a case that tested successfully before, it can be helpful to look at the wallclock time of the successful run and see whether the new one is just hanging like this (see the sketch after this list).
- Dies without any indicators
It may die without giving any indicators: no email sent, no information in the batch output file, nothing obvious in the cesm.log or other log files. It usually dies somewhere in initialization, while memory usage is ramping up, before it flattens out once the run starts.
NOTE: FATES memory use keeps growing as it runs, so FATES cases can die long after initialization while memory usage is still increasing. See the med.log file for the memory usage.
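Two quick checks from the case directory can help sort out which of these is happening (a sketch; the timing-file layout and the exact wording of the memory lines in med.log vary with the cime/CMEPS versions, so treat the file names and grep patterns below as assumptions):
grep -i "run time" timing/*
grep -i memory $(./xmlquery --value RUNDIR)/med.log.* | tail
The first shows the wallclock of earlier successful submissions of the case for comparison; the second shows the memory use the mediator reported as the failed run progressed.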
@adrifoster will send @erik a link to a 'long' FATES single point case
In @adrifoster's case I saw memory max out around the third submission at about 5.725 GBytes. After that it even dropped back a little, so it seemed pretty stable. So I think running FATES with 20 GBytes should be fine.
But it probably does mean we should modify FATES cases so that whenever FATES is run it asks for 20 GBytes rather than 10, and NOT just for the test cases.
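Until a default like that is in place, a hedged stopgap for non-test FATES cases is a small user-mods directory (the directory name is up to the user; a shell_commands file in a user-mods directory is a standard CIME mechanism that create_newcase can be pointed at). The shell_commands file would contain just:
./xmlchange MEM_PER_TASK=20
Any FATES case created with that user-mods directory then requests 20 GBytes per task without further hand edits.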