
ctsm-fates nuopc driver run fails on lobata with debug off, but passes with debug on

Open glemieux opened this issue 2 years ago • 3 comments

Brief summary of bug

While trying to run a 10-year 1x1_brazil case on lobata with the nuopc driver, to serve as a baseline for comparison, I found that the run fails when debug is set to FALSE but passes when it is set to TRUE. By contrast, an mct driver case runs to completion with debug set to FALSE.

General bug information

CTSM version you are using: ctsm5.1.dev095

Does this bug cause significantly incorrect results in the model's science? No

Configurations affected: ctsm-fates

Details of bug

The failure was reproduced twice. The error messages differ, although the stack traces appear to be a close match. Only part of each stack trace is shown below.

First error case:

 ERROR:  ERROR: One or more of the output from CLM to the coupler are NaN
#0  0x7f2b8809dd21 in ???
#1  0x557882a7167a in ???
#2  0x557882a7180b in ???
#3  0x5578824d9e83 in ???
#4  0x5578824cfacb in ???
#5  0x5578824d455f in ???
#6  0x5578824ca35e in ???
#7  0x7f2b8961ca1a in _ZNK5ESMCI13MethodElement7executeEPvPi
        at /home/glemieux/local/esmf/src/esmf-ESMF_8_2_0/src/Superstructure/Component/src/ESMCI_MethodTable.C:377
#8  0x7f2b8961dabf in _ZN5ESMCI11MethodTable7executeENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPvPiPb
        at /home/glemieux/local/esmf/src/esmf-ESMF_8_2_0/src/Superstructure/Component/src/ESMCI_MethodTable.C:563
#9  0x7f2b8961c549 in c_esmc_methodtableexecute_
        at /home/glemieux/local/esmf/src/esmf-ESMF_8_2_0/src/Superstructure/Component/src/ESMCI_MethodTable.C:317

Second error case:

 ERROR: (cpl:utils:check_for_errors) ERROR: Bottom layer specific humidty sent from the atmosphere model is less than zero
#0  0x7fa6b524ed21 in ???
#1  0x55798cd257cc in ???
#2  0x55798cd2595d in ???
#3  0x55798c78e05e in ???
#4  0x55798c7894cb in ???
#5  0x55798c77e35e in ???
#6  0x7fa6b67cda1a in _ZNK5ESMCI13MethodElement7executeEPvPi
        at /home/glemieux/local/esmf/src/esmf-ESMF_8_2_0/src/Superstructure/Component/src/ESMCI_MethodTable.C:377
#7  0x7fa6b67ceabf in _ZN5ESMCI11MethodTable7executeENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPvPiPb
        at /home/glemieux/local/esmf/src/esmf-ESMF_8_2_0/src/Superstructure/Component/src/ESMCI_MethodTable.C:563
#8  0x7fa6b67cd549 in c_esmc_methodtableexecute_
        at /home/glemieux/local/esmf/src/esmf-ESMF_8_2_0/src/Superstructure/Component/src/ESMCI_MethodTable.C:317

Important details of your setup / configuration so we can reproduce the bug

The two failed cases occurred at different times in the run: the first failed 7 years into the run, the second 4 years into the run. Given the error message about the atmosphere model, I reviewed the last lines of atm.log and provide them here for reference:

First case:

 atm : model date     20070623       45000
 atm : model date     20070623       46800
 atm : model date     20070623       48600
 atm : model date     20070623       50400
 atm : model date     20070623       52200
 atm : model date     20070623       54000
(shr_strdata_readstrm) reading file ub: /data/cesmdataroot/inputdata/atm/datm7/atm_forcing.datm7.cruncep_qianFill.0.5d.v7.c160715/Precip6Hrly/clmforc.cruncep.V7.c2016.0.5d.Prec.2007-06.nc      92

Second case:

 atm : model date     20040925       23400
 atm : model date     20040925       25200
 atm : model date     20040925       27000
 atm : model date     20040925       28800
 atm : model date     20040925       30600
 atm : model date     20040925       32400
(shr_strdata_readstrm) reading file ub: /data/cesmdataroot/inputdata/atm/datm7/atm_forcing.datm7.cruncep_qianFill.0.5d.v7.c160715/Precip6Hrly/clmforc.cruncep.V7.c2016.0.5d.Prec.2004-09.nc      99
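
To compare the two failure points, the last `atm : model date` entry of each log can be converted to a timestamp. The following is only a minimal sketch: it assumes the log format shown above (a YYYYMMDD date followed by seconds since midnight), and the function name `last_model_time` is hypothetical, not part of CTSM or CIME.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as " atm : model date     20070623       54000"
# (assumed layout: YYYYMMDD date, then seconds since midnight).
MODEL_DATE = re.compile(r"atm\s*:\s*model date\s+(\d{8})\s+(\d+)")

def last_model_time(log_text):
    """Return the datetime of the last 'model date' line, or None if absent."""
    last = None
    for line in log_text.splitlines():
        match = MODEL_DATE.search(line)
        if match:
            day = datetime.strptime(match.group(1), "%Y%m%d")
            last = day + timedelta(seconds=int(match.group(2)))
    return last

# Tail of each atm.log excerpt quoted in this issue.
first_case = """
 atm : model date     20070623       52200
 atm : model date     20070623       54000
"""
second_case = """
 atm : model date     20040925       30600
 atm : model date     20040925       32400
"""

print(last_model_time(first_case))   # 2007-06-23 15:00:00
print(last_model_time(second_case))  # 2004-09-25 09:00:00
```

In both excerpts the last logged step is immediately followed by a read of the next Precip6Hrly forcing file.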

Machine configuration: OS: Pop!_OS 20.04; Compiler: gnu 9.4.0; ESMF: 8.2.0; MPI: OpenMPI 4.0.3

glemieux • May 17 '22 23:05

I should note that, at @ekluzek's suggestion, I ran this on Cheyenne with debug mode off and using the gnu compiler. The case ran to completion there, which suggests to me that the failure is likely due to an ESMF issue on lobata.

glemieux • May 17 '22 23:05

It would be good to figure out whether this is specific to the compiler or to the ESMF version, so running similar cases on other machines would be useful.

Is this with the gnu compiler? And what version? And what version of ESMF?

ekluzek • May 19 '22 18:05

> Is this with the gnu compiler? And what version? And what version of ESMF?

Sorry, I forgot. I updated the first comment with the machine configuration info at the end:

Machine configuration: OS: Pop!_OS 20.04; Compiler: gnu 9.4.0; ESMF: 8.2.0; MPI: OpenMPI 4.0.3

glemieux • May 19 '22 21:05