Test `SMS_P12x2.ne4_oQU240.WCYCL1850NS.pm-cpu_intel.allactive-mach_mods` seems to rely on (a bug in) gustiness for stability
When I was running the developer tests for #5850, which I recently rebased onto master, the test `SMS_P12x2.ne4_oQU240.WCYCL1850NS.pm-cpu_intel.allactive-mach_mods` is now crashing. After spending all of yesterday trying to figure this out, I realized that this test can easily be made to fail on the current master simply by turning atmospheric gustiness off. (I originally did this by setting `vmag_gust = 0` in `clubb_intr.F90`. Setting `use_sgv = .false.` in the EAM namelist also causes the failure.)
The failure takes the form of an invalid operation in ELM, which immediately crashes DEBUG runs, and poisons the state with a NaN that causes a crash without DEBUG. This is because `eflx_lwrad_out` is negative, i.e. the upward longwave radiation has the wrong sign (possibly due to large temperature swings? Temperature drops by ~10 K in the grid cells that have negative `eflx_lwrad_out`). The diagnosed surface temperature is proportional to the fourth root of this quantity, so a negative value results in a NaN and a crash. `eflx_lwrad_out` actually becomes negative in every case (even on master) within 3-4 time steps, or at least so it appears when I print it out in `SoilFluxesMod`. But that negative value doesn't always cause an immediate crash. (I have no idea why.) For some commits I tested, the runs crash very early, whereas if I just remove gustiness on master, the crash happens near the end of the 5 day test. (This suggests that if this test were run for longer, it might actually crash on master as well!)
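(For context, the temperature diagnosis mentioned above is essentially a Stefan-Boltzmann inversion of the upward longwave flux. Below is a minimal, standalone Fortran sketch of that relationship, not the actual ELM code; the variable names and sample values are illustrative only.)

```fortran
program lw_to_trad_sketch
  ! Sketch (not the actual ELM code): the diagnosed radiative temperature is
  ! a Stefan-Boltzmann inversion of the upward longwave flux,
  !   t_rad = (eflx_lwrad_out / sb)**(1/4)
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  real(r8), parameter :: sb = 5.67e-8_r8    ! Stefan-Boltzmann constant [W m-2 K-4]
  real(r8) :: eflx_lwrad_out, t_rad

  eflx_lwrad_out = 316.0_r8                 ! a normal value, corresponds to ~273 K
  t_rad = sqrt(sqrt(eflx_lwrad_out/sb))
  print *, 'eflx_lwrad_out =', eflx_lwrad_out, ' -> t_rad =', t_rad

  ! A negative flux makes sqrt(sqrt(...)) an invalid operation: it traps when
  ! floating-point exceptions are enabled (DEBUG builds) and silently yields a
  ! NaN otherwise, matching the observed behavior.
  eflx_lwrad_out = -3.8_r8
  t_rad = sqrt(sqrt(eflx_lwrad_out/sb))
  print *, 'eflx_lwrad_out =', eflx_lwrad_out, ' -> t_rad =', t_rad
end program lw_to_trad_sketch
```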
To me, this behavior seems to imply that with the recent EAMv3 changes, this test case is very close to the edge of stability, close enough that removing surface gustiness can make the run unstable. I don't think it's good to depend on gustiness for stability in general (and, as I mentioned, master may be crashing anyway if we run some of these tests for longer). But I'm not sure which v3 changes have caused this problem, so I have no particular ideas about what to do.

This issue may technically not be causing crashes on master, but it is blocking #5850, so I'm filing it as a bug here.

Randomly tagging @wlin7 @mabrunke @beharrop @bishtgautam as people who may have some ideas here. I'm not sure what to do, or whether this is ultimately an EAM or an ELM problem.
Edit: I should have given this quick three-line way to reproduce the issue, assuming that the current directory is the E3SM source root:

```sh
mkdir components/eam/cime_config/testdefs/testmods_dirs/eam/no_sgv/
echo "clubb_use_sgv = .false." >components/eam/cime_config/testdefs/testmods_dirs/eam/no_sgv/user_nl_eam
cd cime/scripts; create_test SMS_P12x2.ne4_oQU240.WCYCL1850NS.pm-cpu_intel.eam-no_sgv
```
Only that case, and only on that machine/compiler?
@rljacob Only that case that I've found so far. Let me check on machine/compiler. I doubt it's machine-specific, but we'll see.
@rljacob Reproduced this with `SMS.ne4_oQU240.WCYCL1850NS.compy_intel`, so neither the machine nor the PE layout matters.
@quantheory This is interesting. I was able to run the model for a month with my own similar set of bug fixes. Were you able to replicate the crash in a normal run of the model?
@mabrunke I haven't tried a longer run yet, and like I said, this is the only test case I've had fail so far. Furthermore, this test was not failing two weeks ago. It's only when I rebased to include changes made since the beginning of August that the failures started, which suggests that maybe something in the v3 atm features is involved?
@rljacob The test passes with GNU compiler on perlmutter, so maybe this is compiler-specific. Extended the length to a 30 day run and it still didn't manage to crash.
Try adding the debugging flag. SMS_D_P12x2.....
@rljacob

> Try adding the debugging flag. SMS_D_P12x2.....
That lets the GNU test run, but also lets the Intel test run on perlmutter. (But earlier Intel was still failing with DEBUG on perlmutter. The two things that have changed since then are that I'm running with #5876 merged, and turning off gustiness in a slightly different way.)
I think I need to be more systematic about this and make a table of different runs with more precisely controlled differences. But right now I can tell you this:
- The failure occurs within the first few time steps (<12 hours) about half the time, but is otherwise often near the end of the 5 day test. This is the single most annoying part of the bug, because it implies that a successful 5 day run might have crashed if it had been run for 6 days. I've been sprinkling in `_Ld10` and `_Ld30` tests to try to deal with this, though the few times I've tried that, they have always passed when the 5 day test passed.
- I have no failures with other compsets (e.g. F compsets).
- I have no failures from branches that started from master 2 weeks ago (though I did not test this compset much before then). I discovered this issue after rebasing onto current master. It seems likely that one of the answer-changing PRs merged since August 1 caused the issue. I have not yet bisected to try to find the change responsible.
- I don't yet have a failure with a compiler other than Intel.
- I do have failures both with and without DEBUG on Intel.
- I don't yet have a failure with the original settings on master (on any compiler, machine, PE layout, run length, or DEBUG setting).
- But every change I've made that reduces gustiness (including multiple methods of turning it off) has resulted in a crash in at least some configuration.
- Some tests have been repeated and fail at the exact same time step with similar results both times, which argues against a truly "stochastic" cause like a race condition.
- The cause of failure in DEBUG runs is always that surface longwave emissions become negative, and then cause an error when trying to take the fourth root to diagnose a temperature. So far this appears to occur exclusively in grid cells with urban land. However, this mysteriously does not cause a crash on master with default gustiness settings.
- For non-DEBUG runs, the crash occurs in a `shr_reprosum_mod` call in the EAM dycore, and is due to a large number of NaNs in some field. This is consistent with the possibility that the aforementioned longwave radiation issue is poisoning the state sent to the atmosphere, though not absolute proof.
- I do have failures both with and without #5876 merged; I doubt that it matters.
Addendum: Since the GNU and Intel DEBUG tests both passed in 5 day runs on Perlmutter, I ran both tests again with `_Ld10`. The GNU test passed again, while the Intel test now failed on day 4. This undermines what I said before; maybe there is a stochastic error like a race condition involved here.
Just confirming I see the error with `SMS_Ld10.ne4_oQU240.WCYCL1850NS.pm-cpu_intel.allactive-mach_mods` using recent master (and pasting the error message in case it's searched on). Though `SMS_Ld10_P12x2.ne4_oQU240.WCYCL1850NS.pm-cpu_intel.allactive-mach_mods` did complete.
```
1: SHR_REPROSUM_CALC: Input contains 0.24600E+03 NaNs and 0.00000E+00 INFs on MPI task 1
1: ERROR: shr_reprosum_calc ERROR: NaNs or INFs in input
1: Image PC Routine Line Source
1: e3sm.exe 0000000003A6715D shr_abort_mod_mp_ 114 shr_abort_mod.F90
1: e3sm.exe 0000000003B8A538 shr_reprosum_mod_ 644 shr_reprosum_mod.F90
1: e3sm.exe 0000000001B2002F compose_repro_sum 436 compose_mod.F90
1: e3sm.exe 00000000016440C8 operator() 370 compose_cedr.cpp
1: e3sm.exe 0000000001646ECF run_horiz_omp 21 compose_cedr_caas.cpp
1: e3sm.exe 0000000001655E20 run_global<Kokkos 48 compose_cedr_sl_run_global.cpp
1: e3sm.exe 00000000015FB14E sl_advection_mp_p 258 sl_advection.F90
1: e3sm.exe 00000000015D1AC5 prim_driver_base_ 1428 prim_driver_base.F90
1: e3sm.exe 0000000001B1046A dyn_comp_mp_dyn_r 401 dyn_comp.F90
1: e3sm.exe 000000000152307E stepon_mp_stepon_ 582 stepon.F90
1: e3sm.exe 000000000053CD94 cam_comp_mp_cam_r 352 cam_comp.F90
1: e3sm.exe 000000000052C582 atm_comp_mct_mp_a 583 atm_comp_mct.F90
1: e3sm.exe 0000000000446D7E component_mod_mp_ 757 component_mod.F90
1: e3sm.exe 0000000000426D34 cime_comp_mod_mp_ 3112 cime_comp_mod.F90
1: e3sm.exe 0000000000446A12 MAIN__ 153 cime_driver.F90
```
/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/me24-aug15/SMS_Ld10.ne4_oQU240.WCYCL1850NS.pm-cpu_intel.allactive-mach_mods.rgh5884
The above test was using 128x1. These tests also fail in what looks like the same way:
SMS_P12x1_Ld10.ne4_oQU240.WCYCL1850NS.pm-cpu_intel.allactive-mach_mods
SMS_P64x1_Ld10.ne4_oQU240.WCYCL1850NS.pm-cpu_intel.allactive-mach_mods
SMS_P128x1_Ld10.ne4_oQU240.WCYCL1850NS.pm-cpu_intel
SMS_P64x1_Ld10.ne4_oQU240.WCYCL1850NS.pm-cpu_intel
And these pass:
SMS_D_P64x2_Ld5.ne4_oQU240.WCYCL1850NS.pm-cpu_intel.allactive-mach_mods
SMS_P64x2_Ld10.ne4_oQU240.WCYCL1850NS.pm-cpu_intel
SMS_P64x2_Ld10.ne4_oQU240.WCYCL1850NS.pm-cpu_intel.allactive-mach_mods
@ndkeen Is this just using master out of the box, or have you made any change to gustiness or other settings?
Master of Aug15, no changes
@ndkeen Thanks! This is a valuable hint, since it looks like the same bug, and this is the first case where anyone has seen it on master without any physics changes at all. (@jli628 and @wlin7 are helping me to debug this.)
Wait, why do you think it's the same bug? Reprosum complaining about a NaN only means a NaN was produced somewhere, not that it was produced by the same bug.
@rljacob Mainly because all the DEBUG cases where I've encountered the issue fail at the same line in ELM (in the lnd2atm module). And I find it suspicious that it's this one particular test case that keeps failing. But you're right, it could be that some of the non-DEBUG runs are failing differently. I just find that less likely due to Occam's razor.
But the error message Noel posted doesn't show this coming from ELM. And it's not threaded. The only things in common are the resolution and the case.
It's true I actually don't know how Sean's job failed, but I thought I would just try a few things and was assuming the fail I hit would be related -- it could easily be something else. Note I added a few more fails/passes in my above comment.
Does not seem to matter if `.allactive-mach_mods` is present.
To narrow further, are there easy things to try instead of `ne4_oQU240.WCYCL1850NS`?
I'm now having trouble getting any run to fail with DEBUG enabled. (I guess the optimization changes in DEBUG have enough of an effect?) So I went into here: https://github.com/E3SM-Project/E3SM/blob/8d81d0b1ace84190545428cb197a116d60356c7c/components/elm/src/main/lnd2atmMod.F90#L120-L122

Line 121 there is where some DEBUG runs have crashed with the gustiness changes, due to a negative value of `eflx_lwrad_out_grc(g)` inside a `sqrt` call.
So, I added these lines just before line 121:

```fortran
if (eflx_lwrad_out_grc(g) < 0._r8) then
   print *, "At g = ", g, ", eflx_lwrad_out_grc = ", eflx_lwrad_out_grc(g)
   call endrun("bad eflx_lwrad_out_grc value")
end if
```
And sure enough, the test `SMS_P128x1_Ld10.ne4_oQU240.WCYCL1850NS.pm-cpu_intel.allactive-mach_mods` now crashes with:

```
43: At g = 153 , eflx_lwrad_out_grc = -3.79971951551529
43: ENDRUN:bad eflx_lwrad_out_grc value
43: ERROR: Unknown error submitted to shr_abort_abort.
```
So the error on master with no threading does seem to be the same as the error with the gustiness mods with threading. Or at least, it generates NaN in the same line of code.
Update of testing using the branch (https://github.com/quantheory/E3SM/tree/quantheory/gustiness-fixes-for-v3) for the gustiness PR #5850
- Threading is nbfb (non-bit-for-bit) for the coupled test. The limited steps completed by `SMS_P12x2.ne4_oQU240.WCYCL1850NS.pm-cpu_intel` and `SMS_D_P24x1.ne4_oQU240.WCYCL1850NS.pm-cpu_intel` gave different results from step 2. Note that `SMS_D_P36x1` results are bfb with `SMS_D_P24x1`.
- `PET_D_Ld1_P640x2.ne30pg2_EC30to60E2r2.WCYCL1850.pm-cpu_gnu` also failed the threading test, though it can run stably.
- The threading issue does not exist with F cases (active atm and lnd, and mpassi prescribed sea ice mode), e.g. `PET_Ld1_P12x2.ne4_oQU240.F2010.pm-cpu_intel`. Also see further notes below on DEBUG mode.
- `SMS_D_P12x2.ne4_oQU240.WCYCL1850NS.pm-cpu_intel` first reported a fatal `floating invalid` in the lnd comp, while `SMS_D_P24x1.ne4_oQU240.WCYCL1850NS.pm-cpu_gnu` failed with `NaN produced in physics_state by package cam_radheat`. Being more familiar with atm debugging, I focused on pm-cpu_gnu for further testing.
- One step before crashing, during step 3, chunk 117, `state%t(ncol=12,72)` first saw a cold T of 188.48 K at the bottom level. T at the level above was normal at 254 K. The grid cell is at (58.16N, 241E), `lndfrac=1.0`. The cold temperature was produced within macro-micro substepping, actually all accumulated from macrop (clubb) tendencies.
- Upon entering macrop (clubb) substepping, anomalous values were seen in the surface data obtained from the coupler: `cam_in%ts=240.2` and `cam_in%shf=-83.18`. During the previous steps, cam_in%shf was ~ -22 and cam_in%ts ~ 258, about 1 degree warmer than the bottom-level air temperature. (It is odd that shf to atm is negative when the surface is warmer.) The negative shf brought the bottom air temperature down from 253.4 K (before entering macmic substepping) to 188.46 K after completing 6 steps of macrop (clubb) subcycling.
- One step later, cam_in%ts became NaN, which was fed to the clubb update, leading to NaN values in state%t at all levels of the column. The run would then proceed to report a fatal error due to NaN values produced by cam_radheat.
- The error source apparently is not directly cam_radheat; the anomalies were triggered at least one physics (atm/lnd coupling) step earlier. How the land processes returned a sudden drop in ts may hold the clue (a sanity-check sketch follows this list).
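As a debugging aid (the sanity-check sketch referenced in the last bullet), something along the lines below could flag the anomalous coupler-provided surface values one coupling step before the NaNs appear. This is a hypothetical, standalone sketch, not actual EAM code; the derived type, thresholds, and sample data are invented for illustration, with field names taken from the discussion above.

```fortran
program surface_input_check_sketch
  ! Hypothetical check: flag a NaN in the surface temperature from the
  ! coupler, or a sudden jump in it relative to the previous coupling step.
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)

  type cam_in_sketch
     real(r8), allocatable :: ts(:)    ! surface temperature from coupler [K]
     real(r8), allocatable :: shf(:)   ! sensible heat flux from coupler [W/m2]
  end type cam_in_sketch

  type(cam_in_sketch) :: cam_in
  real(r8), allocatable :: ts_prev(:)
  integer :: i, ncol

  ncol = 2
  allocate(cam_in%ts(ncol), cam_in%shf(ncol), ts_prev(ncol))
  ts_prev    = [257.9_r8, 258.2_r8]    ! values from the previous coupling step
  cam_in%ts  = [258.0_r8, 240.2_r8]    ! column 2 mimics the reported sudden drop
  cam_in%shf = [-22.0_r8, -83.2_r8]

  do i = 1, ncol
     ! x /= x is true only for a NaN; also flag a >10 K jump between steps
     ! (the 10 K threshold is illustrative only).
     if (cam_in%ts(i) /= cam_in%ts(i) .or. &
         abs(cam_in%ts(i) - ts_prev(i)) > 10.0_r8) then
        print *, 'Suspicious surface input at column', i, &
                 ': ts =', cam_in%ts(i), ' shf =', cam_in%shf(i)
     end if
  end do
end program surface_input_check_sketch
```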
Note: The same cause could be responsible for #5955, and particularly #5957. Those tests use master branch without the new gustiness codes.
Further notes: the threading non-bfb behavior appears to exist only with DEBUG=TRUE. For example, `PET_Ld1_P640x2.ne30pg2_EC30to60E2r2.WCYCL1850.pm-cpu_intel` passes the threading comparison, while `PET_D_Ld1_P640x2.ne30pg2_EC30to60E2r2.WCYCL1850.pm-cpu_intel` fails. The same holds for pm-cpu_gnu. `PET_D_Ld1_P12x2.ne4_oQU240.F2010.pm-cpu_intel` also failed the threading comparison, unlike the non-DEBUG PET F2010 test.
@wlin7 This is very interesting. It would be interesting to know some of the `cam_out` values produced immediately before `cam_in%shf` starts to become very negative. In particular, `ubot`, `vbot`, `tbot`, `qbot`, and `ugust`. Is this something you can readily provide for the test case you mention above?

If the SHF seems inconsistent with the temperatures, this could mean that the energy balance iteration in the land code is failing to converge. I could try increasing the iteration count for all of those, or specifically the ones over land, and see if that avoids the crashes.
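For reference, the iteration in question solves a surface energy balance of roughly this form for the surface temperature $T_s$ (a schematic, not the exact ELM formulation):

$$(1-\alpha)\,S^{\downarrow} + L^{\downarrow} - \epsilon\sigma T_s^4 = H(T_s) + \lambda E(T_s) + G(T_s)$$

where $H$ is the sensible heat flux, $\lambda E$ the latent heat flux, and $G$ the ground heat flux. If the iteration stops before converging, the fluxes handed back to the atmosphere (sensible heat and emitted longwave in particular) can be mutually inconsistent with the diagnosed surface temperature.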
It could just be an atm initial data problem. With a new atm IC remapped from ne30, the test can run without problem (file below on NERSC): /global/cfs/cdirs/e3sm/inputdata/atm/cam/inic/homme/NGD_v3atm.ne30pg2_mapped_to_ne4np4.eam.i.0001-01-01-00000.c20230106.nc

The failure of other small-grid tests on cdash, such as `ERS.ne11_oQU240.WCYCL1850NS.pm-cpu_intel`, could be due to the same reason. To be tested with a new IC for ne11 as well.

More to follow.
> It would be interesting to know some of the `cam_out` values produced immediately before `cam_in%shf` starts to become very negative. In particular, `ubot`, `vbot`, `tbot`, `qbot`, and `ugust`. Is this something you can readily provide for the test case you mention above?
Good point, @quantheory. I did print those in cam_out every step towards the end of tphysbc. This may become irrelevant now that a new IC can get the model to run. For the record, the values do not look suspicious at step 2 (before cam_in%ts drops and the large negative cam_in%shf appears, both of which were seen at step 3). The first number on each line is nstep.
```
*** DEBUG post-cam_export *** psl/zbot/tbot: 2 101840.02055219682 11.240886502646099 254.90734439146630
*** DEBUG post-cam_export *** ubot/vbot/ugust: 2 -2.4379296890397854E-013 -7.4041353829012924E-012 2.2714749141729875
*** DEBUG post-cam_export *** thbot,qbot,pbot: 2 255.01695655969468 7.4604089547123688E-004 93844.214121831232
*** DEBUG post-cam_export *** netsw,flwds: 2 0.0000000000000000 90.808306799755101
*** DEBUG post-cam_export *** precc,precl: 2 0.0000000000000000 1.1684427779309500E-005
```
I notice that this issue is still open. Is anyone still investigating this, or should we close this, since updating the IC file seems to have fixed the issue?
Coincidentally, I just found out today that we can still trigger this issue on `maint-2.0` by both messing with the CLUBB time step and implementing gustiness changes. (Actually, it was @kchong75 who discovered this.) And it can happen well after initialization, so it's not just an IC issue.
I'm inclined to believe that this issue is due to some part of the land model that is just very close to being numerically unstable, rather than a straightforward bug, but if we find a way to make these crashes stop or become less likely, it may be worth making a PR...