E3SM
Invalid FP error after restart for ERS_D_P128x1_Ln5.ne30_oECv3.F2010
With master as of Oct 30, I see the error below. I ran several days of SMS with DEBUG without issue, but when I try to restart, the run fails with this error. I do not see the error with the Oct 4 repo.
I confirmed that the error first appears after this merge:
f6f7b13c62 2023-10-13 16:07:04 -0700 Azamat Mametjanov Merge branch 'azamat/machines/add-gnu-invalid-check' (PR #5808)
i.e., adding the check for invalid has actually caught an invalid FP operation. That would mean it should only happen with GNU and with DEBUG.
My assumption is that an array is being used before it is initialized, and that this only happens after a restart read.
13: Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
13:
13: Backtrace for this error:
13: #0 0x14635c83adbf in ???
13: #0 0x14635c83adbf in ???
13: #1 0x15072fa in __radheat_MOD_radheat_tend
13: at /global/cfs/cdirs/e3sm/ndk/repos/me32-oct30/components/eam/src/physics/cam/radheat.F90:116
13: #1 0x15072fa in __radheat_MOD_radheat_tend
13: at /global/cfs/cdirs/e3sm/ndk/repos/me32-oct30/components/eam/src/physics/cam/radheat.F90:116
13: #2 0xcdc51b in __radiation_MOD_radiation_tend
13: at /global/cfs/cdirs/e3sm/ndk/repos/me32-oct30/components/eam/src/physics/rrtmg/radiation.F90:1588
13: #2 0xcdc51b in __radiation_MOD_radiation_tend
13: at /global/cfs/cdirs/e3sm/ndk/repos/me32-oct30/components/eam/src/physics/rrtmg/radiation.F90:1588
13: #3 0x14a5240 in tphysbc
13: at /global/cfs/cdirs/e3sm/ndk/repos/me32-oct30/components/eam/src/physics/cam/physpkg.F90:3052
13: #3 0x14a5240 in tphysbc
13: at /global/cfs/cdirs/e3sm/ndk/repos/me32-oct30/components/eam/src/physics/cam/physpkg.F90:3052
13: #4 0x14cbd4d in __physpkg_MOD_phys_run1._omp_fn.0
13: at /global/cfs/cdirs/e3sm/ndk/repos/me32-oct30/components/eam/src/physics/cam/physpkg.F90:1175
13: #4 0x14cbd4d in __physpkg_MOD_phys_run1._omp_fn.0
13: at /global/cfs/cdirs/e3sm/ndk/repos/me32-oct30/components/eam/src/physics/cam/physpkg.F90:1175
13: #5 0x14635ce5b295 in GOMP_parallel
13: at ../../../cpe-gcc-11.2.0-202108140355.9bf1fd589a5c1/libgomp/parallel.c:178
13: #5 0x14635ce63a55 in gomp_thread_start
13: at ../../../cpe-gcc-11.2.0-202108140355.9bf1fd589a5c1/libgomp/team.c:125
13: #6 0x14635d0966e9 in ???
13: #7 0x14635c90849e in ???
13: #8 0xffffffffffffffff in ???
I see the same error without threads (128x1),
as well as with the Intel compiler:
3: forrtl: error (65): floating invalid
3: Image PC Routine Line Source
3: libpthread-2.31.s 000014B43EB19910 Unknown Unknown Unknown
3: e3sm.exe 00000000035FB444 radheat_mp_radhea 116 radheat.F90
3: e3sm.exe 000000000174B32C radiation_mp_radi 1587 radiation.F90
3: e3sm.exe 00000000035544CF physpkg_mp_tphysb 3059 physpkg.F90
3: e3sm.exe 000000000350DDCA physpkg_mp_phys_r 1168 physpkg.F90
3: libiomp5.so 000014B43E81CF13 __kmp_invoke_micr Unknown Unknown
3: libiomp5.so 000014B43E78BEF3 Unknown Unknown Unknown
3: libiomp5.so 000014B43E78D178 __kmp_fork_call Unknown Unknown
3: libiomp5.so 000014B43E745D23 __kmpc_fork_call Unknown Unknown
3: e3sm.exe 000000000350C170 physpkg_mp_phys_r 1153 physpkg.F90
3: e3sm.exe 000000000093F863 cam_comp_mp_cam_r 268 cam_comp.F90
3: e3sm.exe 00000000008FC7B9 atm_comp_mct_mp_a 425 atm_comp_mct.F90
3: e3sm.exe 00000000004A2705 component_mod_mp_ 257 component_mod.F90
3: e3sm.exe 000000000045EC59 cime_comp_mod_mp_ 2324 cime_comp_mod.F90
3: e3sm.exe 0000000000499686 MAIN__ 122 cime_driver.F90
I think the problem is in one of the arrays used at radheat.F90:116:
    do i = 1, ncol
       net_flx(i) = fsnt(i) - fsns(i) - flnt(i) + flns(i)
    end do
Adding print statements shows that all of these arrays appear to be bad, which suggests they may all be NaN after restart.
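To narrow down which inputs are bad without tripping the FP trap on the arithmetic itself, one option is to count NaN and non-finite entries in each array just before the loop. The following is only a minimal standalone sketch: the `report_bad` helper, the program name, and all array values are hypothetical, and only the array names are taken from the loop above.

```fortran
program nan_report_sketch
  ! Hypothetical standalone demo. The array names mirror the loop in
  ! radheat.F90, but the values and the report_bad helper are made up.
  use, intrinsic :: ieee_arithmetic, only: ieee_is_nan, ieee_is_finite, &
                                           ieee_value, ieee_quiet_nan
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  integer, parameter :: ncol = 4
  real(r8) :: fsnt(ncol), fsns(ncol), flnt(ncol), flns(ncol)

  ! Made-up flux values, just to have something to inspect.
  fsnt = 240.0_r8
  fsns = 160.0_r8
  flnt = 235.0_r8
  flns = 60.0_r8

  ! Inject one quiet NaN to stand in for a value that was never set.
  flns(2) = ieee_value(flns(2), ieee_quiet_nan)

  call report_bad('fsnt', fsnt)
  call report_bad('fsns', fsns)
  call report_bad('flnt', flnt)
  call report_bad('flns', flns)

contains

  ! Print how many entries of arr are NaN and how many are not finite
  ! (the non-finite count includes NaN as well as +/- infinity).
  subroutine report_bad(name, arr)
    character(len=*), intent(in) :: name
    real(r8), intent(in) :: arr(:)
    integer :: n_nan, n_nonfinite

    n_nan = count(ieee_is_nan(arr))
    n_nonfinite = count(.not. ieee_is_finite(arr))
    write(*,'(a,a,i0,a,i0,a,i0)') name, ': NaN=', n_nan, &
         ' non-finite=', n_nonfinite, ' of ', size(arr)
  end subroutine report_bad

end program nan_report_sketch
```

In the model, a check like this would presumably be called on `fsnt(:ncol)`, `fsns(:ncol)`, `flnt(:ncol)` and `flns(:ncol)` right before the `net_flx` loop, on the first step after the restart read. Whether the bad values are quiet NaNs, signaling NaNs, or something else is not established here, so the counts are only a starting point.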