Invalid FP error after restart for ERS_D_P128x1_Ln5.ne30_oECv3.F2010

Open ndkeen opened this issue 1 year ago • 2 comments

With master as of Oct 30, I see the error below. I ran several days of SMS with DEBUG without issue, but when I try to restart, I get the error. I do not see the error with the Oct 4 repo.

I confirmed this happens after: f6f7b13c62 2023-10-13 16:07:04 -0700 Azamat Mametjanov Merge branch 'azamat/machines/add-gnu-invalid-check' (PR #5808)

I.e., adding the check for invalid has ... caught an invalid FP, which means it should only happen with GNU and with DEBUG.

My assumption is that an array is used before it is initialized, and that this only happens after a restart read.

 13: Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
 13: 
 13: Backtrace for this error:
 13: #0  0x14635c83adbf in ???
 13: #0  0x14635c83adbf in ???
 13: #1  0x15072fa in __radheat_MOD_radheat_tend
 13:    at /global/cfs/cdirs/e3sm/ndk/repos/me32-oct30/components/eam/src/physics/cam/radheat.F90:116
 13: #1  0x15072fa in __radheat_MOD_radheat_tend
 13:    at /global/cfs/cdirs/e3sm/ndk/repos/me32-oct30/components/eam/src/physics/cam/radheat.F90:116
 13: #2  0xcdc51b in __radiation_MOD_radiation_tend
 13:    at /global/cfs/cdirs/e3sm/ndk/repos/me32-oct30/components/eam/src/physics/rrtmg/radiation.F90:1588
 13: #2  0xcdc51b in __radiation_MOD_radiation_tend
 13:    at /global/cfs/cdirs/e3sm/ndk/repos/me32-oct30/components/eam/src/physics/rrtmg/radiation.F90:1588
 13: #3  0x14a5240 in tphysbc
 13:    at /global/cfs/cdirs/e3sm/ndk/repos/me32-oct30/components/eam/src/physics/cam/physpkg.F90:3052
 13: #3  0x14a5240 in tphysbc
 13:    at /global/cfs/cdirs/e3sm/ndk/repos/me32-oct30/components/eam/src/physics/cam/physpkg.F90:3052
 13: #4  0x14cbd4d in __physpkg_MOD_phys_run1._omp_fn.0
 13:    at /global/cfs/cdirs/e3sm/ndk/repos/me32-oct30/components/eam/src/physics/cam/physpkg.F90:1175
 13: #4  0x14cbd4d in __physpkg_MOD_phys_run1._omp_fn.0
 13:    at /global/cfs/cdirs/e3sm/ndk/repos/me32-oct30/components/eam/src/physics/cam/physpkg.F90:1175
 13: #5  0x14635ce5b295 in GOMP_parallel
 13:    at ../../../cpe-gcc-11.2.0-202108140355.9bf1fd589a5c1/libgomp/parallel.c:178
 13: #5  0x14635ce63a55 in gomp_thread_start
 13:    at ../../../cpe-gcc-11.2.0-202108140355.9bf1fd589a5c1/libgomp/team.c:125
 13: #6  0x14635d0966e9 in ???
 13: #7  0x14635c90849e in ???
 13: #8  0xffffffffffffffff in ???

ndkeen, Oct 31 '23 20:10

I see the same error without threads (128x1), as well as with the Intel compiler:

  3: forrtl: error (65): floating invalid
  3: Image              PC                Routine            Line        Source
  3: libpthread-2.31.s  000014B43EB19910  Unknown               Unknown  Unknown
  3: e3sm.exe           00000000035FB444  radheat_mp_radhea         116  radheat.F90
  3: e3sm.exe           000000000174B32C  radiation_mp_radi        1587  radiation.F90
  3: e3sm.exe           00000000035544CF  physpkg_mp_tphysb        3059  physpkg.F90
  3: e3sm.exe           000000000350DDCA  physpkg_mp_phys_r        1168  physpkg.F90
  3: libiomp5.so        000014B43E81CF13  __kmp_invoke_micr     Unknown  Unknown
  3: libiomp5.so        000014B43E78BEF3  Unknown               Unknown  Unknown
  3: libiomp5.so        000014B43E78D178  __kmp_fork_call       Unknown  Unknown
  3: libiomp5.so        000014B43E745D23  __kmpc_fork_call      Unknown  Unknown
  3: e3sm.exe           000000000350C170  physpkg_mp_phys_r        1153  physpkg.F90
  3: e3sm.exe           000000000093F863  cam_comp_mp_cam_r         268  cam_comp.F90
  3: e3sm.exe           00000000008FC7B9  atm_comp_mct_mp_a         425  atm_comp_mct.F90
  3: e3sm.exe           00000000004A2705  component_mod_mp_         257  component_mod.F90
  3: e3sm.exe           000000000045EC59  cime_comp_mod_mp_        2324  cime_comp_mod.F90
  3: e3sm.exe           0000000000499686  MAIN__                    122  cime_driver.F90

ndkeen, Nov 08 '23 00:11

I think it's one of these arrays:

   do i = 1, ncol
      net_flx(i) = fsnt(i) - fsns(i) - flnt(i) + flns(i)
   end do

Adding print statements shows that all of these arrays appear to fail, suggesting they may all be NaN after restart.

ndkeen, Nov 08 '23 00:11
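
To narrow down which of the four fluxes is already NaN when the `net_flx` loop is reached after a restart, a check along the following lines could be called on each array just before the loop. This is only a sketch under stated assumptions: the module name `nan_check_sketch` and the function `report_nan` are made up for illustration and are not part of E3SM, and the `r8` kind is defined locally rather than taken from the model. It relies only on the standard `ieee_arithmetic` intrinsic module, and it tests for NaN without doing arithmetic on the values.

```fortran
! Hypothetical helper, not part of E3SM: counts NaN entries in a column array.
module nan_check_sketch
   use, intrinsic :: ieee_arithmetic, only: ieee_is_nan
   implicit none
   private
   public :: r8, report_nan

   ! Double-precision kind, defined locally so the sketch is self-contained.
   integer, parameter :: r8 = selected_real_kind(12)

contains

   ! Return the number of NaN values in arr(1:ncol); print a line if any are found.
   function report_nan(name, arr, ncol) result(nbad)
      character(len=*), intent(in) :: name     ! label used in the message
      real(r8),         intent(in) :: arr(:)   ! array to inspect
      integer,          intent(in) :: ncol     ! number of active columns
      integer :: nbad

      nbad = count(ieee_is_nan(arr(1:ncol)))
      if (nbad > 0) then
         write(*,'(a,a,i0,a,i0,a)') trim(name), ': ', nbad, ' of ', ncol, ' values are NaN'
      end if
   end function report_nan

end module nan_check_sketch
```

Calling it as `nbad = report_nan('fsnt', fsnt, ncol)`, and likewise for `fsns`, `flnt` and `flns`, would show whether every flux is bad or only some of them, which may help identify which fields are not being restored from the restart file.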