GFDL_atmos_cubed_sphere
sporadic floating point errors in a2b_edge.F90 for regional configurations
Describe the bug
Regional configurations of UFS FV3 abort sporadically with a floating-point exception in subroutine a2b_ord2 in FV3/atmos_cubed_sphere/model/a2b_edge.F90 when compiled in debug mode. The crash is here:
  if (gridstruct%grid_type < 3) then
    if (gridstruct%bounded_domain) then
      do j=js-2,je+1+2
        do i=is-2,ie+1+2
          qout(i,j) = 0.25*(qin(i-1,j-1)+qin(i,j-1)+qin(i-1,j)+qin(i,j)) ! <------- crashes here
        enddo
      enddo
    else
Full stack trace
112:
112: WARNING from PE 112: atmos_modeldefine_blocks_packed: domain ( 33 19) is not an even divisor with definition ( 32) - blocks will not be uniform with a remainder of 19
112:
112: [h11c41:455655:0:455655] Caught signal 8 (Floating point exception: floating-point invalid operation)
112: ==== backtrace (tid: 455655) ====
112: 0 0x00000000000534e9 ucs_debug_print_backtrace() ???:0
112: 1 0x0000000000012cf0 __funlockfile() :0
112: 2 0x0000000004ba5714 a2b_edge_mod_mp_a2b_ord2_() /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/atmos_cubed_sphere/model/a2b_edge.F90:382
112: 3 0x0000000002bccce6 L_dyn_core_mod_mp_adv_pe__1630__par_loop0_2_108() /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/atmos_cubed_sphere/model/dyn_core.F90:1665
112: 4 0x000000000013fbb3 __kmp_invoke_microtask() ???:0
112: 5 0x00000000000bbfac __kmp_fork_call() /nfs/site/proj/openmp/promo/20211013/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxilab153/../../src/kmp_runtime.cpp:2111
112: 6 0x000000000007dcb5 __kmpc_fork_call() /nfs/site/proj/openmp/promo/20211013/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxilab153/../../src/kmp_csupport.cpp:358
112: 7 0x0000000002bc674f dyn_core_mod_mp_adv_pe_() /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/atmos_cubed_sphere/model/dyn_core.F90:1630
112: 8 0x0000000002b689ea dyn_core_mod_mp_dyn_core_() /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/atmos_cubed_sphere/model/dyn_core.F90:1280
112: 9 0x0000000002ce48d4 fv_dynamics_mod_mp_fv_dynamics_() /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/atmos_cubed_sphere/model/fv_dynamics.F90:683
112: 10 0x00000000028bd928 atmosphere_mod_mp_atmosphere_dynamics_() /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/atmos_cubed_sphere/driver/fvGFS/atmosphere.F90:683
112: 11 0x00000000020b079c atmos_model_mod_mp_update_atmos_model_dynamics_() /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/atmos_model.F90:880
112: 12 0x0000000001b4014c module_fcst_grid_comp_mp_fcst_run_phase_1_() /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/module_fcst_grid_comp.F90:1330
112: 13 0x0000000000aa2644 ESMCI::FTable::callVFuncPtr() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2167
112: 14 0x0000000000aa61ef ESMCI_FTableCallEntryPointVMHop() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:824
112: 15 0x000000000094dbea ESMCI::VMK::enter() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Infrastructure/VM/src/ESMCI_VMKernel.C:1247
112: 16 0x000000000121eeaf ESMCI::VM::enter() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:1216
112: 17 0x0000000000aa3a8a c_esmc_ftablecallentrypointvm_() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:981
112: 18 0x0000000000970d50 esmf_compmod_mp_esmf_compexecute_() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMF_Comp.F90:1252
112: 19 0x0000000000ca5351 esmf_gridcompmod_mp_esmf_gridcomprun_() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMF_GridComp.F90:1903
112: 20 0x0000000001b0b54e fv3atm_cap_mod_mp_modeladvance_phase1_() /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/fv3_cap.F90:1077
112: 21 0x0000000001b0a615 fv3atm_cap_mod_mp_modeladvance_() /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/fv3_cap.F90:1026
112: 22 0x00000000006aba58 ESMCI::MethodElement::execute() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:377
112: 23 0x00000000006ab9ba ESMCI::MethodTable::execute() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:563
112: 24 0x00000000006aa582 c_esmc_methodtableexecute_() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:317
112: 25 0x000000000047c492 esmf_attachmethodsmod_mp_esmf_methodgridcompexecute_() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/AttachMethods/src/ESMF_AttachMethods.F90:1287
112: 26 0x0000000004e0e71d nuopc_modelbase_mp_routine_run_() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/addon/NUOPC/src/NUOPC_ModelBase.F90:2212
112: 27 0x0000000000aa2644 ESMCI::FTable::callVFuncPtr() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2167
112: 28 0x0000000000aa61ef ESMCI_FTableCallEntryPointVMHop() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:824
112: 29 0x000000000094d9da ESMCI::VMK::enter() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Infrastructure/VM/src/ESMCI_VMKernel.C:2501
112: 30 0x000000000121eeaf ESMCI::VM::enter() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:1216
112: 31 0x0000000000aa3a8a c_esmc_ftablecallentrypointvm_() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:981
112: 32 0x0000000000970d50 esmf_compmod_mp_esmf_compexecute_() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMF_Comp.F90:1252
112: 33 0x0000000000ca5351 esmf_gridcompmod_mp_esmf_gridcomprun_() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMF_GridComp.F90:1903
112: 34 0x00000000008d1317 nuopc_driver_mp_routine_executegridcomp_() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:3694
112: 35 0x00000000008d0b6a nuopc_driver_mp_executerunsequence_() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:3940
112: 36 0x00000000006aba58 ESMCI::MethodElement::execute() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:377
112: 37 0x00000000006ab9ba ESMCI::MethodTable::execute() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:563
112: 38 0x00000000006aa582 c_esmc_methodtableexecute_() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:317
112: 39 0x000000000047c492 esmf_attachmethodsmod_mp_esmf_methodgridcompexecute_() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/AttachMethods/src/ESMF_AttachMethods.F90:1287
112: 40 0x00000000008cdbb2 nuopc_driver_mp_routine_run_() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:3615
112: 41 0x0000000000aa2644 ESMCI::FTable::callVFuncPtr() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2167
112: 42 0x0000000000aa61ef ESMCI_FTableCallEntryPointVMHop() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:824
112: 43 0x000000000094d9da ESMCI::VMK::enter() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Infrastructure/VM/src/ESMCI_VMKernel.C:2501
112: 44 0x000000000121eeaf ESMCI::VM::enter() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:1216
112: 45 0x0000000000aa3a8a c_esmc_ftablecallentrypointvm_() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:981
112: 46 0x0000000000970d50 esmf_compmod_mp_esmf_compexecute_() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMF_Comp.F90:1252
112: 47 0x0000000000ca5351 esmf_gridcompmod_mp_esmf_gridcomprun_() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMF_GridComp.F90:1903
112: 48 0x000000000042fae6 MAIN__() /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/driver/UFS.F90:406
112: 49 0x000000000042bfa2 main() ???:0
112: 50 0x000000000003ad85 __libc_start_main() ???:0
112: 51 0x000000000042beae _start() ???:0
112: =================================
112: forrtl: error (75): floating point exception
112: Image PC Routine Line Source
112: fv3.exe 000000000C1EE34B Unknown Unknown Unknown
112: libpthread-2.28.s 0000150AC4D0BCF0 Unknown Unknown Unknown
112: fv3.exe 0000000004BA5714 a2b_edge_mod_mp_a 382 a2b_edge.F90
112: fv3.exe 0000000002BCCCE6 dyn_core_mod_mp_a 1665 dyn_core.F90
112: libiomp5.so 0000150AC7D74BB3 __kmp_invoke_micr Unknown Unknown
112: libiomp5.so 0000150AC7CF0FAC __kmp_fork_call Unknown Unknown
112: libiomp5.so 0000150AC7CB2CB5 __kmpc_fork_call Unknown Unknown
112: fv3.exe 0000000002BC674F dyn_core_mod_mp_a 1630 dyn_core.F90
112: fv3.exe 0000000002B689EA dyn_core_mod_mp_d 1280 dyn_core.F90
112: fv3.exe 0000000002CE48D4 fv_dynamics_mod_m 683 fv_dynamics.F90
112: fv3.exe 00000000028BD928 atmosphere_mod_mp 683 atmosphere.F90
112: fv3.exe 00000000020B079C atmos_model_mod_m 880 atmos_model.F90
112: fv3.exe 0000000001B4014C module_fcst_grid_ 1330 module_fcst_grid_comp.F90
112: fv3.exe 0000000000AA2644 Unknown Unknown Unknown
112: fv3.exe 0000000000AA61EF Unknown Unknown Unknown
112: fv3.exe 000000000094DBEA Unknown Unknown Unknown
112: fv3.exe 000000000121EEAF Unknown Unknown Unknown
112: fv3.exe 0000000000AA3A8A Unknown Unknown Unknown
112: fv3.exe 0000000000970D50 Unknown Unknown Unknown
112: fv3.exe 0000000000CA5351 Unknown Unknown Unknown
112: fv3.exe 0000000001B0B54E fv3atm_cap_mod_mp 1077 fv3_cap.F90
112: fv3.exe 0000000001B0A615 fv3atm_cap_mod_mp 1026 fv3_cap.F90
112: fv3.exe 00000000006ABA58 Unknown Unknown Unknown
112: fv3.exe 00000000006AB9BA Unknown Unknown Unknown
112: fv3.exe 00000000006AA582 Unknown Unknown Unknown
112: fv3.exe 000000000047C492 Unknown Unknown Unknown
112: fv3.exe 0000000004E0E71D Unknown Unknown Unknown
112: fv3.exe 0000000000AA2644 Unknown Unknown Unknown
112: fv3.exe 0000000000AA61EF Unknown Unknown Unknown
112: fv3.exe 000000000094D9DA Unknown Unknown Unknown
112: fv3.exe 000000000121EEAF Unknown Unknown Unknown
112: fv3.exe 0000000000AA3A8A Unknown Unknown Unknown
112: fv3.exe 0000000000970D50 Unknown Unknown Unknown
112: fv3.exe 0000000000CA5351 Unknown Unknown Unknown
112: fv3.exe 00000000008D1317 Unknown Unknown Unknown
112: fv3.exe 00000000008D0B6A Unknown Unknown Unknown
112: fv3.exe 00000000006ABA58 Unknown Unknown Unknown
112: fv3.exe 00000000006AB9BA Unknown Unknown Unknown
112: fv3.exe 00000000006AA582 Unknown Unknown Unknown
112: fv3.exe 000000000047C492 Unknown Unknown Unknown
112: fv3.exe 00000000008CDBB2 Unknown Unknown Unknown
112: fv3.exe 0000000000AA2644 Unknown Unknown Unknown
112: fv3.exe 0000000000AA61EF Unknown Unknown Unknown
112: fv3.exe 000000000094D9DA Unknown Unknown Unknown
112: fv3.exe 000000000121EEAF Unknown Unknown Unknown
112: fv3.exe 0000000000AA3A8A Unknown Unknown Unknown
112: fv3.exe 0000000000970D50 Unknown Unknown Unknown
112: fv3.exe 0000000000CA5351 Unknown Unknown Unknown
112: fv3.exe 000000000042FAE6 MAIN__ 406 UFS.F90
112: fv3.exe 000000000042BFA2 Unknown Unknown Unknown
112: libc-2.28.so 0000150AC4756D85 __libc_start_main Unknown Unknown
112: fv3.exe 000000000042BEAE Unknown Unknown Unknown
The crash is a floating-point exception, reported as an invalid operation. That line contains only additions and a multiplication, so the exception most likely comes from a NaN operand in qin. The NaN could come from uninitialized memory, or from boundary values that were never filled (those are initialized with signalling NaN).
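To narrow down which operand is at fault, one could check the four inputs of the stencil just before the failing line. The sketch below is a plain-Python model of that averaging stencil, not the FV3 code: the function name `find_nan_stencil_inputs` and the list-of-lists layout (`qin[j][i]`) are assumptions made for illustration. It reports every output point whose four-cell average would involve a NaN, along with the offending input cells.

```python
import math

def find_nan_stencil_inputs(qin, i_range, j_range):
    """Return {(i, j): [(ii, jj), ...]} for every output point whose
    2x2 input stencil contains a NaN.

    qin is indexed as qin[j][i]; this layout is hypothetical and chosen
    only for the illustration.
    """
    bad = {}
    for j in j_range:
        for i in i_range:
            # Same four cells as the failing line:
            # (i-1,j-1), (i,j-1), (i-1,j), (i,j)
            cells = [(i - 1, j - 1), (i, j - 1), (i - 1, j), (i, j)]
            nan_cells = [(ii, jj) for ii, jj in cells if math.isnan(qin[jj][ii])]
            if nan_cells:
                bad[(i, j)] = nan_cells
    return bad

# Small example: a 4x4 field with one unfilled (NaN) cell at i=2, j=1.
qin = [[1.0] * 4 for _ in range(4)]
qin[1][2] = float("nan")
report = find_nan_stencil_inputs(qin, range(1, 4), range(1, 4))
for point, cells in sorted(report.items()):
    print(point, "reads NaN from", cells)
```

A single bad input cell contaminates four output points, so the location of the first trap alone may not identify the unfilled cell. In the Fortran code itself, a comparable check could probably be written with `ieee_is_nan` from the intrinsic `ieee_arithmetic` module; that has not been tried here.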
Crashes seem to start after #344 was merged. If so, that PR shouldn't have been merged; the regression test system should've detected this problem. Unfortunately, the ufs-weather-model regression test system currently cannot distinguish a crash from a change in a test's results. A fix for the regression test system bug is being tested now.
Unfortunately, we're stuck with broken authoritative branches until this bug is fixed.
From skimming the changes in #344, my best guess is that some parts of the omga array are uninitialized for regional cases due to removing the initialization loop. I haven't had a chance to test that hypothesis yet.
To Reproduce
- On Hera, configure the ufs-weather-model regression test system so that it does not retry jobs and does not delete logs or run directories.
- Run all ufs-weather-model regression tests that have both "conus13km" and "debug" in their names.
- Check for floating point exceptions in failed tests before the regression test system deletes the logs.
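The last step can be scripted. The following is a minimal sketch, assuming the logs are plain text and that a crash leaves one of the two messages seen in the stack trace in this report (`Caught signal 8 (Floating point exception` or `forrtl: error (75)`). The function name `find_fpe_lines` is illustrative, and the first sample line is an abbreviated stand-in for a real warning.

```python
import re

# Messages taken from the stack trace in this report; other compilers or
# MPI stacks may word them differently.
FPE_PATTERN = re.compile(
    r"Caught signal 8 \(Floating point exception|forrtl: error \(75\)"
)

def find_fpe_lines(log_lines):
    """Return (line_number, text) pairs for lines that report an FPE."""
    hits = []
    for number, line in enumerate(log_lines, start=1):
        if FPE_PATTERN.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits

sample_log = [
    "112: WARNING from PE 112: blocks will not be uniform\n",
    "112: [h11c41:455655:0:455655] Caught signal 8 "
    "(Floating point exception: floating-point invalid operation)\n",
    "112: forrtl: error (75): floating point exception\n",
]
hits = find_fpe_lines(sample_log)
for number, text in hits:
    print(number, text)
```

To scan a real log, pass an open file object instead of the sample list, for example `find_fpe_lines(open(path))` with `path` pointing at a test's output file.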
The fix for the regression test system is in this PR:
- https://github.com/ufs-community/ufs-weather-model/pull/2335
That is being tested now. Once it's merged, model crashes will be detectable in regression tests once again.
Expected behavior
The model runs to completion when compiled in debug mode.
System Environment
UFS Weather Model regression test system with the Intel compiler on Hera: Intel 2021.5.0 with IMPI 2021.5.1 and FMS 2023.04, using Spack Stack 1.6.0.
Here's the uname -a output from a login node:
Linux hfe09 4.18.0-477.27.1.el8_8.x86_64 #1 SMP Wed Sep 20 15:55:39 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Additional context
Can't think of anything.