GFDL_atmos_cubed_sphere

sporadic floating point errors in a2b_edge.F90 for regional configurations

Open SamuelTrahanNOAA opened this issue 7 months ago • 24 comments

Describe the bug

Regional configurations of UFS FV3 abort sporadically with a floating-point exception in subroutine a2b_ord2 in FV3/atmos_cubed_sphere/model/a2b_edge.F90 when compiled in debug mode. The crash is here:

    if (gridstruct%grid_type < 3) then

       if (gridstruct%bounded_domain) then

          do j=js-2,je+1+2   
             do i=is-2,ie+1+2
                qout(i,j) = 0.25*(qin(i-1,j-1)+qin(i,j-1)+qin(i-1,j)+qin(i,j)) ! <------- crashes here
             enddo
          enddo

       else

Full stack trace

112:
112: WARNING from PE   112: atmos_modeldefine_blocks_packed: domain (  33  19) is not an even divisor with definition (  32) - blocks will not be uniform with a remainder of   19
112:
112: [h11c41:455655:0:455655] Caught signal 8 (Floating point exception: floating-point invalid operation)
112: ==== backtrace (tid: 455655) ====
112:  0 0x00000000000534e9 ucs_debug_print_backtrace()  ???:0
112:  1 0x0000000000012cf0 __funlockfile()  :0
112:  2 0x0000000004ba5714 a2b_edge_mod_mp_a2b_ord2_()  /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/atmos_cubed_sphere/model/a2b_edge.F90:382
112:  3 0x0000000002bccce6 L_dyn_core_mod_mp_adv_pe__1630__par_loop0_2_108()  /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/atmos_cubed_sphere/model/dyn_core.F90:1665
112:  4 0x000000000013fbb3 __kmp_invoke_microtask()  ???:0
112:  5 0x00000000000bbfac __kmp_fork_call()  /nfs/site/proj/openmp/promo/20211013/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxilab153/../../src/kmp_runtime.cpp:2111
112:  6 0x000000000007dcb5 __kmpc_fork_call()  /nfs/site/proj/openmp/promo/20211013/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxilab153/../../src/kmp_csupport.cpp:358
112:  7 0x0000000002bc674f dyn_core_mod_mp_adv_pe_()  /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/atmos_cubed_sphere/model/dyn_core.F90:1630
112:  8 0x0000000002b689ea dyn_core_mod_mp_dyn_core_()  /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/atmos_cubed_sphere/model/dyn_core.F90:1280
112:  9 0x0000000002ce48d4 fv_dynamics_mod_mp_fv_dynamics_()  /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/atmos_cubed_sphere/model/fv_dynamics.F90:683
112: 10 0x00000000028bd928 atmosphere_mod_mp_atmosphere_dynamics_()  /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/atmos_cubed_sphere/driver/fvGFS/atmosphere.F90:683
112: 11 0x00000000020b079c atmos_model_mod_mp_update_atmos_model_dynamics_()  /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/atmos_model.F90:880
112: 12 0x0000000001b4014c module_fcst_grid_comp_mp_fcst_run_phase_1_()  /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/module_fcst_grid_comp.F90:1330
112: 13 0x0000000000aa2644 ESMCI::FTable::callVFuncPtr()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2167
112: 14 0x0000000000aa61ef ESMCI_FTableCallEntryPointVMHop()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:824
112: 15 0x000000000094dbea ESMCI::VMK::enter()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Infrastructure/VM/src/ESMCI_VMKernel.C:1247
112: 16 0x000000000121eeaf ESMCI::VM::enter()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:1216
112: 17 0x0000000000aa3a8a c_esmc_ftablecallentrypointvm_()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:981
112: 18 0x0000000000970d50 esmf_compmod_mp_esmf_compexecute_()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMF_Comp.F90:1252
112: 19 0x0000000000ca5351 esmf_gridcompmod_mp_esmf_gridcomprun_()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMF_GridComp.F90:1903
112: 20 0x0000000001b0b54e fv3atm_cap_mod_mp_modeladvance_phase1_()  /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/fv3_cap.F90:1077
112: 21 0x0000000001b0a615 fv3atm_cap_mod_mp_modeladvance_()  /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/fv3_cap.F90:1026
112: 22 0x00000000006aba58 ESMCI::MethodElement::execute()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:377
112: 23 0x00000000006ab9ba ESMCI::MethodTable::execute()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:563
112: 24 0x00000000006aa582 c_esmc_methodtableexecute_()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:317
112: 25 0x000000000047c492 esmf_attachmethodsmod_mp_esmf_methodgridcompexecute_()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/AttachMethods/src/ESMF_AttachMethods.F90:1287
112: 26 0x0000000004e0e71d nuopc_modelbase_mp_routine_run_()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/addon/NUOPC/src/NUOPC_ModelBase.F90:2212
112: 27 0x0000000000aa2644 ESMCI::FTable::callVFuncPtr()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2167
112: 28 0x0000000000aa61ef ESMCI_FTableCallEntryPointVMHop()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:824
112: 29 0x000000000094d9da ESMCI::VMK::enter()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Infrastructure/VM/src/ESMCI_VMKernel.C:2501
112: 30 0x000000000121eeaf ESMCI::VM::enter()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:1216
112: 31 0x0000000000aa3a8a c_esmc_ftablecallentrypointvm_()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:981
112: 32 0x0000000000970d50 esmf_compmod_mp_esmf_compexecute_()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMF_Comp.F90:1252
112: 33 0x0000000000ca5351 esmf_gridcompmod_mp_esmf_gridcomprun_()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMF_GridComp.F90:1903
112: 34 0x00000000008d1317 nuopc_driver_mp_routine_executegridcomp_()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:3694
112: 35 0x00000000008d0b6a nuopc_driver_mp_executerunsequence_()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:3940
112: 36 0x00000000006aba58 ESMCI::MethodElement::execute()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:377
112: 37 0x00000000006ab9ba ESMCI::MethodTable::execute()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:563
112: 38 0x00000000006aa582 c_esmc_methodtableexecute_()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:317
112: 39 0x000000000047c492 esmf_attachmethodsmod_mp_esmf_methodgridcompexecute_()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/AttachMethods/src/ESMF_AttachMethods.F90:1287
112: 40 0x00000000008cdbb2 nuopc_driver_mp_routine_run_()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:3615
112: 41 0x0000000000aa2644 ESMCI::FTable::callVFuncPtr()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2167
112: 42 0x0000000000aa61ef ESMCI_FTableCallEntryPointVMHop()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:824
112: 43 0x000000000094d9da ESMCI::VMK::enter()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Infrastructure/VM/src/ESMCI_VMKernel.C:2501
112: 44 0x000000000121eeaf ESMCI::VM::enter()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:1216
112: 45 0x0000000000aa3a8a c_esmc_ftablecallentrypointvm_()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:981
112: 46 0x0000000000970d50 esmf_compmod_mp_esmf_compexecute_()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMF_Comp.F90:1252
112: 47 0x0000000000ca5351 esmf_gridcompmod_mp_esmf_gridcomprun_()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMF_GridComp.F90:1903
112: 48 0x000000000042fae6 MAIN__()  /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/driver/UFS.F90:406
112: 49 0x000000000042bfa2 main()  ???:0
112: 50 0x000000000003ad85 __libc_start_main()  ???:0
112: 51 0x000000000042beae _start()  ???:0
112: =================================
112: forrtl: error (75): floating point exception
112: Image              PC                Routine            Line        Source
112: fv3.exe            000000000C1EE34B  Unknown               Unknown  Unknown
112: libpthread-2.28.s  0000150AC4D0BCF0  Unknown               Unknown  Unknown
112: fv3.exe            0000000004BA5714  a2b_edge_mod_mp_a         382  a2b_edge.F90
112: fv3.exe            0000000002BCCCE6  dyn_core_mod_mp_a        1665  dyn_core.F90
112: libiomp5.so        0000150AC7D74BB3  __kmp_invoke_micr     Unknown  Unknown
112: libiomp5.so        0000150AC7CF0FAC  __kmp_fork_call       Unknown  Unknown
112: libiomp5.so        0000150AC7CB2CB5  __kmpc_fork_call      Unknown  Unknown
112: fv3.exe            0000000002BC674F  dyn_core_mod_mp_a        1630  dyn_core.F90
112: fv3.exe            0000000002B689EA  dyn_core_mod_mp_d        1280  dyn_core.F90
112: fv3.exe            0000000002CE48D4  fv_dynamics_mod_m         683  fv_dynamics.F90
112: fv3.exe            00000000028BD928  atmosphere_mod_mp         683  atmosphere.F90
112: fv3.exe            00000000020B079C  atmos_model_mod_m         880  atmos_model.F90
112: fv3.exe            0000000001B4014C  module_fcst_grid_        1330  module_fcst_grid_comp.F90
112: fv3.exe            0000000000AA2644  Unknown               Unknown  Unknown
112: fv3.exe            0000000000AA61EF  Unknown               Unknown  Unknown
112: fv3.exe            000000000094DBEA  Unknown               Unknown  Unknown
112: fv3.exe            000000000121EEAF  Unknown               Unknown  Unknown
112: fv3.exe            0000000000AA3A8A  Unknown               Unknown  Unknown
112: fv3.exe            0000000000970D50  Unknown               Unknown  Unknown
112: fv3.exe            0000000000CA5351  Unknown               Unknown  Unknown
112: fv3.exe            0000000001B0B54E  fv3atm_cap_mod_mp        1077  fv3_cap.F90
112: fv3.exe            0000000001B0A615  fv3atm_cap_mod_mp        1026  fv3_cap.F90
112: fv3.exe            00000000006ABA58  Unknown               Unknown  Unknown
112: fv3.exe            00000000006AB9BA  Unknown               Unknown  Unknown
112: fv3.exe            00000000006AA582  Unknown               Unknown  Unknown
112: fv3.exe            000000000047C492  Unknown               Unknown  Unknown
112: fv3.exe            0000000004E0E71D  Unknown               Unknown  Unknown
112: fv3.exe            0000000000AA2644  Unknown               Unknown  Unknown
112: fv3.exe            0000000000AA61EF  Unknown               Unknown  Unknown
112: fv3.exe            000000000094D9DA  Unknown               Unknown  Unknown
112: fv3.exe            000000000121EEAF  Unknown               Unknown  Unknown
112: fv3.exe            0000000000AA3A8A  Unknown               Unknown  Unknown
112: fv3.exe            0000000000970D50  Unknown               Unknown  Unknown
112: fv3.exe            0000000000CA5351  Unknown               Unknown  Unknown
112: fv3.exe            00000000008D1317  Unknown               Unknown  Unknown
112: fv3.exe            00000000008D0B6A  Unknown               Unknown  Unknown
112: fv3.exe            00000000006ABA58  Unknown               Unknown  Unknown
112: fv3.exe            00000000006AB9BA  Unknown               Unknown  Unknown
112: fv3.exe            00000000006AA582  Unknown               Unknown  Unknown
112: fv3.exe            000000000047C492  Unknown               Unknown  Unknown
112: fv3.exe            00000000008CDBB2  Unknown               Unknown  Unknown
112: fv3.exe            0000000000AA2644  Unknown               Unknown  Unknown
112: fv3.exe            0000000000AA61EF  Unknown               Unknown  Unknown
112: fv3.exe            000000000094D9DA  Unknown               Unknown  Unknown
112: fv3.exe            000000000121EEAF  Unknown               Unknown  Unknown
112: fv3.exe            0000000000AA3A8A  Unknown               Unknown  Unknown
112: fv3.exe            0000000000970D50  Unknown               Unknown  Unknown
112: fv3.exe            0000000000CA5351  Unknown               Unknown  Unknown
112: fv3.exe            000000000042FAE6  MAIN__                    406  UFS.F90
112: fv3.exe            000000000042BFA2  Unknown               Unknown  Unknown
112: libc-2.28.so       0000150AC4756D85  __libc_start_main     Unknown  Unknown
112: fv3.exe            000000000042BEAE  Unknown               Unknown  Unknown

The crash is a floating-point exception, reported in the log as an invalid operation. The failing statement contains only additions and one multiplication, so the exception most likely comes from a NaN operand. That could be due to uninitialized memory, or to boundary conditions that were never filled (they are initialized with signalling NaNs).
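One way to probe the NaN hypothesis is to check which points of the four-point average pick up a non-finite input. The Python sketch below mirrors the stencil from the excerpt on a small list-of-lists grid. It is illustrative only: the helper names `four_point_average` and `contaminated_points`, the zero-based `qin[j][i]` layout, and the sample grid are assumptions, not part of FV3. Python floats also propagate NaN quietly rather than trapping as a debug-mode Fortran build does.

```python
import math

def four_point_average(qin, i, j):
    """Average the four cells around corner (i, j), as in the a2b_ord2 excerpt."""
    return 0.25 * (qin[j - 1][i - 1] + qin[j - 1][i] + qin[j][i - 1] + qin[j][i])

def contaminated_points(qin):
    """Return (i, j) corner indices whose four-point average is not finite."""
    ny, nx = len(qin), len(qin[0])
    bad = []
    for j in range(1, ny):
        for i in range(1, nx):
            if not math.isfinite(four_point_average(qin, i, j)):
                bad.append((i, j))
    return bad

# Hypothetical 4x4 field with one unfilled (NaN) cell at row 0, column 3,
# standing in for a boundary cell that was never initialized.
qin = [[1.0] * 4 for _ in range(4)]
qin[0][3] = math.nan

print(contaminated_points(qin))  # [(3, 1)]
```

A single unfilled cell poisons every output point whose stencil touches it, which would fit a crash that only appears on ranks owning part of the regional boundary.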

Crashes seem to start after #344 was merged. If so, that PR shouldn't have been merged; the regression test system should've detected this problem. Unfortunately, the ufs-weather-model regression test system is presently unable to detect the difference between a crash and a test's results changing. A fix for the regression test system bug is being tested now.

Unfortunately, we're stuck with broken authoritative branches until this bug is fixed.

From skimming the changes in #344, my best guess is that some parts of the omga array are uninitialized for regional cases due to removing the initialization loop. I haven't had a chance to test that hypothesis yet.

To Reproduce

  1. On Hera, configure the ufs-weather-model regression test system so that it does not retry jobs and does not delete logs or run directories.
  2. Run all ufs-weather-model regression tests that have both "conus13km" and "debug" in their name.
  3. Check for floating point exceptions in failed tests before the regression test system deletes the logs.
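Step 3 can be partly automated. This is a minimal sketch, assuming the logs are plain text and that a crash leaves one of the two messages seen in the stack trace in this report: "Floating point exception" from the signal handler, or "forrtl: error (75)" from the Intel runtime. The function names and the `**/*.log` layout are hypothetical and would need adjusting to the actual run directories.

```python
import re
from pathlib import Path

# Messages taken from the stack trace in this report; other compilers or
# runtimes may word the failure differently.
FPE_PATTERN = re.compile(
    r"Floating point exception|forrtl: error \(75\)", re.IGNORECASE
)

def find_fpe_lines(text):
    """Return (line_number, line) pairs that look like floating-point exceptions."""
    return [
        (n, line)
        for n, line in enumerate(text.splitlines(), start=1)
        if FPE_PATTERN.search(line)
    ]

def scan_logs(root, glob="**/*.log"):  # hypothetical directory layout
    """Map each log file under root to its matching lines, skipping clean files."""
    hits = {}
    for path in sorted(Path(root).glob(glob)):
        found = find_fpe_lines(path.read_text(errors="replace"))
        if found:
            hits[str(path)] = found
    return hits
```

Calling `scan_logs` on the directory that holds the retained run directories returns only the logs that contain a match, so an empty result means no exception message was found in the files scanned.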

The fix for the regression test system is in this PR:

  • https://github.com/ufs-community/ufs-weather-model/pull/2335

That is being tested now. Once it's merged, model crashes will be detectable in regression tests once again.

Expected behavior

Model runs to completion when compiled in debug mode.

System Environment

UFS Weather Model regression test system with Intel compiler on Hera. That's Intel 2021.5.0 with IMPI 2021.5.1 and FMS 2023.04 using Spack Stack 1.6.0.

Here's the uname -a output from a login node:

Linux hfe09 4.18.0-477.27.1.el8_8.x86_64 #1 SMP Wed Sep 20 15:55:39 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Additional context

Can't think of anything.

SamuelTrahanNOAA, Jul 11 '24 16:07