ufs-weather-model
ufs-weather-model copied to clipboard
Add HAFS related debug and threading regression tests
Description
Add HAFS related debug, threading regression tests.
Solution
Add HAFS related debug, threading regression tests based on existing HAFS regression tests.
Alternatives
Depend upon existing debug, threading regression tests from other applications (mrweather, srweather, S2S, etc.)
Related to
- associated with ufs-community/ufs-weather-model/pull/<pr_number>
@BinLiu-NOAA Can we close this issue?
Hi @junwang-noaa @BinLiu-NOAA @hyunsookkim-NOAA
I am working on the debugging regression test for HAFS ATM-OCN coupling using UFSATM and HYCOM. I am using the following compiler flags and macros when building HYCOM with Intel compilers mpiifort -DDEBUG -DENDIAN_IO -DEOS_17T -DEOS_SIG2 -DESPC_COUPLE -DIA32 -DMPI -DNAN2003 -DREAL8 -DRELO -DSERIAL_IO -DTIME -g -traceback -xSSE4.2 -mcmodel=small -r8 -O0 -check -check noarg_temp_created -check nopointer -fpe0 -ftrapuv -link_mpi=dbg -init=snan,arrays
I found two bugs related to variables being initialized with NaNs (most likely outside the bounds of the computations and therefore not crashing the coupled model)
geopar.F90 LINE:677
During the calculation of kapi, util2 is not set outside of (1:ii,1:jj). The copy from util2 to kapi crashes due to snan. I tested by adding util2(:,:) = -99
to fill util2 before calculating kapi and this fixed the bug.
mod_momtum.F90 LINE:1697
dpmx is only initialized for (1-margin:ii+margin,,1-margin). Later calculations in this subroutine fail when using dpmx outside of these bounds due to snan. I tested by initializing the entire dpmx array with dpmx(:,:)=2.*cutoff
and fixed the bug.
I need to review these bug fixes with Navy and Hyun-Sook.
One more note is that the -DDEBUG run for hafs_regional_atm_ocn exceeded the default wallclock time. I believe the run took 110 minutes (the default being 30). How do you want to handle this? Maybe the debugging checks I added to HYCOM are too strict?
@danrosen25 Thanks for looking into the debug issues, which need to be address for operational models. The debug test does run much slower, we usually cut the forecast length to make sure it fits in 30mins wall clock window.
@junwang-noaa I believe the HAFS atm-ocn configuration completed about 1.5 hours within the 30 minute window, would that be sufficient?
That's fine. For your reference, we run 1 hr forecast for global control debug test.
@danrosen25 @junwang-noaa @BinLiu-NOAA , The geopar.F90 code includes "mod_cb_arrays" which initializes variables, including util2. For dpmx, it is definitely initialized insize the "momtum" subroutine. But, it is initialied with r_init in the "momtum_init" in mod_momtum.F90. Can you make it consistent? In other words, use r_init in place of 2*cutoff.
@hyunsookkim-NOAA
Both util2 and dpmx are already initialized with r_init but r_init is set to a nan (due to DNAN2003
). Rethinking what I said earlier the floating point check is catching the quiet nan and reporting floating point invalid. Either way we can't initialize to a NaN (r_init) and use the values on the righthand side of an equation. util2 and dpmx need values before they're used.
https://github.com/hafs-community/HYCOM-src/blob/support/HAFS/mod_dimensions.F90#L266
I created a branch for the HYCOM debugging fix https://github.com/hafs-community/HYCOM-src/tree/fix/debug
@junwang-noaa @BinLiu-NOAA
Hi Jun, should I be using the compiler flags set for debugging here?
https://github.com/ufs-community/ufs-weather-model/blob/develop/cmake/Intel.cmake#L10
-O0 -check -check noarg_temp_created -check nopointer -warn -warn noerrors -fp-stack-check -fstack-protector-all -fpe0 -debug -ftrapuv -init=snan,arrays
or should I be using
-check all
or some combination of both?
Please use -check all
if you can.
@junwang-noaa @BinLiu-NOAA Hi Jun, should I be using the compiler flags set for debugging here? https://github.com/ufs-community/ufs-weather-model/blob/develop/cmake/Intel.cmake#L10
-O0 -check -check noarg_temp_created -check nopointer -warn -warn noerrors -fp-stack-check -fstack-protector-all -fpe0 -debug -ftrapuv -init=snan,arrays
or should I be using-check all
or some combination of both?
@junwang-noaa, is there a specific reason that ufs-weather-model currently does not use "-check all" build option for the debug build? Thanks!
@BinLiu-NOAA From our testing with GFSv16, We are actually using -check all,nopointer,noarg_temp_created for debug jobs. There is an omp parallel issue where multiple threads allocate the same array without the nopointer check, this is related to the impi version 2018. Without noarg_temp_created the log file is very large due to the warnings on creation of a temporary array, it may cause performance issue on fv3 and -check all could force the model to terminate as we saw in the ww3. Since we are using Impi 2021/2022, I suggested to try -check all first, if it does not work, then -check all,noarg_temp_created. After impui2019, the issue with -check nopointer should be fixed, but we haven't tested it yet. Thanks
@junwang-noaa, Thanks for the background and information! @danrosen25 and @BijuThomas-NOAA, appreciate it if you could follow Jun's suggestion when testing HAFS-HYCOM debug build. Thanks!
I tested HYCOM with -check all
or more specifically -DDEBUG -DENDIAN_IO -DEOS_17T -DEOS_SIG2 -DESPC_COUPLE -DIA32 -DMPI -DNAN2003 -DREAL8 -DRELO -DSERIAL_IO -DTIME -g -traceback -xSSE4.2 -mcmodel=small -r8 -O0 -check all -fpe0 -ftrapuv -link_mpi=dbg -init=snan,arrays
. The hafs_regional_atm_ocn regression test finished but the rt_regional_static_atm_ocn HAFS test failed due to insufficient virtual memory, which I think is fine.
I've only modified the HYCOM build. It looks like
set(CMAKE_Fortran_FLAGS_DEBUG "-O0 -check all -fpe0 -ftrapuv -link_mpi=dbg -init=snan,arrays")
Does this look good or would you like to inherit CMAKE_Fortran_FLAGS_DEBUG
from UFS also?
One more FYI for @BinLiu-NOAA @BijuThomas-NOAA, the Navy fixed the code and I cherry-picked their revision instead of using the code I committed. The fix/debug branch for HYCOM has been updated.
Hi @BijuThomas-NOAA @BinLiu-NOAA
Here are the branches I made with the HYCOM debug changes, they are branched from feature/hafsv0.3_final
https://github.com/hafs-community/ufs-weather-model/tree/fix/debug
https://github.com/hafs-community/HYCOM-src/tree/fix/debug
Once the 1 hr regression test is created and tested then we should merge these branches back.
I see the following error with ufs-weather-model RT (Orion):
This is a regional storm following single nest atm coupled with HyCOM and WW3, compiled the HyCOM with
debug flag (-check all, -g, -traceback).
srun --label -n 300 ./fv3.exe 60: [Orion-05-01:323295:0:323295] Caught signal 8 (Floating point exception: floating-point invalid operation) 63: 0 0x000000000825fa0d fv_moving_nest_mod_mp_calc_nest_halo_weights() /work/noaa/hwrf/save/bthomas/hafs.v0.3.0_hycom_debug_ufs_rt/sorc/hafs_forecast.fd/FV3/atmos_cubed_sphere/moving_nest/fv_moving_nest.F90:2876_
Here is the run directory on Orion: /work/noaa/hwrf/scrub/bthomas/stmp/bthomas/FV3_RT/rt_446906
@DeniseWorthen and @BinLiu-NOAA, the run was with ATM built with DEBUG=ON. However, the run with a static regional domain coupled with HyCOM succeeded in a 6-hour run. The reported issue is with the moving nest run. Do you want to do a test in which the ATM component is built without DEBUG flags, but HyCOM with DEBUG flags? Thanks,
This feature is requested by HAFS v1 implementation. Will check with Biju on DEBUG mode.
Tested again with DEBUG mode using the latest develop branch. Ended up with the error:
Caught signal 8 (Floating point exception: floating-point invalid operation)
0 0x00000000063e43d7 fv_moving_nest_mod_mp_calc_nest_halo_weights_() /work/noaa/hwrf/save/bthomas/hafsv1_phase3/sorc/hafs_forecast.fd/FV3/moving_nest/fv_moving_nest.F90:2510
In my previous tests, single domain configuration(no nests) succeeded with DEBUG mode. I think Bill is looking into Floating point exception.
@BinLiu-NOAA Is the debug mode working for HAFS?
@junwang-noaa, yes, I believe so. @BijuThomas-NOAA and @wramstrom have fixed and tested this capability recently. And I think the DEBUG build now works for HAFS. We do plan to add or convert one HAFS RT to test the debug build capability. Thanks!
A debug test was added in #1606. @BinLiu-NOAA Do you run with threads in your parallel run?
@BinLiu-NOAA Can we close this issue?
@junwang-noaa, please help to close this issue. Thanks!