ufs-weather-model icon indicating copy to clipboard operation
ufs-weather-model copied to clipboard

Add HAFS related debug and threading regression tests

Open BinLiu-NOAA opened this issue 3 years ago • 17 comments

Description

Add HAFS related debug, threading regression tests.

Solution

Add HAFS related debug, threading regression tests based on existing HAFS regression tests.

Alternatives

Depend upon existing debug, threading regression tests from other applications (mrweather, srweather, S2S, etc.)

Related to

  • associated with ufs-community/ufs-weather-model/pull/<pr_number>

BinLiu-NOAA avatar Aug 24 '21 17:08 BinLiu-NOAA

@BinLiu-NOAA Can we close this issue?

junwang-noaa avatar Dec 30 '21 01:12 junwang-noaa

Hi @junwang-noaa @BinLiu-NOAA @hyunsookkim-NOAA I am working on the debugging regression test for HAFS ATM-OCN coupling using UFSATM and HYCOM. I am using the following compiler flags and macros when building HYCOM with Intel compilers mpiifort -DDEBUG -DENDIAN_IO -DEOS_17T -DEOS_SIG2 -DESPC_COUPLE -DIA32 -DMPI -DNAN2003 -DREAL8 -DRELO -DSERIAL_IO -DTIME -g -traceback -xSSE4.2 -mcmodel=small -r8 -O0 -check -check noarg_temp_created -check nopointer -fpe0 -ftrapuv -link_mpi=dbg -init=snan,arrays

I found two bugs related to variables being initialized with NaNs (most likely outside the bounds of the computations and therefore not crashing the coupled model)

geopar.F90 LINE:677 During the calculation of kapi, util2 is not set outside of (1:ii,1:jj). The copy from util2 to kapi crashes due to snan. I tested by adding util2(:,:) = -99 to fill util2 before calculating kapi and this fixed the bug.

mod_momtum.F90 LINE:1697 dpmx is only initialized for (1-margin:ii+margin,,1-margin). Later calculations in this subroutine fail when using dpmx outside of these bounds due to snan. I tested by initializing the entire dpmx array with dpmx(:,:)=2.*cutoff and fixed the bug.

I need to review these bug fixes with Navy and Hyun-Sook.

One more note is that the -DDEBUG run for hafs_regional_atm_ocn exceeded the default wallclock time. I believe the run took 110 minutes (the default being 30). How do you want to handle this? Maybe the debugging checks I added to HYCOM are too strict?

danrosen25 avatar Jun 02 '22 15:06 danrosen25

@danrosen25 Thanks for looking into the debug issues, which need to be address for operational models. The debug test does run much slower, we usually cut the forecast length to make sure it fits in 30mins wall clock window.

junwang-noaa avatar Jun 02 '22 15:06 junwang-noaa

@junwang-noaa I believe the HAFS atm-ocn configuration completed about 1.5 hours within the 30 minute window, would that be sufficient?

danrosen25 avatar Jun 02 '22 15:06 danrosen25

That's fine. For your reference, we run 1 hr forecast for global control debug test.

junwang-noaa avatar Jun 02 '22 15:06 junwang-noaa

@danrosen25 @junwang-noaa @BinLiu-NOAA , The geopar.F90 code includes "mod_cb_arrays" which initializes variables, including util2. For dpmx, it is definitely initialized insize the "momtum" subroutine. But, it is initialied with r_init in the "momtum_init" in mod_momtum.F90. Can you make it consistent? In other words, use r_init in place of 2*cutoff.

hyunsookkim-NOAA avatar Jun 02 '22 16:06 hyunsookkim-NOAA

@hyunsookkim-NOAA Both util2 and dpmx are already initialized with r_init but r_init is set to a nan (due to DNAN2003). Rethinking what I said earlier the floating point check is catching the quiet nan and reporting floating point invalid. Either way we can't initialize to a NaN (r_init) and use the values on the righthand side of an equation. util2 and dpmx need values before they're used.

https://github.com/hafs-community/HYCOM-src/blob/support/HAFS/mod_dimensions.F90#L266

danrosen25 avatar Jun 02 '22 16:06 danrosen25

I created a branch for the HYCOM debugging fix https://github.com/hafs-community/HYCOM-src/tree/fix/debug

danrosen25 avatar Jun 02 '22 17:06 danrosen25

@junwang-noaa @BinLiu-NOAA Hi Jun, should I be using the compiler flags set for debugging here? https://github.com/ufs-community/ufs-weather-model/blob/develop/cmake/Intel.cmake#L10 -O0 -check -check noarg_temp_created -check nopointer -warn -warn noerrors -fp-stack-check -fstack-protector-all -fpe0 -debug -ftrapuv -init=snan,arrays or should I be using -check all or some combination of both?

danrosen25 avatar Jun 06 '22 19:06 danrosen25

Please use -check all if you can.

junwang-noaa avatar Jun 06 '22 19:06 junwang-noaa

@junwang-noaa @BinLiu-NOAA Hi Jun, should I be using the compiler flags set for debugging here? https://github.com/ufs-community/ufs-weather-model/blob/develop/cmake/Intel.cmake#L10 -O0 -check -check noarg_temp_created -check nopointer -warn -warn noerrors -fp-stack-check -fstack-protector-all -fpe0 -debug -ftrapuv -init=snan,arrays or should I be using -check all or some combination of both?

@junwang-noaa, is there a specific reason that ufs-weather-model currently does not use "-check all" build option for the debug build? Thanks!

BinLiu-NOAA avatar Jun 06 '22 20:06 BinLiu-NOAA

@BinLiu-NOAA From our testing with GFSv16, We are actually using -check all,nopointer,noarg_temp_created for debug jobs. There is an omp parallel issue where multiple threads allocate the same array without the nopointer check, this is related to the impi version 2018. Without noarg_temp_created the log file is very large due to the warnings on creation of a temporary array, it may cause performance issue on fv3 and -check all could force the model to terminate as we saw in the ww3. Since we are using Impi 2021/2022, I suggested to try -check all first, if it does not work, then -check all,noarg_temp_created. After impui2019, the issue with -check nopointer should be fixed, but we haven't tested it yet. Thanks

junwang-noaa avatar Jun 07 '22 00:06 junwang-noaa

@junwang-noaa, Thanks for the background and information! @danrosen25 and @BijuThomas-NOAA, appreciate it if you could follow Jun's suggestion when testing HAFS-HYCOM debug build. Thanks!

BinLiu-NOAA avatar Jun 07 '22 10:06 BinLiu-NOAA

I tested HYCOM with -check all or more specifically -DDEBUG -DENDIAN_IO -DEOS_17T -DEOS_SIG2 -DESPC_COUPLE -DIA32 -DMPI -DNAN2003 -DREAL8 -DRELO -DSERIAL_IO -DTIME -g -traceback -xSSE4.2 -mcmodel=small -r8 -O0 -check all -fpe0 -ftrapuv -link_mpi=dbg -init=snan,arrays. The hafs_regional_atm_ocn regression test finished but the rt_regional_static_atm_ocn HAFS test failed due to insufficient virtual memory, which I think is fine.

I've only modified the HYCOM build. It looks like set(CMAKE_Fortran_FLAGS_DEBUG "-O0 -check all -fpe0 -ftrapuv -link_mpi=dbg -init=snan,arrays") Does this look good or would you like to inherit CMAKE_Fortran_FLAGS_DEBUG from UFS also?

One more FYI for @BinLiu-NOAA @BijuThomas-NOAA, the Navy fixed the code and I cherry-picked their revision instead of using the code I committed. The fix/debug branch for HYCOM has been updated.

danrosen25 avatar Jun 07 '22 19:06 danrosen25

Hi @BijuThomas-NOAA @BinLiu-NOAA Here are the branches I made with the HYCOM debug changes, they are branched from feature/hafsv0.3_final https://github.com/hafs-community/ufs-weather-model/tree/fix/debug https://github.com/hafs-community/HYCOM-src/tree/fix/debug Once the 1 hr regression test is created and tested then we should merge these branches back.

danrosen25 avatar Jun 08 '22 19:06 danrosen25

I see the following error with ufs-weather-model RT (Orion):
This is a regional storm following single nest atm coupled with HyCOM and WW3, compiled the HyCOM with debug flag (-check all, -g, -traceback).

srun --label -n 300 ./fv3.exe 60: [Orion-05-01:323295:0:323295] Caught signal 8 (Floating point exception: floating-point invalid operation) 63: 0 0x000000000825fa0d fv_moving_nest_mod_mp_calc_nest_halo_weights() /work/noaa/hwrf/save/bthomas/hafs.v0.3.0_hycom_debug_ufs_rt/sorc/hafs_forecast.fd/FV3/atmos_cubed_sphere/moving_nest/fv_moving_nest.F90:2876_

Here is the run directory on Orion: /work/noaa/hwrf/scrub/bthomas/stmp/bthomas/FV3_RT/rt_446906

BijuThomas-NOAA avatar Aug 25 '22 19:08 BijuThomas-NOAA

@DeniseWorthen and @BinLiu-NOAA, the run was with ATM built with DEBUG=ON. However, the run with a static regional domain coupled with HyCOM succeeded in a 6-hour run. The reported issue is with the moving nest run. Do you want to do a test in which the ATM component is built without DEBUG flags, but HyCOM with DEBUG flags? Thanks,

BijuThomas-NOAA avatar Sep 09 '22 17:09 BijuThomas-NOAA

This feature is requested by HAFS v1 implementation. Will check with Biju on DEBUG mode.

junwang-noaa avatar Dec 19 '22 18:12 junwang-noaa

Tested again with DEBUG mode using the latest develop branch. Ended up with the error:

Caught signal 8 (Floating point exception: floating-point invalid operation)
 0 0x00000000063e43d7 fv_moving_nest_mod_mp_calc_nest_halo_weights_()  /work/noaa/hwrf/save/bthomas/hafsv1_phase3/sorc/hafs_forecast.fd/FV3/moving_nest/fv_moving_nest.F90:2510

In my previous tests, single domain configuration(no nests) succeeded with DEBUG mode. I think Bill is looking into Floating point exception.

BijuThomas-NOAA avatar Dec 21 '22 15:12 BijuThomas-NOAA

@BinLiu-NOAA Is the debug mode working for HAFS?

junwang-noaa avatar Jan 19 '23 17:01 junwang-noaa

@junwang-noaa, yes, I believe so. @BijuThomas-NOAA and @wramstrom have fixed and tested this capability recently. And I think the DEBUG build now works for HAFS. We do plan to add or convert one HAFS RT to test the debug build capability. Thanks!

BinLiu-NOAA avatar Jan 19 '23 17:01 BinLiu-NOAA

A debug test was added in #1606. @BinLiu-NOAA Do you run with threads in your parallel run?

junwang-noaa avatar Mar 20 '23 13:03 junwang-noaa

@BinLiu-NOAA Can we close this issue?

junwang-noaa avatar Oct 23 '23 15:10 junwang-noaa

@junwang-noaa, please help to close this issue. Thanks!

BinLiu-NOAA avatar Oct 23 '23 18:10 BinLiu-NOAA