E3SM
NaN-value issue with Cray compiler CCE/14.0.0 on Crusher: ELM EcosystemBalanceCheckMod.F90
Of the 57 e3sm_developer test cases, 21 passed and 36 failed.
Of the 36 failed tests, 16 failed with the "NaN-value" issue. The failed test cases are:
- ERS.f09_g16.I1850ELMCN.crusher_crayclang
- ERS.f09_g16.I1850GSWCNPRDCTCBC.crusher_crayclang.elm-vstrd
- ERS.f19_f19.I1850ELMCN.crusher_crayclang
- ERS.f19_f19.I20TRELMCN.crusher_crayclang
- ERS.f19_g16.I1850CNECACNTBC.crusher_crayclang.elm-eca
- ERS.f19_g16.I1850CNECACTCBC.crusher_crayclang.elm-eca
- ERS.f19_g16.I1850CNRDCTCBC.crusher_crayclang.elm-rd
- ERS.f19_g16.I1850GSWCNPECACNTBC.crusher_crayclang.elm-eca_f19_g16_I1850GSWCNPECACNTBC
- ERS.f19_g16.I1850GSWCNPRDCTCBC.crusher_crayclang.elm-ctc_f19_g16_I1850GSWCNPRDCTCBC
- ERS.f19_g16.I20TRGSWCNPECACNTBC.crusher_crayclang.elm-eca_f19_g16_I20TRGSWCNPECACNTBC
- ERS.f19_g16.I20TRGSWCNPRDCTCBC.crusher_crayclang.elm-ctc_f19_g16_I20TRGSWCNPRDCTCBC
- ERS.r05_r05.IELM.crusher_crayclang.elm-V2_ELM_MOSART_features
- SMS.ne4_oQU240.F2010.crusher_crayclang.eam-cosplite
- SMS.r05_r05.I1850ELMCN.crusher_crayclang.elm-qian_1948
- SMS_Ld1.hcru_hcru.I1850CRUELMCN.crusher_crayclang
- SMS_Ly2_P1x1.1x1_smallvilleIA.IELMCNCROP.crusher_crayclang.elm-lulcc_sville
This issue is related to https://github.com/E3SM-Project/E3SM/issues/4623
The error occurs in the following error check in “E3SM/components/elm/src/biogeochem/EcosystemBalanceCheckMod.F90”:
if (abs(errcb_grc(g)) > balance_check_tolerance) then
   err_found = .true.
   err_index = g
end if
Because errcb_grc(g) contains a NaN, the "if" test should evaluate to false under IEEE comparison rules, but execution takes the true branch instead.
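A minimal standalone sketch of the comparison in question, in case it helps to reproduce the behavior outside ELM. The program name, the r8 kind definition, and the tolerance value are assumptions made for illustration; only the shape of the comparison is taken from the routine above.

```fortran
program nan_compare_demo
  use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan, ieee_is_nan
  implicit none

  ! Assumed stand-ins for the ELM kind and tolerance (illustrative values only).
  integer, parameter :: r8 = selected_real_kind(12)
  real(r8), parameter :: balance_check_tolerance = 1.0e-8_r8

  real(r8) :: errcb
  logical  :: err_found

  ! Build a quiet NaN explicitly instead of relying on uninitialized memory.
  errcb = ieee_value(errcb, ieee_quiet_nan)

  ! Under IEEE 754 rules, an ordered comparison with a NaN operand is false.
  err_found = abs(errcb) > balance_check_tolerance

  print *, 'abs(NaN) > tolerance :', err_found
  print *, 'ieee_is_nan(errcb)   :', ieee_is_nan(errcb)
end program nan_compare_demo
```

Building this with and without -hfp0 should show whether the first line printed depends on the floating-point mode. The result may also vary with optimization level, so no expected output is given here.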
I created a Google Sheet that summarizes the new test results (e3sm_integration), with the issues color-coded.
https://docs.google.com/spreadsheets/d/1I0J9zUXCJufdnlxLSkXaww-w81ME4RLI_0cxKRh6a5U/edit?usp=sharing
@mt5555 Regarding SMS.ne4_oQU240.F2010.crusher_crayclang.eam-cosplite, is cosplite part of the target workload on Frontier?
Otherwise, looks like this issue impacts the I-cases mostly.
We should do some investigation and prioritize NaN issues for the Cray folks to dig into.
Adding -hfp0 to ELM builds is a reasonable global fix. However, ELM folks should not rely on conditional execution based on a NaN value. Although it accidentally works due to IEEE behavior with many compilers, there ought to be a more explicit check.
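One possible shape for a more explicit check is sketched below. It assumes the current intent is that NaN entries are skipped rather than reported, which matches what the existing test does under IEEE rules. The module, subroutine, and argument names are hypothetical and do not come from EcosystemBalanceCheckMod.F90.

```fortran
module balance_check_sketch
  use, intrinsic :: ieee_arithmetic, only: ieee_is_nan
  implicit none
  private
  public :: r8, find_balance_error

  integer, parameter :: r8 = selected_real_kind(12)  ! assumed stand-in for the ELM r8 kind

contains

  ! Hypothetical helper: scans gridcell errors, skipping NaN entries explicitly.
  subroutine find_balance_error(errcb_grc, tolerance, err_found, err_index)
    real(r8), intent(in)  :: errcb_grc(:)
    real(r8), intent(in)  :: tolerance
    logical,  intent(out) :: err_found
    integer,  intent(out) :: err_index
    integer :: g

    err_found = .false.
    err_index = 0
    do g = 1, size(errcb_grc)
      ! Nested IFs: Fortran does not guarantee short-circuit evaluation of .and.,
      ! so the NaN test is kept separate from the magnitude test.
      if (.not. ieee_is_nan(errcb_grc(g))) then
        if (abs(errcb_grc(g)) > tolerance) then
          err_found = .true.
          err_index = g
        end if
      end if
    end do
  end subroutine find_balance_error

end module balance_check_sketch

program balance_check_demo
  use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan
  use balance_check_sketch, only: r8, find_balance_error
  implicit none
  real(r8) :: errs(3)
  logical  :: found
  integer  :: idx

  errs(1) = 0.0_r8
  errs(2) = ieee_value(errs(2), ieee_quiet_nan)  ! NaN entry, skipped by the check
  errs(3) = 1.0_r8                               ! genuine balance error

  call find_balance_error(errs, 1.0e-8_r8, found, idx)
  print *, 'err_found =', found, ' err_index =', idx
end program balance_check_demo
```

If a NaN should instead count as a balance error, the outer test could set err_found directly. Which behavior is wanted is a question for the ELM developers.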
@peterdschwartz Perhaps you know who might be the best person to address this issue.
cc @thorntonpe @rljacob
I agree the practice in ELM of initializing to NaN and then using that in if statements needs to go. Why not initialize to some really large value like 10^8?
@rljacob There is a dedicated parameter spval=1.E+36 that is used for that purpose. Typically arrays will be initialized to spval when they're added to the history field list. So this issue and the DEBUG mode issues are due to variables that don't adhere to this common practice.
For example, in the code below, this%tcs_month_end is set to spval before it is registered, but this%errcb is not (this%errcb being the variable responsible for the error in @sarats's post).
this%tcs_month_end(begg:endg) = spval
call hist_addfld1d (fname='TCS_MONTH_END', units='mm', &
     avgflag='I', long_name='total carbon storage at the end of a month', &
     ptr_lnd=this%tcs_month_end)
call hist_addfld1d (fname='CMASS_BALANCE_ERROR', units='gC/m^2', &
     avgflag='A', long_name='Gridcell carbon mass balance error', &
     ptr_lnd=this%errcb)
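A rough sketch of what closing this gap could look like, written as a standalone program so it compiles on its own. The hist_addfld1d here is a hypothetical stub with a simplified signature, not the ELM routine, and the bounds and r8 kind are assumed values. Only spval=1.E+36 and the field attributes come from the thread.

```fortran
module hist_stub
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)   ! assumed stand-in for the ELM r8 kind
  real(r8), parameter :: spval = 1.e36_r8             ! special value quoted in the thread

contains

  ! Hypothetical stand-in for ELM's hist_addfld1d; it only reports what it was given.
  subroutine hist_addfld1d(fname, units, avgflag, long_name, ptr_lnd)
    character(len=*), intent(in) :: fname, units, avgflag, long_name
    real(r8), pointer :: ptr_lnd(:)
    print *, trim(fname), ' (', trim(units), ', avgflag=', trim(avgflag), '): ', &
             trim(long_name)
    print *, '  first value at registration: ', ptr_lnd(lbound(ptr_lnd, 1))
  end subroutine hist_addfld1d

end module hist_stub

program init_before_register
  use hist_stub
  implicit none
  integer, parameter :: begg = 1, endg = 4   ! assumed gridcell bounds
  real(r8), pointer :: errcb(:)

  allocate(errcb(begg:endg))

  ! Set the field to spval before registering it, mirroring tcs_month_end.
  errcb(begg:endg) = spval
  call hist_addfld1d (fname='CMASS_BALANCE_ERROR', units='gC/m^2', &
       avgflag='A', long_name='Gridcell carbon mass balance error', &
       ptr_lnd=errcb)

  deallocate(errcb)
end program init_before_register
```

One caveat with this approach: spval is far larger than any tolerance, so a test of the form abs(x) > tolerance would treat an untouched spval entry as an error. Any such check would presumably need to exclude spval entries explicitly.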
I could make a branch to fix these gaps in the initialization and/or get rid of the NaN initializations altogether. I don't currently have access to Crusher or that Cray compiler, but the fpe0 debug flag with the Intel compiler should catch these implicit NaN instances.
Getting rid of the NaN inits would be a good thing.
HPE recommends using the -hfp0 option whenever strict IEEE compliance is required. I tested several test cases with the option at -O0, -O1, and -O2 optimization and found no performance difference from using it.