MOM6 config working with GNU but not Intel compile
I am running a regional MOM6 configuration that includes sponges (u, v, tracer) and ice. Currently, the simulation fails if the code is compiled with Intel but runs (very slowly!) if compiled with GNU.
Backtrace for Intel compile, REPRO mode:

```
FATAL from PE 755: NaN in input field of reproducing_EFP_sum(_2d).
...
forrtl: error (76): Abort trap signal
Image   PC                Routine            Line     Source
MOM6    0000000001FBE3E4  Unknown            Unknown  Unknown
MOM6    000000000196F060  Unknown            Unknown  Unknown
MOM6    0000000001F943B0  Unknown            Unknown  Unknown
MOM6    00000000020950C1  Unknown            Unknown  Unknown
MOM6    0000000001DCC8B2  Unknown            Unknown  Unknown
MOM6    0000000001DAC52B  Unknown            Unknown  Unknown
MOM6    0000000001DAD9F5  Unknown            Unknown  Unknown
MOM6    00000000014C8171  mpp_mod_mp_mpp_er  69       mpp_util_mpi.inc
MOM6    0000000000D9F4FC  mom_coms_mp_repro  192      MOM_coms.F90
MOM6    0000000000B0737B  mom_sum_output_mp  1079     MOM_sum_output.F90
MOM6    00000000004A350E  mom_mp_step_mom_   943      MOM.F90
MOM6    0000000000C8D027  ocean_model_mod_m  583      ocean_model_MOM.F90
MOM6    0000000000CAA999  MAIN__             1082     coupler_main.F90
MOM6    00000000004015F2  Unknown            Unknown  Unknown
MOM6    000000000208E8CF  Unknown            Unknown  Unknown
MOM6    00000000004014DA  Unknown            Unknown  Unknown
```
Backtrace for Intel compile, DEBUG mode:

```
forrtl: error (73): floating divide by zero
Image   PC                Routine            Line     Source
MOM6    0000000006469D84  Unknown            Unknown  Unknown
MOM6    0000000005E19FA0  Unknown            Unknown  Unknown
MOM6    00000000007A5F20  regrid_edge_value  291      regrid_edge_values.F90
MOM6    00000000010AA203  mom_remapping_mp_  402      MOM_remapping.F90
MOM6    00000000010A16FB  mom_remapping_mp_  214      MOM_remapping.F90
MOM6    0000000000C25366  mom_ale_sponge_mp  1122     MOM_ALE_sponge.F90
MOM6    0000000003E7FE1B  mom_diabatic_driv  996      MOM_diabatic_driver.F90
MOM6    0000000003E3F1F7  mom_diabatic_driv  376      MOM_diabatic_driver.F90
MOM6    0000000000DADD17  mom_mp_step_mom_t  1314     MOM.F90
MOM6    0000000000D956E0  mom_mp_step_mom_   726      MOM.F90
MOM6    00000000023AA528  ocean_model_mod_m  583      ocean_model_MOM.F90
MOM6    000000000327374B  MAIN__             1082     coupler_main.F90
MOM6    00000000004015F2  Unknown            Unknown  Unknown
MOM6    0000000006538A0F  Unknown            Unknown  Unknown
MOM6    00000000004014DA  Unknown            Unknown  Unknown
```

or

```
forrtl: error (65): floating invalid
Image   PC                Routine            Line     Source
MOM6    0000000006466C84  Unknown            Unknown  Unknown
MOM6    0000000005E16EC0  Unknown            Unknown  Unknown
MOM6    0000000002B5C866  mom_open_boundary  3820     MOM_open_boundary.F90
MOM6    00000000041CFE2F  mom_boundary_upda  136      MOM_boundary_update.F90
MOM6    00000000011DEF26  mom_dynamics_spli  468      MOM_dynamics_split_RK2.F90
MOM6    000000000260C5E4  mom_mp_step_mom_d  1064     MOM.F90
MOM6    0000000002602C6B  mom_mp_step_mom_   784      MOM.F90
MOM6    0000000000579DA4  ocean_model_mod_m  583      ocean_model_MOM.F90
MOM6    00000000022522C7  MAIN__             1082     coupler_main.F90
MOM6    00000000004015F2  Unknown            Unknown  Unknown
MOM6    000000000653590F  Unknown            Unknown  Unknown
MOM6    00000000004014DA  Unknown            Unknown  Unknown
```
I've run your experiment (provided separately) with Intel and GNU, debug and repro, but all appear to be running without any problems.
Could this possibly have already been fixed? What is the hash of your MOM6? I am using 7883f633 (from 6 August).
I was finally able to replicate something here, though not exactly what you are seeing. These are from Intel 18 "debug":
```
target time yyyy/mm/dd hh:mm:ss= 1980/ 1/ 1  3:57:30
t1, t2, w1, w2=  1  2  0.989035087719298  1.096491228070175E-002
ibuf=  1  2
i1,i2=  1  2
forrtl: error (182): floating invalid - possible uninitialized real/complex variable.
Image   PC                Routine            Line     Source
MOM6    0000000005F4B71E  Unknown            Unknown  Unknown
MOM6    0000000005901AC0  Unknown            Unknown  Unknown
MOM6    000000000588F688  icepack_mechred_m  1239     icepack_mechred.F90
MOM6    00000000058804E8  icepack_mechred_m  386      icepack_mechred.F90
MOM6    0000000002C7151D  ice_ridging_mod_m  380      ice_ridge.F90
MOM6    0000000003442050  sis_transport_mp_  265      SIS_transport.F90
MOM6    0000000000C6A31E  sis_dyn_trans_mp_  615      SIS_dyn_trans.F90
MOM6    0000000002165A33  ice_model_mod_mp_  327      ice_model.F90
MOM6    0000000002161E56  ice_model_mod_mp_  156      ice_model.F90
MOM6    0000000000F3E4CC  MAIN__             1034     coupler_main.F90
MOM6    00000000004015EE  Unknown            Unknown  Unknown
MOM6    000000000601E34F  Unknown            Unknown  Unknown
MOM6    00000000004014DA  Unknown            Unknown  Unknown
```
I am using the GFDL branches of the packages, plus whatever flags are in the mkmf templates, so it's probably slightly different from what you are running.
It took quite a few timesteps for a problem to emerge (about 48 timesteps here).
The error here is coming from Icepack, so perhaps this is where the problem is arising. It seemed to happen at the 4-hour mark, although I don't see anything significant about that timestep. (Coupling steps are 1 hr.)
Will keep looking.
Edit: Not exactly sure why, but I now get your error in regrid_edge_values.F90. Could have been an unrelated initialization issue in Icepack which I can no longer replicate.
Somehow this line has resulted in a div-by-zero: https://github.com/NOAA-GFDL/MOM6/blob/e88cdaa785b9fab4e24286031bdd409caf673034/src/ALE/regrid_edge_values.F90#L291
The zero is from `h1 + (h2 + h3)`, even though there is a test here to ensure that `h2 + h3` is nonzero:
https://github.com/NOAA-GFDL/MOM6/blob/e88cdaa785b9fab4e24286031bdd409caf673034/src/ALE/regrid_edge_values.F90#L266-L276
I don't know how it happened, but somehow `h1` is not only negative, but exactly equal to the negative of `h2 + h3`:

```
136: h0          0.000000000000000E+000
136: h1         -5215.31998046875
136: h2          2.20500000000000
136: h3          5213.11498046875
136: h2+h3      5215.31998046875
136: h1+(h2+h3)  0.000000000000000E+000
```

While I could just add an additional check for `h1 + (h2 + h3) == 0`, I am guessing it might be more important to understand how this happened.
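A minimal Python sketch (with values modeled on the log above, and `h1` deliberately constructed as the exact negative of `h2 + h3`) of how the existing guard is defeated by exact cancellation:

```python
# h1 is a corrupted, negative "thickness" that is exactly -(h2 + h3),
# mirroring what the failing run printed.
h2 = 2.205
h3 = 5213.11498046875
h1 = -(h2 + h3)

assert h2 + h3 > 0.0        # the existing nonzero test on h2 + h3 passes...
total = h1 + (h2 + h3)
assert total == 0.0         # ...but the full sum cancels exactly,
                            # so any later 1.0 / total is a divide-by-zero
```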
I see that the line that is crashing is using the 2018_answers version of the code. There is a newer option within the code (generally set by selecting `DEFAULT_2018_ANSWERS = False`, or probably more specifically `REMAPPING_2018_ANSWERS = False`, in your MOM_input files) in which many of the regridding solvers are mathematically equivalent, but have been refactored to make them much more robust and less prone to problems arising from roundoff.
Basically, instead of naively doing Gaussian elimination for matrix solvers or just the standard Thomas algorithm for solving tridiagonal matrices, the newer versions expand the expressions to replace differences of numbers that might be very close together with other positive-definite expressions, effectively replacing `1 / (1*h - (1-gamma)*h)` with `1 / (gamma*h)`, where `h` is a positive-definite layer thickness and we can prove that `0 < gamma <= 1`. Because these algorithmic changes do change answers, we have added run-time flags to use these older versions of the solver expressions to reproduce old solutions, but perhaps we need to be more aggressive in discouraging their use.
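The roundoff point can be illustrated with a small Python sketch (made-up numbers, not MOM6 code): once `gamma` drops below machine epsilon, the naive difference collapses to exactly zero while the refactored form remains well defined.

```python
h = 100.0       # a positive layer thickness (arbitrary illustrative value)
gamma = 1e-17   # smaller than double-precision epsilon (~2.2e-16)

naive = 1.0 * h - (1.0 - gamma) * h   # (1.0 - gamma) rounds to exactly 1.0
refactored = gamma * h                # algebraically the same quantity

assert naive == 0.0        # 1.0 / naive would trap: floating divide by zero
assert refactored > 0.0    # 1.0 / refactored is perfectly safe
```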
This raises several questions for me: (1) Was there a deliberate choice to use the older (and less robust) 2018 versions of the expressions in this experiment? (2) If so, why? (3) If not, how is it that this particular line of code is being called? and (4) Is the negative thickness here a symptom of the use of the older and less robust solvers, or is it a symptom of something else that is unrelated to the remapping solvers?
@Hallberg-NOAA I don't think this is happening with `DEFAULT_2018_ANSWERS = True`. The division-by-zero is happening on L291, which is outside of the 2018-answers block. I was also able to replicate this error using the flags provided by Liz, which has all of the 2018 answer flags set to false.
The error appears to be coming from the ALE sponge, not the ALE code.
The issue seems to be that `mask_z(4,:,:)` has uninitialized values in its halos, which are then propagated to non-halo values of `mask_u(4,:,:)`, which eventually lead to the non-monotonic and unphysical reference thicknesses and failures inside the ALE code.
Although `mask_z` is initialized here:
https://github.com/NOAA-GFDL/MOM6/blob/90af3cbfeb33e1b58016fcc612ce82ae70cf14c9/src/parameterizations/vertical/MOM_ALE_sponge.F90#L907
the `horiz_interp_and_extrap_tracers` function call on the next line contains a deallocation and re-allocation of `mask_z` (among other variables). When that happens, we should assume that the values get wiped out and replaced with random numbers, although the actual behavior is going to depend on the compiler.
In this case, GCC is probably just re-using the old value, which is why it still works, whereas Intel is allocating a new part of the heap, with random model-crashing values.
In any case, we have to assume that the array is uninitialized, and we either need to re-initialize these values, or we need to avoid re-allocating the arrays inside of the function.
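The caller-side discipline can be sketched in Python with a hypothetical stand-in for `horiz_interp_and_extrap_tracers` (the names and shapes here are illustrative, not the actual MOM6 interfaces): any array the callee may deallocate and re-allocate must be (re-)initialized by the caller after the call, not before.

```python
def horiz_interp_and_extrap(state):
    # Hypothetical stand-in for horiz_interp_and_extrap_tracers: it
    # deallocates and re-allocates mask_z internally, so whatever the
    # caller stored there is discarded. (Fortran would leave the new
    # memory uninitialized; None here marks "arbitrary contents".)
    state["mask_z"] = [[None] * 8 for _ in range(4)]

state = {"mask_z": [[0.0] * 8 for _ in range(4)]}  # careful initialization...
horiz_interp_and_extrap(state)                     # ...silently discarded here

# Fix: (re-)initialize after any call that may re-allocate the array
state["mask_z"] = [[0.0] * 8 for _ in range(4)]
assert all(v == 0.0 for row in state["mask_z"] for v in row)
```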
Having said all that, even if I re-initialize this array, I see the icepack error later in the run (at ~hour 12) so we are not yet through this problem.
Following up on this issue:
- PR #1495 appears to resolve the original issue in `regrid_edges`
- After fixing this, I can now run the model for ~12 hours. After that, I can consistently replicate the error in Icepack which I saw in my original tests.
- SIS2 PR #163 (and possibly MOM6 PR #1493) seems to address some other SIS2 initialization issues, which helps to reproduce the Icepack error, although I don't think they are related to the error reported by Icepack.
I have looked further into this error, and it seems that the SIS2 field `trcrn(1,:)` has uninitialized values inside it. These get passed into the Icepack library, where the bad initialization is detected.
I do not believe that this is due to icepack, but is rather due to SIS2. I believe something is happening in this code block:
https://github.com/NOAA-GFDL/SIS2/blob/ae959ed3efb31dc4f46a6c07fa56e74885bbdcb0/src/ice_ridge.F90#L324-L349
I am not very familiar with SIS2, but I will keep looking and see if anything makes sense here. There is a lot of pointer manipulation going on, so it's possible that something is not being set properly.
I've looked at this more closely. This seems to be a bug in SIS2.
The `TrReg` (tracer registry) always starts with the `Tr_ice_enth_ptr`, but for reasons that I cannot understand, it only fills the first category:
```
324   ntrcr = 0
      ...
327   ntrcr = ntrcr+1
328   trcrn(ntrcr,1) = Tr_ice_enth_ptr(i,j,1,1) ! surface temperature? this is taken from the thinnest ice category
      ...
331   do k=1,nL_ice ; ... ; enddo
337   do k=1,nL_snow ; ... ; enddo
343   do k=1,nL_ice ; ... ; enddo
```
If I just add this:

```
trcrn(ntrcr,2:nCat) = 0.
```

then the model appears to run (so far at least, up to day 2), although I am not sure if this is the preferred solution. We should talk more on Monday.
This commit appears to have fixed the issue: https://github.com/NOAA-GFDL/SIS2/commit/fa3f6b19920684a4bd172a05d1824aca18a9ff56
(This was in PR https://github.com/NOAA-GFDL/SIS2/pull/179)
I feel comfortable that all of the issues here have been resolved, so I will close the issue.