
MOM6 config working with gnu, but not intel compile

amoebaliz opened this issue 2 years ago • 8 comments

I am running a regional MOM6 configuration that includes sponges (u, v, tracer) and ice. Currently, the simulation fails if the code is compiled with intel but runs (very slowly!) if compiled with gnu.

Backtrace for intel compile, REPRO mode:

FATAL from PE   755: NaN in input field of reproducing_EFP_sum(_2d).
...
forrtl: error (76): Abort trap signal
Image              PC                Routine            Line        Source
MOM6               0000000001FBE3E4  Unknown               Unknown  Unknown
MOM6               000000000196F060  Unknown               Unknown  Unknown
MOM6               0000000001F943B0  Unknown               Unknown  Unknown
MOM6               00000000020950C1  Unknown               Unknown  Unknown
MOM6               0000000001DCC8B2  Unknown               Unknown  Unknown
MOM6               0000000001DAC52B  Unknown               Unknown  Unknown
MOM6               0000000001DAD9F5  Unknown               Unknown  Unknown
MOM6               00000000014C8171  mpp_mod_mp_mpp_er          69  mpp_util_mpi.inc
MOM6               0000000000D9F4FC  mom_coms_mp_repro         192  MOM_coms.F90
MOM6               0000000000B0737B  mom_sum_output_mp        1079  MOM_sum_output.F90
MOM6               00000000004A350E  mom_mp_step_mom_          943  MOM.F90
MOM6               0000000000C8D027  ocean_model_mod_m         583  ocean_model_MOM.F90
MOM6               0000000000CAA999  MAIN__                   1082  coupler_main.F90
MOM6               00000000004015F2  Unknown               Unknown  Unknown
MOM6               000000000208E8CF  Unknown               Unknown  Unknown
MOM6               00000000004014DA  Unknown               Unknown  Unknown

Backtrace for intel compile, DEBUG mode:

forrtl: error (73): floating divide by zero
Image              PC                Routine            Line        Source
MOM6               0000000006469D84  Unknown               Unknown  Unknown
MOM6               0000000005E19FA0  Unknown               Unknown  Unknown
MOM6               00000000007A5F20  regrid_edge_value         291  regrid_edge_values.F90
MOM6               00000000010AA203  mom_remapping_mp_         402  MOM_remapping.F90
MOM6               00000000010A16FB  mom_remapping_mp_         214  MOM_remapping.F90
MOM6               0000000000C25366  mom_ale_sponge_mp        1122  MOM_ALE_sponge.F90
MOM6               0000000003E7FE1B  mom_diabatic_driv         996  MOM_diabatic_driver.F90
MOM6               0000000003E3F1F7  mom_diabatic_driv         376  MOM_diabatic_driver.F90
MOM6               0000000000DADD17  mom_mp_step_mom_t        1314  MOM.F90
MOM6               0000000000D956E0  mom_mp_step_mom_          726  MOM.F90
MOM6               00000000023AA528  ocean_model_mod_m         583  ocean_model_MOM.F90
MOM6               000000000327374B  MAIN__                   1082  coupler_main.F90
MOM6               00000000004015F2  Unknown               Unknown  Unknown
MOM6               0000000006538A0F  Unknown               Unknown  Unknown
MOM6               00000000004014DA  Unknown               Unknown  Unknown

or

forrtl: error (65): floating invalid
Image              PC                Routine            Line        Source
MOM6               0000000006466C84  Unknown               Unknown  Unknown
MOM6               0000000005E16EC0  Unknown               Unknown  Unknown
MOM6               0000000002B5C866  mom_open_boundary        3820  MOM_open_boundary.F90
MOM6               00000000041CFE2F  mom_boundary_upda         136  MOM_boundary_update.F90
MOM6               00000000011DEF26  mom_dynamics_spli         468  MOM_dynamics_split_RK2.F90
MOM6               000000000260C5E4  mom_mp_step_mom_d        1064  MOM.F90
MOM6               0000000002602C6B  mom_mp_step_mom_          784  MOM.F90
MOM6               0000000000579DA4  ocean_model_mod_m         583  ocean_model_MOM.F90
MOM6               00000000022522C7  MAIN__                   1082  coupler_main.F90
MOM6               00000000004015F2  Unknown               Unknown  Unknown
MOM6               000000000653590F  Unknown               Unknown  Unknown
MOM6               00000000004014DA  Unknown               Unknown  Unknown

amoebaliz avatar Aug 04 '21 18:08 amoebaliz

I've run your experiment (provided separately) with Intel and GNU, debug and repro, but all appear to be running without any problems.

Could this possibly have already been fixed? What is the hash of your MOM6? I am using 7883f633 (from 6 August).

marshallward avatar Aug 12 '21 13:08 marshallward

I was finally able to replicate something here, though not exactly what you are seeing. These are from Intel 18 "debug":

target time yyyy/mm/dd hh:mm:ss= 1980/ 1/ 1  3:57:30
 t1, t2, w1, w2=            1           2  0.989035087719298
  1.096491228070175E-002
 ibuf=            1           2
 i1,i2=            1           2
forrtl: error (182): floating invalid - possible uninitialized real/complex variable.
Image              PC                Routine            Line        Source             
MOM6               0000000005F4B71E  Unknown               Unknown  Unknown
MOM6               0000000005901AC0  Unknown               Unknown  Unknown
MOM6               000000000588F688  icepack_mechred_m        1239  icepack_mechred.F90
MOM6               00000000058804E8  icepack_mechred_m         386  icepack_mechred.F90
MOM6               0000000002C7151D  ice_ridging_mod_m         380  ice_ridge.F90
MOM6               0000000003442050  sis_transport_mp_         265  SIS_transport.F90
MOM6               0000000000C6A31E  sis_dyn_trans_mp_         615  SIS_dyn_trans.F90
MOM6               0000000002165A33  ice_model_mod_mp_         327  ice_model.F90
MOM6               0000000002161E56  ice_model_mod_mp_         156  ice_model.F90
MOM6               0000000000F3E4CC  MAIN__                   1034  coupler_main.F90
MOM6               00000000004015EE  Unknown               Unknown  Unknown
MOM6               000000000601E34F  Unknown               Unknown  Unknown
MOM6               00000000004014DA  Unknown               Unknown  Unknown

I am using the GFDL branches of the packages, plus whatever flags are in the mkmf templates, so it's probably slightly different from what you are running.

It took quite a few timesteps for a problem to emerge (roughly 48 timesteps here).

The error here is coming from Icepack, so perhaps this is where the problem is arising. It seemed to happen at the 4-hour mark, although I don't see anything significant about that timestep. (Coupling steps are 1 hr.)

Will keep looking.

Edit: Not exactly sure why, but I now get your error in regrid_edge_values.F90. Could have been an unrelated initialization issue in Icepack which I can no longer replicate.

marshallward avatar Aug 13 '21 13:08 marshallward

Somehow this line has resulted in a div-by-zero: https://github.com/NOAA-GFDL/MOM6/blob/e88cdaa785b9fab4e24286031bdd409caf673034/src/ALE/regrid_edge_values.F90#L291

The zero is from h1 + (h2 + h3), even though there is a test here to ensure that h2 + h3 is nonzero: https://github.com/NOAA-GFDL/MOM6/blob/e88cdaa785b9fab4e24286031bdd409caf673034/src/ALE/regrid_edge_values.F90#L266-L276

I don't know how it happened, but somehow h1 is not only negative, but exactly equal to the negative of h2+h3:

136:  h0  0.000000000000000E+000
136:  h1  -5215.31998046875     
136:  h2   2.20500000000000     
136:  h3   5213.11498046875     
136:  h2+h3   5215.31998046875     
136:  h1+(h2+h3)  0.000000000000000E+000

While I could just add an additional check for h1+(h2+h3) == 0, I am guessing it might be more important to understand how this happened.
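
For illustration only, here is a minimal standalone sketch of that kind of check. The h_neglect fallback value is an assumption made for the example (not necessarily the right fix), and the exact cancellation is constructed rather than taken from the run above:

program guard_example
  implicit none
  real(8) :: h1, h2, h3, den
  real(8), parameter :: h_neglect = 1.0d-30   ! illustrative tiny thickness, not a MOM6 parameter

  ! Construct the pathological case reported above: h1 exactly cancels h2+h3.
  h2 = 2.205d0
  h3 = 5213.11498046875d0
  h1 = -(h2 + h3)

  den = h1 + (h2 + h3)                 ! exactly zero by construction
  if (den == 0.d0) den = h_neglect     ! the additional check discussed above

  print *, 'h1+(h2+h3)    =', h1 + (h2 + h3)
  print *, 'guarded 1/den =', 1.d0 / den
end program guard_example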

marshallward avatar Aug 13 '21 19:08 marshallward

I see that the line that is crashing is using the 2018_answers version of the code. There is a newer option within the code (generally enabled by setting DEFAULT_2018_ANSWERS = False, or more specifically REMAPPING_2018_ANSWERS = False, in your MOM_input files) in which many of the regridding solvers are mathematically equivalent to the older ones but have been refactored to be much more robust and far less prone to problems arising from roundoff.

Basically, instead of naively doing Gaussian elimination for the matrix solvers, or just the standard Thomas algorithm for solving tridiagonal matrices, the newer versions algebraically expand the expressions so that differences of numbers that might be very close together are replaced with other positive-definite expressions, effectively replacing 1 / (1*h - (1-gamma)*h) with 1 / (gamma*h), where h is a positive-definite layer thickness and we can prove that 0 < gamma <= 1. Because these algorithmic changes do change answers, we have added run-time flags to use the older versions of the solver expressions so that old solutions can be reproduced, but perhaps we need to be more aggressive in discouraging their use.
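
As a toy illustration of the point (not MOM6 code, and the numbers are made up): once gamma is small enough, the subtraction in the old form rounds to exactly zero, while the refactored form stays positive-definite.

program roundoff_example
  implicit none
  real(8) :: h, gamma, old_form, new_form

  h = 100.d0
  gamma = 1.0d-17          ! small enough that (1 - gamma) rounds to exactly 1

  old_form = 1.d0 / (1.d0*h - (1.d0 - gamma)*h)   ! denominator rounds to zero
  new_form = 1.d0 / (gamma*h)                     ! positive-definite by construction

  ! With default (non-trapping) settings the first print shows Infinity;
  ! with the trapping flags used in DEBUG builds it would abort instead.
  print *, 'old form:', old_form
  print *, 'new form:', new_form
end program roundoff_example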

This raises several questions for me: (1) Was there a deliberate choice to use the older (and less robust) 2018 versions of the expressions in this experiment? (2) If so, why? (3) If not, how is it that this particular line of code is being called? and (4) Is the negative thickness here a symptom of the use of the older and less robust solvers, or is it a symptom of something else that is unrelated to the remapping solvers?

Hallberg-NOAA avatar Aug 15 '21 08:08 Hallberg-NOAA

@Hallberg-NOAA I don't think this is happening with DEFAULT_2018_ANSWERS = True. The division-by-zero is happening on L291, which is outside of the 2018-answers block. I was also able to replicate this error using the flags provided by Liz, which have all of the 2018-answers flags set to false.

marshallward avatar Aug 15 '21 13:08 marshallward

The error appears to be coming from the ALE sponge, not the ALE code.

The issue seems to be that mask_z(4,:,:) has uninitialized values in its halos, which are then propagated to non-halo values of mask_u(4,:,:), and these eventually lead to the non-monotonic and unphysical reference thicknesses and to the failures inside the ALE code.

Although mask_z is initialized here:

https://github.com/NOAA-GFDL/MOM6/blob/90af3cbfeb33e1b58016fcc612ce82ae70cf14c9/src/parameterizations/vertical/MOM_ALE_sponge.F90#L907

the horiz_interp_and_extrap_tracers call on the next line deallocates and re-allocates mask_z (among other variables). When that happens, we should assume that the previous values are lost and the new contents are undefined; the actual behavior is going to depend on the compiler.

In this case, GCC is probably just re-using the old memory, which is why the run still works, whereas Intel is allocating a fresh region of the heap containing garbage, model-crashing values.

In any case, we have to assume that the array is uninitialized, and we either need to re-initialize these values or we need to avoid re-allocating the arrays inside of the function.
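
Here is a minimal standalone illustration of the hazard. interp_like is a hypothetical stand-in for horiz_interp_and_extrap_tracers, and the array shapes and halo/interior split are made up; only the deallocate/re-allocate pattern is the point.

program realloc_hazard
  implicit none
  real, allocatable :: mask_z(:,:,:)

  allocate(mask_z(4,8,8))
  mask_z(:,:,:) = 0.          ! initialized before the call, as in MOM_ALE_sponge

  call interp_like(mask_z)    ! stand-in for horiz_interp_and_extrap_tracers

  ! After the re-allocation inside the call, any element the callee did not
  ! explicitly set is undefined.  Whether it happens to still look like 0.
  ! depends on the compiler and the allocator, which is why GNU and Intel
  ! behave differently here.
  print *, 'elements the callee did not set are no longer guaranteed to be 0.'

contains

  subroutine interp_like(mask)
    real, allocatable, intent(inout) :: mask(:,:,:)
    ! Simplified stand-in: the real routine deallocates and re-allocates its
    ! mask argument (among other arrays) and then fills only part of it.
    deallocate(mask)
    allocate(mask(4,8,8))
    mask(:,2:7,2:7) = 1.      ! "interior" filled; "halo" ring left undefined
  end subroutine interp_like

end program realloc_hazard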


Having said all that, even if I re-initialize this array, I see the icepack error later in the run (at ~hour 12) so we are not yet through this problem.

marshallward avatar Aug 27 '21 14:08 marshallward

Following up on this issue:

  • PR #1495 appears to resolve the original issue in regrid_edge_values
  • After fixing this, I can now run the model for ~12 hours. After that point, I can consistently replicate the error in Icepack that I saw in my original tests.
  • The SIS2 PR #163 (and possibly MOM6 #1493) seem to help address some other SIS2 initialization issues, which helps to reproduce the Icepack error, although I don't think they are related to the error reported by Icepack.

I have looked further into this error, and it seems that the SIS2 field trcrn(1,:) has uninitialized values inside it. These get passed into the icepack library, where the bad initialization is detected.

I do not believe that this is due to icepack, but is rather due to SIS2. I believe something is happening in this code block:

https://github.com/NOAA-GFDL/SIS2/blob/ae959ed3efb31dc4f46a6c07fa56e74885bbdcb0/src/ice_ridge.F90#L324-L349

I am not very familiar with SIS2, but I will keep looking and see if anything makes sense here. There is a lot of pointer manipulation going on, so it's possible that something is not being set properly.

marshallward avatar Sep 04 '21 01:09 marshallward

I've looked at this more closely. This seems to be a bug in SIS2.

The TrReg (tracer registry) always starts with the Tr_ice_enth_ptr, but for reasons that I cannot understand, it only fills the first category:

324       ntrcr = 0
...
327         ntrcr=ntrcr+1
328         trcrn(ntrcr,1) = Tr_ice_enth_ptr(i,j,1,1) ! surface temperature? this is taken from the thinnest ice category
...
331         do k=1,nL_ice ; ... ; enddo
337         do k=1,nL_snow ; ... ; enddo
343         do k=1,nL_ice ; ... ; enddo

If I just add this:

trcrn(ntrcr,2:nCat) = 0.

then the model appears to run (so far at least; up to day 2) although I am not sure if this is the preferred solution. We should talk more on Monday.
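
For clarity, here is a hedged sketch of how the proposed assignment could sit next to the existing category-1 fill; the placement, the surrounding i,j loop, and the declarations are assumptions, not the actual SIS2 change:

ntrcr = ntrcr + 1
trcrn(ntrcr,1)      = Tr_ice_enth_ptr(i,j,1,1)  ! existing line 328: filled from the thinnest ice category
trcrn(ntrcr,2:nCat) = 0.                        ! proposed: give the remaining categories a defined value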

marshallward avatar Sep 04 '21 12:09 marshallward

This commit appears to have fixed the issue: https://github.com/NOAA-GFDL/SIS2/commit/fa3f6b19920684a4bd172a05d1824aca18a9ff56

(This was in PR https://github.com/NOAA-GFDL/SIS2/pull/179)

marshallward avatar Oct 31 '22 17:10 marshallward

I feel comfortable that all of the issues here have been resolved, so I will close the issue.

marshallward avatar Oct 31 '22 17:10 marshallward