eamxx: NaN check for field surf_radiative_T with ne120 and ne256 monogrid
Using the ne120 monogrid and the recently created 128-vertical-level IC, I tried to run longer on pm-gpu, but it failed.
I have two cases that fail the same way, one with 43 nodes and one using 85 nodes.
Both stop at model date = 00010630, step 17336.
First I see the "Tl1_1 has 1 values <= allowable value" warning, immediately followed by the CAAR warnings:
0: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
119: WARNING: Tl1_1 has 1 values <= allowable value. Resetting to minimum value.
106: WARNING:CAAR: k=128,theta(k)=99.951741<100.000000=th_thresh, applying limiter
106: WARNING:CAAR: k=128,theta(k)=99.524598<100.000000=th_thresh, applying limiter
106: WARNING:CAAR: k=128,theta(k)=98.575935<100.000000=th_thresh, applying limiter
And then the first Error! message:
106: what(): /global/cfs/cdirs/e3sm/ndk/repos/se53-feb13/components/eamxx/src/share/atm_process/atmosphere_process.cpp:307: FAIL:
106: false
106: Error! Failed post-condition property check (cannot be repaired).
106: - Atmosphere process name: SurfaceCouplingImporter
106: - Property check name: NaN check for field surf_radiative_T
106: - Atmosphere process MPI Rank: 106
106: - Message: FieldNaNCheck failed.
106: - field id: surf_radiative_T[Physics PG2] <double:COL>(2008) [K]
106: - entry (179399)
106: - lat/lon: (-32.700373, 291.937718)
106:
106: *************************** INPUT FIELDS ******************************
106:
106: ------- INPUT FIELDS -------
106:
106: ************************** OUTPUT FIELDS ******************************
106: T_2m<COL>(2008)
106:
106: T_2m(358)
106: 147.481,
106: -----------------------------------------------------------------------
106: qv_2m<COL>(2008)
106:
106: qv_2m(358)
106: 1.99301e-07,
106: -----------------------------------------------------------------------
106: sfc_alb_dif_nir<COL>(2008)
106:
106: sfc_alb_dif_nir(358)
106: 0.222123,
106: -----------------------------------------------------------------------
106: sfc_alb_dif_vis<COL>(2008)
106:
106: sfc_alb_dif_vis(358)
106: 0.133637,
106: -----------------------------------------------------------------------
106: sfc_alb_dir_nir<COL>(2008)
106:
106: sfc_alb_dir_nir(358)
106: 0.216921,
106: -----------------------------------------------------------------------
106: sfc_alb_dir_vis<COL>(2008)
106:
106: sfc_alb_dir_vis(358)
106: 0.133422,
106: -----------------------------------------------------------------------
106: snow_depth_land<COL>(2008)
106:
106: snow_depth_land(358)
106: 0.000135651,
106: -----------------------------------------------------------------------
106: surf_evap<COL>(2008)
106:
106: surf_evap(358)
106: -5.29259e-07,
106: -----------------------------------------------------------------------
106: surf_lw_flux_up<COL>(2008)
106:
106: surf_lw_flux_up(358)
106: -10.8045,
106: -----------------------------------------------------------------------
106: surf_mom_flux<COL,CMP>(2008,2)
106:
106: surf_mom_flux(358,:)
106: -8.29575, -2.37998,
106: -----------------------------------------------------------------------
106: surf_radiative_T<COL>(2008)
106:
106: surf_radiative_T(358)
106: -nan,
106: -----------------------------------------------------------------------
106: surf_sens_flux<COL>(2008)
106:
106: surf_sens_flux(358)
106: 3027.75,
106: -----------------------------------------------------------------------
106: wind_speed_10m<COL>(2008)
106:
106: wind_speed_10m(358)
106: 30.1183,
106: -----------------------------------------------------------------------
/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se53-feb13/t.se53-feb13.F2010-SCREAMv1.ne120pg2_ne120pg2.pm-gpu.n043t4xX.L128
With the ne256 monogrid, which we were already using, I also see a similar error. It happens at approximately the same point, model date = 00010525 (with ne120 failing about a month after this).
0: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
311: WARNING: Tl1_1 has 1 values <= allowable value. Resetting to minimum value.
285: WARNING:CAAR: k=128,theta(k)=99.975185<100.000000=th_thresh, applying limiter
285: WARNING:CAAR: k=128,theta(k)=99.975185<100.000000=th_thresh, applying limiter
285: WARNING:CAAR: k=128,theta(k)=99.723051<100.000000=th_thresh, applying limiter
...
236: what(): /global/cfs/cdirs/e3sm/ndk/repos/se53-feb13/components/eamxx/src/share/atm_process/atmosphere_process.cpp:307: FAIL:
236: false
236: Error! Failed post-condition property check (cannot be repaired).
236: - Atmosphere process name: SurfaceCouplingImporter
236: - Property check name: NaN check for field surf_radiative_T
236: - Atmosphere process MPI Rank: 236
236: - Message: FieldNaNCheck failed.
236: - field id: surf_radiative_T[Physics PG2] <double:COL>(4096) [K]
236: - entry (815857)
236: - lat/lon: (-33.207194, 291.181688)
236:
236: *************************** INPUT FIELDS ******************************
236:
236: ------- INPUT FIELDS -------
236:
236: ************************** OUTPUT FIELDS ******************************
236: T_2m<COL>(4096)
236:
236: T_2m(3696)
236: 152.521,
236: -----------------------------------------------------------------------
236: qv_2m<COL>(4096)
236:
236: qv_2m(3696)
236: 1.44605e-06,
236: -----------------------------------------------------------------------
236: sfc_alb_dif_nir<COL>(4096)
236:
236: sfc_alb_dif_nir(3696)
236: 0.221189,
236: -----------------------------------------------------------------------
236: sfc_alb_dif_vis<COL>(4096)
236:
236: sfc_alb_dif_vis(3696)
236: 0.120901,
236: -----------------------------------------------------------------------
236: sfc_alb_dir_nir<COL>(4096)
236:
236: sfc_alb_dir_nir(3696)
236: 0.215391,
236: -----------------------------------------------------------------------
236: sfc_alb_dir_vis<COL>(4096)
236:
236: sfc_alb_dir_vis(3696)
236: 0.12062,
236: -----------------------------------------------------------------------
236: snow_depth_land<COL>(4096)
236:
236: snow_depth_land(3696)
236: 0.000192775,
236: -----------------------------------------------------------------------
236: surf_evap<COL>(4096)
236:
236: surf_evap(3696)
236: -6.77864e-06,
236: -----------------------------------------------------------------------
236: surf_lw_flux_up<COL>(4096)
236:
236: surf_lw_flux_up(3696)
236: -48.6872,
236: -----------------------------------------------------------------------
236: surf_mom_flux<COL,CMP>(4096,2)
236:
236: surf_mom_flux(3696,:)
236: -33.4984, -10.6215,
236: -----------------------------------------------------------------------
236: surf_radiative_T<COL>(4096)
236:
236: surf_radiative_T(3696)
236: -nan,
236: -----------------------------------------------------------------------
236: surf_sens_flux<COL>(4096)
236:
236: surf_sens_flux(3696)
236: 5089.09,
236: -----------------------------------------------------------------------
236: wind_speed_10m<COL>(4096)
236:
236: wind_speed_10m(3696)
236: 61.7287,
236: -----------------------------------------------------------------------
236:
236:
236: Program received signal SIGABRT: Process abort signal.
236:
236: Backtrace for this error:
/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se53-feb13/t.se53-feb13.F2010-SCREAMv1.ne256pg2_ne256pg2.pm-gpu.n096t4xX.long
Back to the ne120 case. I wanted to get a restart closer to the crash. First I restarted from the restart files in the case for 1 month to reach model date = 00010601. Then I switched to using ndays instead of nmonths as the unit and asked to run for 30 days with a restart every 10th day. It actually completed and ran further than the previous attempt, reaching model date = 00010701. Then I tried to run a few more days and it fails at model date = 00010706 with a different error:
106: WARNING:CAAR: k=127,theta(k)=75.206669<100.000000=th_thresh, applying limiter
106: Bad dphi, dp3d, or vtheta_dp; label: 'Vertical remap: Negative (or nan) layer thickness detected, aborting!'; see hommexx.errlog.172.106
106: Exiting...
106: MPICH ERROR [Rank 106] [job id 5688967.0] [Fri Feb 24 10:05:54 2023] [nid008469] - Abort(101) (rank 106 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 101) - process 106
106:
106: aborting job:
106: application called MPI_Abort(MPI_COMM_WORLD, 101) - process 106
106: Kokkos::Cuda ERROR: Failed to call Kokkos::Cuda::finalize()
106: ERROR: For the pool allocator labeled "\360\244^^W^@^@^@^@AKL's primary memory pool":
106: ERROR: Trying to free an invalid pointer
106: This means you have either already freed the pointer, or its address has been corrupted somehow.
106: terminate called after throwing an instance of 'std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >'
106:
106: Program received signal SIGABRT: Process abort signal.
106:
106: Backtrace for this error:
106: #0 0x14b77a4d4d6f in ???
106: #1 0x14b77a4d4cdb in ???
106: #2 0x14b77a4d6374 in ???
106: #3 0x14b78970b608 in _ZN9__gnu_cxx27__verbose_terminate_handlerEv
106: at ../../../../cpe-gcc-12.1.0-202208101649.1dfb26392197c/libstdc++-v3/libsupc++/vterminate.cc:95
I tried another ne120 case using the Feb 24th repo; it also fails in the same place, model date = 00010630, and is BFB with the cases noted above through the last nstep dump at nstep = 199980.
The error is:
210: [DIRK] WARNING! Newton reached max iteration count, with deltaerr = 35858.08878763050597627
210: Bad dphi, dp3d, or vtheta_dp; label: 'DIRK Newton loop np1'; see hommexx.errlog.340.210
210: Exiting...
210: MPICH ERROR [Rank 210] [job id 5707327.0] [Sat Feb 25 14:48:03 2023] [nid008449] - Abort(102) (rank 210 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 102) - process 210
210:
210: aborting job:
210: application called MPI_Abort(MPI_COMM_WORLD, 102) - process 210
210: Kokkos::Cuda ERROR: Failed to call Kokkos::Cuda::finalize()
210: ERROR: For the pool allocator labeled "P\221\350#^@^@^@^@AKL's primary memory pool":
210: ERROR: Trying to free an invalid pointer
210: This means you have either already freed the pointer, or its address has been corrupted somehow.
210: terminate called after throwing an instance of 'std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >'
label: DIRK Newton loop np1
time-level 0
lat -5.876747062738510e-01 lon 5.052728184523584e+00
ie 69 igll 3 jgll 3 lev 0: bad dphi
level dphi dp3d vtheta_dp
0 -nan 8.125531673431398e+00 1.006722367006377e+04
/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se55-feb24/f120.F2010-SCREAMv1.ne120pg2_ne120pg2.se55-feb24.gnugpu.n085a4xX.L128.coldTout
Regarding trying to make a restart file closer to the crash, I think I was tripped up again by https://github.com/E3SM-Project/scream/issues/2110 when I tried to change the units. It may be that we should disregard any CONTINUE run that was done after changing the units. I can stick with nmonths and try again.
I'm also not sure I'm allowed to change the output for a restarted run. That is, run for 6 months, then try to run another month with output changed to:
Fields:
  Physics PG2:
    Field Names:
      - ps
      - omega
      - horiz_winds
      - T_mid
      - qv
      - qc
      - qr
      - qi
      - tke
      - surf_sens_flux
      - surf_evap
      - surf_radiative_T
      - precip_liq_surf_mass
Max Snapshots Per File: 1
output_control:
  Frequency: 1
  MPI Ranks in Filename: false
  Timestamp in Filename: true
  avg_type_in_filename: true
  frequency_in_filename: true
  frequency_units: nsteps
So, just to make sure we make some progress with this, I started a new case with all of this extra output from day 1. Yes, it is way too much (~25TB), but we can hopefully learn something from it and delete what we don't need, while I also try smaller cases to verify if I can change output streams on restart.
/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se55-feb24/f120.F2010-SCREAMv1.ne120pg2_ne120pg2.se55-feb24.gnugpu.n043a4xX.L128.coldTout.extra
It fails at the exact same location, with the exact same error message (as expected).
With 43 pm-gpu nodes, the run with this extra output achieves only 75% of the performance of the previous run with "normal" output, which is maybe impressive.
@ndkeen when restarting output, the only changes you are allowed to make are:
- changes in file specs (`xyz_in_filename` or max snaps per file)
- removing some vars, meaning that the set of output vars in the restarted run is a subset of the one in the original output

Any other change is likely not going to work, possibly without even generating an error.
However, you can add new streams. The AD will still try to restart them though, so you must set
Restart:
  Perform Restart: false
in the yaml file for that stream. You can also set this for old streams, in which case you can change as many parameters as you want since, for all practical purposes, it's like a completely new stream.
I was following what @crterai had suggested, but I don't think either of us realized these restrictions. And I'm not sure I follow.
Yes, we don't have a user guide for model output (todo list!). I saw "I'm also not sure I'm allowed to change the output for a restarted run." and thought I'd add some comments.
Bottom line: a restarted run cannot change how/what it writes to file (with the exception that you can remove some vars). If you need to change it, you need to "not restart" the output by setting Perform Restart: false.
Started looking at the single-timestep output data from the ne120 simulation that @ndkeen ran, and this is what the global minimum temperature for T_mid@bot does leading up to the crash. Note that the x-axis should be 'Steps' rather than days (thanks Noel for catching this!).

The location of that global minimum T in the last timestep matches up with the lat/lon given by the error message, [-32.70037346, 291.93771758], which is just east of the Andes:

So this cold T also occurs adjacent to topography. Finally, a gif showing the animation of the bottom-level variables (T_mid, qc, qr, and qi) in the last day leading up to the crash: https://portal.nersc.gov/cfs/e3sm/terai/SCREAM/v1_analysis/ne120_Tbot_AND_q_CrashPoint.gif
And these are the dynamics fields in the bottom level for the same time period: https://portal.nersc.gov/cfs/e3sm/terai/SCREAM/v1_analysis/ne120_Tbot_AND_DYN_CrashPoint.gif The noise that starts to show up in the U and V fields towards the end of the animation looks a bit unphysical. @mt5555 or @oksanaguba, any thoughts? Note that this is all on the pg2 grid - not the dynamics grid.
Here is an attempt to plot the min T_mid per vertical level -- this is just the last few steps before the crash. I mislabeled "Level1" -- it should be "Level126".

@ndkeen can you redo that plot with only the 10 or 20 lowest levels?
Sure. Note that I mislabeled one curve above as Level1 -- all the cold T lines are happening near the ground (highest vertical levels). This would be the ~15 curves nearest the surface.
Note to parse the data, I'm using
import numpy as np
import xarray
fx = xarray.open_mfdataset(nc)                    # nc: glob/list of the single-step output files
T_mid_min = fx['T_mid'][:, :, 127]                # bottom level (index 127 of the 128 levels)
T_mid_min_timeseries = np.min(T_mid_min, axis=1)  # minimum across columns at each step
i.e., that's what I'm labelling Level127.

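For completeness, here is a minimal sketch of how the "lowest 10 or 20 levels" version of that plot could be produced with the same xarray approach; the `nc` file list is carried over from the snippet above, and the choice of 15 levels is just an example, not what was actually plotted:

```python
import matplotlib.pyplot as plt
import numpy as np
import xarray

# Sketch only: min T_mid per step for the ~15 levels nearest the surface,
# assuming the same `nc` file list and 128 vertical levels as above.
fx = xarray.open_mfdataset(nc)
nlev = fx['T_mid'].shape[-1]            # expected to be 128
for lev in range(nlev - 15, nlev):      # the ~15 lowest (near-surface) levels
    t_min = np.min(fx['T_mid'][:, :, lev], axis=1)
    plt.plot(t_min, label=f"Level{lev}")
plt.xlabel("Step")
plt.ylabel("min T_mid [K]")
plt.legend()
plt.savefig("min_Tmid_lowest_levels.png")
```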
Since Level 126 is leading the charge it seems safe to say that we need to be looking away from the lowest level to figure this out.
Agreed. I'm curious whether the U and V changes lead the T change (which could be investigated by plotting the timeseries for all 3 variables divided by their pre-coldT ave value for the cell that eventually crashed).
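A rough sketch of what that comparison could look like, reusing the xarray approach from above; `icol_crash` (the column index matching the crash lat/lon), `n_pre` (the number of pre-coldT steps used for the reference average), and the assumed (time, column, component, level) layout of horiz_winds are all placeholders/assumptions, not values taken from the run:

```python
import matplotlib.pyplot as plt
import xarray

# Sketch only: bottom-level T, U, V at the crash column, each normalized by its
# pre-coldT average. icol_crash and n_pre are placeholders; nc is the same file list as above.
fx = xarray.open_mfdataset(nc)
series = {
    "T_mid": fx["T_mid"][:, icol_crash, 127],
    "U": fx["horiz_winds"][:, icol_crash, 0, 127],
    "V": fx["horiz_winds"][:, icol_crash, 1, 127],
}
for name, ts in series.items():
    ref = ts[:n_pre].mean()             # pre-coldT average as the reference value
    plt.plot(ts / ref, label=name)
plt.xlabel("Step")
plt.ylabel("value / pre-coldT average")
plt.legend()
plt.savefig("crash_column_normalized_timeseries.png")
```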
I also find it odd that there's a clear front in V that propagates from the bottom left of the figure and the model seems to crash when this front reaches the problem area. It also looks to me (though it's hard to tell) like omega doesn't get weird - just horizontal winds.
I think the latest results remain consistent with the old hypothesis that this is a lack of vertical mixing in stable boundary layer conditions:
- We see potentially similar behavior in HS+topo, https://acme-climate.atlassian.net/wiki/spaces/NGDNA/pages/3599073307/Cold-T+like+issue+in+Held+Suarez (which has no physics - suggesting it's a dycore phenomenon not triggered by physics, but some boundary layer mixing is needed to control it)
- We see a very similar thing in EAM at NE256 (and occasionally in the RRM in the high-res region - see https://github.com/E3SM-Project/E3SM/issues/5089)
- It was controlled in EAM by increasing the background vertical diffusion in CLUBB (by reducing lmin_coef from 0.1 to 0.025)
- ~~It would be nice to try adjusting the similar SHOC parameters: decrease Scalar Ckh_s_max and Scalar Ckm_s_max from 0.1 to 0.025~~ ignore this! see @bogensch's comments below

Based on the RRM results in E3SM#5089 above, a workaround which doesn't fix the problem but prevents the run from crashing could be to increase the theta limiter min value from 100K to 200K.
> consistent with the old hypothesis that this is a lack of vertical mixing in stable boundary layer conditions
We need to be a bit careful with wording here. I'm sure that with enough vertical mixing we won't have coldT problems, but that doesn't mean that the deficiency in our simulations is vertical mixing... just that it's a convenient patch. I'd really like to understand the actual source of trouble even if we end up using artificial diffusion to fix it.
lmin_coef in CLUBB and Ckm_s_max in SHOC are two completely different parameters and NOT comparable. Reducing Ckh_s_max in SHOC will for certain lead to more frequent cold T.
@bogensch @PeterCaldwell is it really even possible to have this rate of cooling resulting from a stable layer? The temperature drop seems extreme, and at one point I think I did a rough estimate that suggested the cooling rate was much higher than the longwave cooling rate.
Or maybe I'm misunderstanding how the stable layer problem works... is longwave cooling the primary cooling mechanism in a situation where a stable layer completely shuts down turbulent mixing?
Interesting point, Walter. I think there are 2 different things going on here. You're right to ask how the layer is cooling so rapidly - is it LW cooling, evaporation/sublimation, mass flux divergence (unlikely), negative surface fluxes, or ???. Looking at the energy budget for this layer would be very useful for answering this. The fact that we're in a stable BL regime is kind of independent of the reason for cooling... it only matters in the sense that a regime with more vertical mixing wouldn't allow the surface layers to get so cold.
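For orientation when looking at that layer energy budget, the standard conversion from a net flux divergence across a layer to a heating/cooling rate (a textbook relation, not something diagnosed from this run) is

$$
\left|\frac{\partial T}{\partial t}\right| \;\approx\; \frac{g}{c_p}\,\frac{\Delta F}{\Delta p},
$$

where $\Delta F$ is the net energy flux lost (or gained) across a layer of pressure thickness $\Delta p$; for example, a net loss of 10 W m$^{-2}$ across a layer 10 hPa thick corresponds to a cooling of roughly 8 K day$^{-1}$.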
Are you sure "mass flux divergence" is unlikely? I think that's what it would be in the case of Held-Suarez forcing (although we haven't actually confirmed the HS coldT issues are the same as this one).
From our earlier results, increasing horizontal diffusion does help quite a bit. I'm against that, probably for the same reason Pete is against increasing the vertical diffusion :-) It's so isolated and localized, it would be a shame to crank up the diffusion globally (vertical or horizontal).
I recall from earlier analysis that the surface heat fluxes were acting to oppose the cooling, which makes me think that rad and micro (and maybe dynamics) are the prime suspects.
I think I agree with Walter that all the cold T's we've analyzed so far have not been driven by the surface (via surface heat fluxes) - SHF generally acted to counteract the cold air temperature. And I suspect rad is not the culprit, unless something is horribly wrong, since longwave cooling typically transfers heat from a warmer place to a cooler place, and no surface close to that area seems to be getting colder than the air itself.
So I think that microphysics or dynamics is the likely cause. In the v0 model, I believe we were seeing that there was a large amount of divergence that was causing the cooling.
I've created a height-time cross-section of the column that experiences the coldest T in the second to bottom layer. It looks like the strong cooling precedes the cloud water formation and there's a distinct convergence/divergence signal that corresponds with the cooling (from the omega field), so I'm guessing that it's the dynamics again in this case.
And associated convergence/divergence from domega/dP

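For reference, a minimal sketch of how such a height-time cross-section can be pulled from the same single-step output; `icol` is a placeholder for the crash-column index, which would first need to be located from the lat/lon in the error message, and `nc` is the same file list used earlier:

```python
import matplotlib.pyplot as plt
import xarray

# Sketch only: height-time cross-section of T_mid for one column (index icol).
fx = xarray.open_mfdataset(nc)
col = fx["T_mid"][:, icol, :]        # dims: (time, lev)
plt.pcolormesh(col.T)                # rows = level index, columns = step
plt.gca().invert_yaxis()             # put the surface (level 127) at the bottom
plt.xlabel("Step")
plt.ylabel("Level index")
plt.colorbar(label="T_mid [K]")
plt.savefig("coldT_column_cross_section.png")
```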
@crterai that's a nice result! Can you zoom in on time 40-60?
I've changed the T diff color bars to a wider color range.
I've hidden my atm sci PhD so you can't take it away from me for asking such a dumb question... I understand from the ideal gas law how decreasing pressure decreases temperature, so it makes intuitive sense that mass flux divergence would cool a layer. I can't figure out how to convert Chris' new divergence plots into a cooling rate though. It seems like $p=\rho R_d T$ means that $dT$ should be proportional to $dp$ rather than $d\omega/dp$. Somewhat equivalently, I can't figure out whether divergence means that the mass of the column above is getting smaller, or whether a cell is literally being evacuated so it becomes cold like outer space. I think sigma-p coordinates imply the former.
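For what it's worth, the textbook way to connect $\omega$ to a temperature tendency (offered only as a reference point, not as a diagnosis of this run) is the thermodynamic equation in pressure coordinates,

$$
\frac{DT}{Dt} \;=\; \frac{R_d\,T}{c_p\,p}\,\omega \;+\; \frac{\dot Q}{c_p},
$$

so the adiabatic expansion/compression term scales with $\omega$ itself rather than with $d\omega/dp$; the divergence $d\omega/dp$ enters indirectly, through continuity, by setting how $\omega$ builds up within the column. And column-integrated divergence shows up as falling surface pressure (the mass of the overlying column shrinks) rather than as a fixed cell being evacuated, consistent with the sigma-p intuition above.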
I think at this point having a breakdown of the temperature tendencies from each process would be really valuable. How much work would it take to get that for dynamics, P3, and SHOC? @AaronDonahue @bartgol
Mmm, it might take some time, but probably not too much. I don't have time to do it this week, but I can get to it on Monday. TBC, you mean adding something like
void MyAtmProc::run (double dt) {
  T_start = get_field_out("T_mid");                   // copy of T before this process runs
  run_impl(dt);                                       // the actual process step
  T_tend += (get_field_out("T_mid") - T_start) / dt;  // accumulate this process's T tendency
}
right?
I think it can be done in half a day.
@bartgol - yeah, I think your above pseudo-code is what we want to do. I was wondering whether we could add this to the AD layer rather than putting it in the interface layer for each proc (which I think is what you're doing above)... but ultimately it's fine if we add this sort of code for each process separately.
Yeah, I can add it at the infrastructure level, triggered by some runtime flags (this requires 2 extra fields for each tracked field, so it must be explicitly requested). I will get to it on Monday (hopefully).