NaN in T_2m for ne1024 on frontier
After 35 days of simulation, an ne1024 case running on Frontier crashed.
model date = 20190905
Atmosphere step = 30520
model time = 2019-09-05 07:46:40
8723: terminate called after throwing an instance of 'std::logic_error'
8723: what(): /lustre/orion/cli115/proj-shared/noel/wacmy/machines_frontier/components/eamxx/src/share/atm_process/atmosphere_process.cpp:432: FAIL:
8723: false
8723: Error! Failed post-condition property check (cannot be repaired).
8723: - Atmosphere process name: SurfaceCouplingImporter
8723: - Property check name: NaN check for field T_2m
8723: - Atmosphere process MPI Rank: 8723
8723: - Message: FieldNaNCheck failed.
8723: - field id: T_2m[Physics PG2] <double:ncol>(1536) [K]
8723: - entry (16066370)
8723: - lat/lon: (29.664181, 265.847168)
/lustre/orion/cli115/proj-shared/noel/e3sm_scratch/mf00/t.machines_frontier.F2010-SCREAMv1.ne1024pg2_ne1024pg2.frontier-scream-gpu.n2048t8x6.vth200.C.O1
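For context on the error above: this is a post-condition property check that scans a field for NaNs after the process runs and aborts if any are found. Below is a minimal sketch of that kind of check, with hypothetical names rather than the actual EAMxx atmosphere_process.cpp code:

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the post-condition FieldNaNCheck: scan a
// per-column field (here T_2m) after the process runs and abort with the
// offending local column index if any entry is NaN.
void check_field_for_nan(const std::vector<double>& field, const char* name) {
  for (std::size_t icol = 0; icol < field.size(); ++icol) {
    if (std::isnan(field[icol])) {
      std::fprintf(stderr, "FieldNaNCheck failed for %s at local column %zu\n",
                   name, icol);
      throw std::logic_error("Failed post-condition property check (cannot be repaired).");
    }
  }
}
```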
This run appears to use IC files from Chris. These runs produce many warnings about cold T at the bottom level from the very beginning:
2937: WARNING:CAAR: k=128,theta(k)=94.938653<100.000000=th_thresh, applying limiter
2937: WARNING:CAAR: k=128,theta(k)=99.903874<100.000000=th_thresh, applying limiter
2937: WARNING:CAAR: k=128,theta(k)=99.903874<100.000000=th_thresh, applying limiter
2937: WARNING:CAAR: k=128,theta(k)=98.570180<100.000000=th_thresh, applying limiter
and the number of warnings only grows from log file to log file. Is the issue with the IC, or should we try lowering the dynamics dt, tuning HV, or other diffusive mechanisms?
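For reference, these warnings come from a floor on potential temperature in the dynamics: any theta below th_thresh (100 K here) is clamped back to the threshold and reported. A minimal sketch of that limiter behavior, with hypothetical names rather than the actual HOMME/CAAR implementation:

```cpp
#include <cstdio>

// Hypothetical sketch of the theta floor implied by the CAAR warnings:
// values below th_thresh are reset to the threshold and a warning is
// printed. In the real dynamics this runs per element/level on device.
void apply_theta_limiter(double* theta, int nlev, double th_thresh = 100.0) {
  for (int k = 0; k < nlev; ++k) {
    if (theta[k] < th_thresh) {
      std::printf("WARNING:CAAR: k=%d,theta(k)=%f<%f=th_thresh, applying limiter\n",
                  k + 1, theta[k], th_thresh);
      theta[k] = th_thresh;
    }
  }
}
```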
In a run without the cess/dyamond IC changes, I only see these warnings over 9 days:
/ccs/home/onguba/eff/fgpu-build-june13b-t1-r16384/run/e3sm.log.1354477.230617-104655.gz: 6509: WARNING: Tl1_1 has 1 values <= allowable value. Resetting to minimum value.
/ccs/home/onguba/eff/fgpu-build-june13b-t1-r16384/run/e3sm.log.1355254.230617-170829.gz:14171: WARNING: Tl1_1 has 1 values <= allowable value. Resetting to minimum value.
/ccs/home/onguba/eff/fgpu-build-june13b-t1-r16384/run/e3sm.log.1355254.230617-170829.gz: 2269: WARNING: Tl1_1 has 1 values <= allowable value. Resetting to minimum value.
/ccs/home/onguba/eff/fgpu-build-june13b-t1-r16384/run/e3sm.log.1355254.230617-170829.gz:14407: WARNING: BalanceCheck: soil balance error (W/m2)
I was running a separate case (to test restarts and general stability) that differed only in the optimization level used in two files. That case failed in the same way, so I'd guess the two runs are likely BFB.
/lustre/orion/cli115/proj-shared/noel/e3sm_scratch/mf00/t.machines_frontier.F2010-SCREAMv1.ne1024pg2_ne1024pg2.frontier-scream-gpu.n2048t8x6.vth200.C
They fail at the same step, with the same check in the error message, and on the same MPI rank.
Atmosphere step = 30520
8723: Error! Failed post-condition property check (cannot be repaired).
8723: - Atmosphere process name: SurfaceCouplingImporter
8723: - Property check name: NaN check for field T_2m
8723: - Atmosphere process MPI Rank: 8723
8723: - Message: FieldNaNCheck failed.
8723: - field id: T_2m[Physics PG2] <double:ncol>(1536) [K]
8723: - entry (16066370)
8723: - lat/lon: (29.664181, 265.847168)
8723:
8723: *************************** INPUT FIELDS ******************************
8723:
8723: ------- INPUT FIELDS -------
8723:
8723: ************************** OUTPUT FIELDS ******************************
8723: T_2m<ncol>(1536)
8723:
8723: T_2m(225)
8723: nan,
8723: -----------------------------------------------------------------------
8723: landfrac<ncol>(1536)
8723:
8723: landfrac(225)
8723: 0.998098,
8723: -----------------------------------------------------------------------
8723: ocnfrac<ncol>(1536)
8723:
8723: ocnfrac(225)
8723: 0.00190223,
8723: -----------------------------------------------------------------------
Something is definitely off with this simulation. Here's T_2m (note the color axis):
I think it's worth running 2-3 day tests with frequent output and checking what's going on; maybe we can try at ne256.
Noel shared with me the ne256 simulation and it looks good:
/lustre/orion/cli115/proj-shared/noel/e3sm_scratch/mf00/t.machines_frontier.F2010-SCREAMv1.ne256pg2_ne256pg2.frontier-scream-gpu.n0384t8x6.vth200.Blong/run/output.scream.monthly.AVERAGE.nmonths_x1.2019-08-01-00000.nc
Thinking about the differences: the ne1024 run had monthly output with frequent restarts in between. It could be an issue with the output working correctly rather than with the model state itself; will need to run tests.
It's interesting that the spatial distribution between ne256 and ne1024 is ~identical. It's just the colorbar that is different. This makes me think that (as Chris surmises) the issue is that the accumulated sum is not being divided by the number of samples before writing output...
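If that hypothesis is right, the written field would be the raw accumulated sum: the spatial pattern stays correct and only the magnitude (colorbar) is off by the number of samples. Here is a minimal sketch of the accumulate-then-normalize logic in question, using hypothetical names rather than the actual EAMxx output manager:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical time-average accumulator for one output field. The spatial
// pattern of 'sum' is always right; skipping the final division by
// 'nsamples' only rescales the values, which matches what is seen in the
// ne1024 monthly output.
struct AvgAccumulator {
  std::vector<double> sum;
  int nsamples = 0;

  void add_sample(const std::vector<double>& field) {
    if (sum.empty()) sum.assign(field.size(), 0.0);
    for (std::size_t i = 0; i < field.size(); ++i) sum[i] += field[i];
    ++nsamples;
  }

  // Called when the averaging window closes, right before writing output.
  std::vector<double> finalize() {
    std::vector<double> avg(sum.size());
    for (std::size_t i = 0; i < sum.size(); ++i) avg[i] = sum[i] / nsamples;
    sum.assign(sum.size(), 0.0);
    nsamples = 0;
    return avg;
  }
};
```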
@crterai are you running off of a recent master? There was an issue a while ago regarding restart of accumulated quantities, but it got fixed a few weeks ago. So unless your repo is quite old, I would not expect this.
If it is, in fact, a matter of restarts, a simple ne4 case should reproduce the same problem. Could you paste here the details of the 1024 run that had frequent restarts? I can try to dig a bit to see if there are bugs in the restart logic.
From what @oksanaguba said, the branch we are using here
machines/frontier
was based off a scream repo of May 19th.
Ah, yes! The PR that fixed the accumulation bug went in on May 25th, so this makes sense. If merging master is not doable, then the workaround is to use a restart frequency that coincides with the averaging window size. If the restart happens on a model output step, the May 19th repo should still give the correct average.
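Why the workaround works, sketched under the same assumptions as the hypothetical accumulator above: the partial sum and the sample count are the only extra state an averaged output carries across a restart, and both are zero exactly when an averaging window closes.

```cpp
#include <vector>

// Hypothetical partial-average state that would need to survive a restart.
// The pre-May-25 bug (as described above) concerns restarts that land in
// the middle of an averaging window, where this partial state exists and
// can be lost or mis-restored.
struct PartialAverage {
  std::vector<double> sum;  // running sum since the window opened
  int nsamples = 0;         // samples accumulated since the window opened
};

// If restarts are written only when an averaging window closes, the
// accumulator has just been reset: sum is empty and nsamples is zero, so
// there is no partial state for the restart logic to get wrong and the
// written averages stay correct even without the fix.
bool restart_is_safe_without_fix(const PartialAverage& p) {
  return p.nsamples == 0;  // true exactly at a window boundary
}
```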
Okay, the daily output from day 2 for this case (with new high rez SST file) looks reasonable for T_2m:
/lustre/orion/cli115/proj-shared/noel/e3sm_scratch/maf-jun19/t.maf-jun19.F2010-SCREAMv1.ne1024pg2_ne1024pg2.frontier-scream-gpu.n2048t8x6.vth200.SSTocean.od/run/output.scream.daily.AVERAGE.ndays_x1.2019-08-02-00000.nc
This doesn't close out the NaN issue that Noel ran into, but it does confirm that our runs look reasonable.
With the updated repo, which we know at least corrects the issue with monthly output, I have run ne1024 to 71 days so far. Not sure that proves we no longer see the NaN noted above, and not sure there is interest in trying to figure out whether the longer run is actually due to the repo changes or not.