NaN in T_2m for ne1024 on frontier
After 35 days of simulation, an ne1024 case running on Frontier crashed.
model date = 20190905
Atmosphere step = 30520
model time = 2019-09-05 07:46:40
8723: terminate called after throwing an instance of 'std::logic_error'
8723: what(): /lustre/orion/cli115/proj-shared/noel/wacmy/machines_frontier/components/eamxx/src/share/atm_process/atmosphere_process.cpp:432: FAIL:
8723: false
8723: Error! Failed post-condition property check (cannot be repaired).
8723: - Atmosphere process name: SurfaceCouplingImporter
8723: - Property check name: NaN check for field T_2m
8723: - Atmosphere process MPI Rank: 8723
8723: - Message: FieldNaNCheck failed.
8723: - field id: T_2m[Physics PG2] <double:ncol>(1536) [K]
8723: - entry (16066370)
8723: - lat/lon: (29.664181, 265.847168)
/lustre/orion/cli115/proj-shared/noel/e3sm_scratch/mf00/t.machines_frontier.F2010-SCREAMv1.ne1024pg2_ne1024pg2.frontier-scream-gpu.n2048t8x6.vth200.C.O1
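For context on the error above: this is a post-condition property check that scans a field for NaNs after the process runs and aborts if any are found. Below is a minimal sketch of that kind of check, with hypothetical names rather than the actual EAMxx atmosphere_process.cpp code:

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the post-condition FieldNaNCheck: scan a
// per-column field (here T_2m) after the process runs and abort with the
// offending local column index if any entry is NaN.
void check_field_for_nan(const std::vector<double>& field, const char* name) {
  for (std::size_t icol = 0; icol < field.size(); ++icol) {
    if (std::isnan(field[icol])) {
      std::fprintf(stderr, "FieldNaNCheck failed for %s at local column %zu\n",
                   name, icol);
      throw std::logic_error("Failed post-condition property check (cannot be repaired).");
    }
  }
}
```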
This run appears to use IC files from Chris. These runs produce many warnings about cold T at the bottom level from the very beginning:
2937: WARNING:CAAR: k=128,theta(k)=94.938653<100.000000=th_thresh, applying limiter
2937: WARNING:CAAR: k=128,theta(k)=99.903874<100.000000=th_thresh, applying limiter
2937: WARNING:CAAR: k=128,theta(k)=99.903874<100.000000=th_thresh, applying limiter
2937: WARNING:CAAR: k=128,theta(k)=98.570180<100.000000=th_thresh, applying limiter
and the number of warnings only grows from log file to log file. Is the issue with the IC, or should we try lowering the dynamics dt, tuning HV, or other diffusive mechanisms?
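For reference, these warnings come from a floor on potential temperature in the dynamics: any theta below th_thresh (100 K here) is clamped back to the threshold and reported. A minimal sketch of that limiter behavior, with hypothetical names rather than the actual HOMME/CAAR implementation:

```cpp
#include <cstdio>

// Hypothetical sketch of the theta floor implied by the CAAR warnings:
// values below th_thresh are reset to the threshold and a warning is
// printed. In the real dynamics this runs per element/level on device.
void apply_theta_limiter(double* theta, int nlev, double th_thresh = 100.0) {
  for (int k = 0; k < nlev; ++k) {
    if (theta[k] < th_thresh) {
      std::printf("WARNING:CAAR: k=%d,theta(k)=%f<%f=th_thresh, applying limiter\n",
                  k + 1, theta[k], th_thresh);
      theta[k] = th_thresh;
    }
  }
}
```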
In a run without the cess/dyamond IC changes, I only see these warnings over 9 days:
/ccs/home/onguba/eff/fgpu-build-june13b-t1-r16384/run/e3sm.log.1354477.230617-104655.gz: 6509: WARNING: Tl1_1 has 1 values <= allowable value. Resetting to minimum value.
/ccs/home/onguba/eff/fgpu-build-june13b-t1-r16384/run/e3sm.log.1355254.230617-170829.gz:14171: WARNING: Tl1_1 has 1 values <= allowable value. Resetting to minimum value.
/ccs/home/onguba/eff/fgpu-build-june13b-t1-r16384/run/e3sm.log.1355254.230617-170829.gz: 2269: WARNING: Tl1_1 has 1 values <= allowable value. Resetting to minimum value.
/ccs/home/onguba/eff/fgpu-build-june13b-t1-r16384/run/e3sm.log.1355254.230617-170829.gz:14407: WARNING: BalanceCheck: soil balance error (W/m2)
I was running a separate case (to test restarts and general stability) that differed only in the optimization level used in two files. That case failed in the same way, so I'd guess the two runs are likely BFB.
/lustre/orion/cli115/proj-shared/noel/e3sm_scratch/mf00/t.machines_frontier.F2010-SCREAMv1.ne1024pg2_ne1024pg2.frontier-scream-gpu.n2048t8x6.vth200.C
They fail at the same step, with the same check in the error message, and on the same MPI rank.
Atmosphere step = 30520
8723: Error! Failed post-condition property check (cannot be repaired).
8723: - Atmosphere process name: SurfaceCouplingImporter
8723: - Property check name: NaN check for field T_2m
8723: - Atmosphere process MPI Rank: 8723
8723: - Message: FieldNaNCheck failed.
8723: - field id: T_2m[Physics PG2] <double:ncol>(1536) [K]
8723: - entry (16066370)
8723: - lat/lon: (29.664181, 265.847168)
8723:
8723: *************************** INPUT FIELDS ******************************
8723:
8723: ------- INPUT FIELDS -------
8723:
8723: ************************** OUTPUT FIELDS ******************************
8723: T_2m<ncol>(1536)
8723:
8723: T_2m(225)
8723: nan,
8723: -----------------------------------------------------------------------
8723: landfrac<ncol>(1536)
8723:
8723: landfrac(225)
8723: 0.998098,
8723: -----------------------------------------------------------------------
8723: ocnfrac<ncol>(1536)
8723:
8723: ocnfrac(225)
8723: 0.00190223,
8723: -----------------------------------------------------------------------
Something is definitely off with this simulation. Here's T_2m (note the color axis):
I think it's worth running 2-3 day tests with frequent output and checking what's going on; maybe we can try at ne256.
Noel shared with me the ne256 simulation and it looks good:
/lustre/orion/cli115/proj-shared/noel/e3sm_scratch/mf00/t.machines_frontier.F2010-SCREAMv1.ne256pg2_ne256pg2.frontier-scream-gpu.n0384t8x6.vth200.Blong/run/output.scream.monthly.AVERAGE.nmonths_x1.2019-08-01-00000.nc
Thinking about the differences: the ne1024 run had monthly output with frequent restarts in between. It could be an issue with the output working correctly rather than with the model state itself; will need to run tests.
It's interesting that the spatial distribution between ne256 and ne1024 is ~identical. It's just the colorbar that is different. This makes me think that (as Chris surmises) the issue is that the accumulated sum is not being divided by the number of samples before writing output...
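If that hypothesis is right, the written field would be the raw accumulated sum: the spatial pattern stays correct and only the magnitude (colorbar) is off by the number of samples. Here is a minimal sketch of the accumulate-then-normalize logic in question, using hypothetical names rather than the actual EAMxx output manager:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical time-average accumulator for one output field. The spatial
// pattern of 'sum' is always right; skipping the final division by
// 'nsamples' only rescales the values, which matches what is seen in the
// ne1024 monthly output.
struct AvgAccumulator {
  std::vector<double> sum;
  int nsamples = 0;

  void add_sample(const std::vector<double>& field) {
    if (sum.empty()) sum.assign(field.size(), 0.0);
    for (std::size_t i = 0; i < field.size(); ++i) sum[i] += field[i];
    ++nsamples;
  }

  // Called when the averaging window closes, right before writing output.
  std::vector<double> finalize() {
    std::vector<double> avg(sum.size());
    for (std::size_t i = 0; i < sum.size(); ++i) avg[i] = sum[i] / nsamples;
    sum.assign(sum.size(), 0.0);
    nsamples = 0;
    return avg;
  }
};
```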
@crterai are you running off of a recent master? There was an issue a while ago regarding restart of accumulated quantities, but it got fixed a few weeks ago. So unless your repo is quite old, I would not expect this.
If it is, in fact, a matter of restarts, a simple ne4 case should reproduce the same problem. Could you paste here the details of the 1024 run that had frequent restarts? I can try to dig a bit to see if there are bugs in the restart logic.
From what @oksanaguba said, the branch we are using here
machines/frontier
was based off a scream repo of May 19th.
Ah, yes! The PR that fixed the accumulation bug went in on May 25th, so this makes sense. If merging master is not doable, then the workaround is to use a restart frequency that coincides with the averaging window size. If the restart happens on a model output step, the May 19th repo should still give the correct average.
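Why the workaround works, sketched under the same assumptions as the hypothetical accumulator above: the partial sum and the sample count are the only extra state an averaged output carries across a restart, and both are zero exactly when an averaging window closes.

```cpp
#include <vector>

// Hypothetical partial-average state that would need to survive a restart.
// The pre-May-25 bug (as described above) concerns restarts that land in
// the middle of an averaging window, where this partial state exists and
// can be lost or mis-restored.
struct PartialAverage {
  std::vector<double> sum;  // running sum since the window opened
  int nsamples = 0;         // samples accumulated since the window opened
};

// If restarts are written only when an averaging window closes, the
// accumulator has just been reset: sum is empty and nsamples is zero, so
// there is no partial state for the restart logic to get wrong and the
// written averages stay correct even without the fix.
bool restart_is_safe_without_fix(const PartialAverage& p) {
  return p.nsamples == 0;  // true exactly at a window boundary
}
```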
Okay, the daily output from day 2 for this case (with new high rez SST file) looks reasonable for T_2m:
/lustre/orion/cli115/proj-shared/noel/e3sm_scratch/maf-jun19/t.maf-jun19.F2010-SCREAMv1.ne1024pg2_ne1024pg2.frontier-scream-gpu.n2048t8x6.vth200.SSTocean.od/run/output.scream.daily.AVERAGE.ndays_x1.2019-08-02-00000.nc
This doesn't close out the NaN issue that Noel ran into, but it does confirm that our runs look reasonable.
With the updated repo, which we know at least corrects the issue with monthly output, I have run ne1024 to 71 days so far. Not sure that proves we no longer see the NaN noted above, and not sure there is interest in trying to figure out whether the longer run is actually due to the repo changes or not.