Negative (or nan) layer thickness detected with ne256 Cess test on pm-gpu
With scream master of Sep 12th, I see an error with a ne256 Cess-like test on pm-gpu.
I have already reproduced the error with a different case, and it fails in the same way (the hashes are also the same).
The failure occurs after model date = 20190929.
I ran 1 month, then restarted, and the run fails during the restart.
170: WARNING:CAAR: dp3d too small. k=128, dp3d(k)=35.733449, dp0=300.714111
170: Negative (or nan) layer thickness detected, aborting!
170: Exiting...
170: MPICH ERROR [Rank 170] [job id 15852337.0] [Tue Sep 19 11:07:37 2023] [nid003488] - Abort(101) (rank 170 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 101) - process 170
170:
170: aborting job:
170: application called MPI_Abort(MPI_COMM_WORLD, 101) - process 170
170: Kokkos::Cuda ERROR: Failed to call Kokkos::Cuda::finalize()
/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se75-sep12/cess-v1-cntl.ne256pg2_ne256pg2.F2010-SCREAMv1.se75-sep12.n0048t4x111XX1.tb.nofru.long
Another issue here is that the job is hanging after the error.
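For anyone less familiar with the dycore output, here is a minimal Python sketch of the kind of check the log above reports. It is illustrative only: the warning fraction and the structure are assumptions, not the actual Homme/CAAR code.

```python
import math

def check_layer_thickness(dp3d, dp0, warn_frac=0.125):
    """Illustrative only; warn_frac is an assumed threshold, not Homme's.

    Warn when a layer's pressure thickness dp3d falls well below its
    reference thickness dp0; abort when it is negative or NaN, mirroring
    the 'Negative (or nan) layer thickness detected' message above."""
    for k, (dp, ref) in enumerate(zip(dp3d, dp0), start=1):
        if math.isnan(dp) or dp <= 0.0:
            raise RuntimeError("Negative (or nan) layer thickness detected, aborting!")
        if dp < warn_frac * ref:
            print(f"WARNING:CAAR: dp3d too small. k={k}, dp3d(k)={dp:f}, dp0={ref:f}")
```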
I remember that some of our issues have stemmed from too long a SHOC timestep. I see that in the ne256 setup, our dtime = 600 sec and our macmic is 3, which sets the SHOC timestep to 200 sec. @bogensch - can you remind us how long the SHOC timestep is allowed to be?
I would advocate making sure that the SHOC timestep is never greater than 150 s in any resolution configuration, since that seems to work well for long ne30 integrations. While 200 s may be okay, I'm not 100% sure about that, since we've never tested it in a long simulation.
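To make the arithmetic explicit, here is a small Python sketch of the relationship described above (SHOC timestep = dtime / macmic subcycles); the helper names are just for illustration:

```python
import math

def shoc_dt(dtime_s, macmic_subcycles):
    """SHOC timestep implied by the atmosphere timestep and mac_aero_mic subcycling."""
    return dtime_s / macmic_subcycles

def min_subcycles(dtime_s, dt_cap_s=150.0):
    """Smallest subcycle count keeping the SHOC timestep at or below dt_cap_s."""
    return math.ceil(dtime_s / dt_cap_s)

print(shoc_dt(600, 3))      # 200.0 s -- the current ne256 setting
print(min_subcycles(600))   # 4 -> SHOC timestep of 150 s
```

So if dtime stays at 600 s, keeping the SHOC timestep at or below 150 s would mean raising macmic from 3 to 4.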
Note that we ran 1 year with ne256 on pm-gpu (both control and plus 4K). That was a March 27th checkout. I don't think we've changed the macmic in a while.
/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se62-mar27/t.se62-mar27.F2010-SCREAMv1.ne256pg2_ne256pg2.pm-gpu.n096t4xX.L128.vth200
/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se62-mar27/t.se62-mar27.F2010-SCREAMv1.ne256pg2_ne256pg2.pm-gpu.n096t4xX.L128.vth200.plus4k
I verified that in this current ne256 case, as well as those noted above, macmic is the same at 3.
login18% ./atmquery atmosphere_processes::physics::mac_aero_mic::number_of_subcycles
mac_aero_mic::number_of_subcycles: 3
@mahf708 says he just ran 2 months of ne256 on pm-gpu. Though his run was not Cess-style, I assume it used a recent scream repo.
I keep the hash in my case names 😉 6bb3639
Vanilla F2010-SCREAMv1 with light IO (7 or so 3-hourly 2D variables). I don't recall anything special except 20230522.I2010CRUELM.ne256pg2.elm.r.2013-08-01-00000.nc for land. I mention land because I ran into a lot of trouble with land stuff in EAMf90 ne120pg2 in the past, with "mysterious" errors like the above...
On Frontier, using tcclevenger/simulations/cess-production-cherry-pick-merges (with team barrier), I'm able to run ne256 longer; it is currently at model date = 20191113.
/lustre/orion/cli115/proj-shared/noel/e3sm_scratch/tcessr-sep15/cess-v1-cntl.ne256pg2_ne256pg2.F2010-SCREAMv1.tcessr-sep15.n0096t8x111661.tb.noSC.newy