
Bad dphi in ne256 eamxx runs

amametjanov opened this issue 2 months ago

In a 6-month run of --compset F2010-SCREAMv1 --res ne256pg2_ne256pg2 --machine pm-gpu --compiler gnugpu (script below), jobs are running into errors like

$ tail e3sm.log*
219: Bad dphi, dp3d, or vtheta_dp; label: 'DIRK Newton loop nm1'; see hommexx.errlog.256.219
...

$ head hommexx.errlog.256.219
label: DIRK Newton loop nm1
time-level 0            
lat -2.673614897152932e-01 lon  2.998770420815903e+00
ie 936 igll 3 jgll 2 lev 0: bad dphi
level                   dphi                   dp3d              vtheta_dp
    0                   -nan  6.439032159248143e+01  8.787763796686809e+04
...

after ~2 months at YYYYMMDD 20180214. Run-dir:

/pscratch/sd/a/azamat/e3sm_scratch/pm-gpu/bench/ppe/ne256pg2_ne256pg2.F2010-SCREAMv1.pm-gpu_gnugpu.20251027.ppe.n64.t2/run/

Run-script: run.ne256pg2_ne256pg2.F2010-SCREAMv1.sh
Yaml inputs:

A similar error occurs on --machine aurora --compiler oneapi-ifxgpu:

$ tail e3sm.log*
x4315c4s2b0n0.hsn.cm.aurora.alcf.anl.gov 669: WARNING: Tl1_1 has 1 values <= allowable value.  Resetting to minimum value.
x4314c4s3b0n0.hsn.cm.aurora.alcf.anl.gov 0: bfbhash>           8172 e2675347aabc7a9e (Hommexx)
x4315c4s2b0n0.hsn.cm.aurora.alcf.anl.gov 669: Bad dphi, dp3d, or vtheta_dp; label: 'CaarFunctorImpl::run TagPreExchange'; see hommexx.errlog.768.669
Exiting...

$  head hommexx.errlog.768.669 
label: CaarFunctorImpl::run TagPreExchange
time-level 1
lat -2.469408023496295e-01 lon  2.399145952253143e+00
ie 166 igll 1 jgll 3 lev 121: bad dphi
level                   dphi                   dp3d              vtheta_dp
    0 -1.750491872851586e+04  6.500425338745119e+01  8.627967553816583e+04
...
  120 -9.885440405719305e+01  3.290536127386427e+02  8.899586850071557e+04
  121  4.141349216841650e+02  3.199318230615362e+02  7.667957395378580e+04
  122 -2.574021162628712e+02  3.177201924738347e+02  6.866978494951430e+04
...

Run-dir:

/lus/flare/projects/E3SM_Dec/azamatm/scratch/profiling/ppe/20251028/ne256pg2_ne256pg2.F2010-SCREAMv1.aurora_oneapi-ifxgpu.20251028.ppe.n64/run/
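
A minimal shell sketch for triaging the per-rank error logs, based only on the hommexx.errlog format shown above (the run directory here is the aurora one just listed), to check whether the failures cluster at a single column:

    cd /lus/flare/projects/E3SM_Dec/azamatm/scratch/profiling/ppe/20251028/ne256pg2_ne256pg2.F2010-SCREAMv1.aurora_oneapi-ifxgpu.20251028.ppe.n64/run/
    # which element / GLL point / level each rank flagged
    grep -H "bad dphi" hommexx.errlog.*
    # lat/lon of the offending columns, to see if the failures cluster
    grep -H "^lat " hommexx.errlog.*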

amametjanov · Nov 04 '25 18:11

Thanks, @amametjanov. It's curious that the crashes occur so close to each other in simulated time on the two machines: 20180214 on pm-gpu and 20180221 on aurora.

For context, this is one simulation out of a PPE (perturbed-parameter ensemble). I wonder if any of the following parameter tunings are at the edge of their bounds. @hassanbeydoun, any thoughts?

    ./xmlchange --append SCREAM_ATMCHANGE_BUFFER=shoc::thl2tune=0.3794446684347499-ATMCHANGE_SEP-
    ./xmlchange --append SCREAM_ATMCHANGE_BUFFER=shoc::qw2tune=2.596699622837714-ATMCHANGE_SEP-
    ./xmlchange --append SCREAM_ATMCHANGE_BUFFER=shoc::length_fac=6.2156245335659275-ATMCHANGE_SEP-
    ./xmlchange --append SCREAM_ATMCHANGE_BUFFER=shoc::c_diag_3rd_mom=2.478400388025101-ATMCHANGE_SEP-
    ./xmlchange --append SCREAM_ATMCHANGE_BUFFER=shoc::coeff_kh=0.42862425125109477-ATMCHANGE_SEP-
    ./xmlchange --append SCREAM_ATMCHANGE_BUFFER=shoc::coeff_km=0.011762589755682586-ATMCHANGE_SEP-
    ./xmlchange --append SCREAM_ATMCHANGE_BUFFER=shoc::lambda_low=0.008778367749106107-ATMCHANGE_SEP-
    ./xmlchange --append SCREAM_ATMCHANGE_BUFFER=shoc::lambda_high=0.02437801631212044-ATMCHANGE_SEP-
    ./xmlchange --append SCREAM_ATMCHANGE_BUFFER=p3::spa_ccn_to_nc_factor=1545.3138938772083-ATMCHANGE_SEP-
    ./xmlchange --append SCREAM_ATMCHANGE_BUFFER=p3::cldliq_to_ice_collection_factor=0.8753697369370859-ATMCHANGE_SEP-
    ./xmlchange --append SCREAM_ATMCHANGE_BUFFER=p3::rain_to_ice_collection_factor=0.29557141319605734-ATMCHANGE_SEP-
    ./xmlchange --append SCREAM_ATMCHANGE_BUFFER=p3::accretion_prefactor=0.07794159645036809-ATMCHANGE_SEP-
    ./xmlchange --append SCREAM_ATMCHANGE_BUFFER=p3::deposition_nucleation_exponent=0.2655509297949672-ATMCHANGE_SEP-
    ./xmlchange --append SCREAM_ATMCHANGE_BUFFER=p3::max_total_ni=7479060.412486526-ATMCHANGE_SEP-
    ./xmlchange --append SCREAM_ATMCHANGE_BUFFER=p3::ice_sedimentation_factor=1.1654590352430785-ATMCHANGE_SEP-
    ./xmlchange --append SCREAM_ATMCHANGE_BUFFER=p3::rain_selfcollection_breakup_diameter=5.033742437151411e-05-ATMCHANGE_SEP-

crterai · Nov 04 '25 20:11

@crterai, yes, that accretion prefactor is very low, and we have seen instabilities occur at that lower end.

hassanbeydoun · Nov 04 '25 20:11

Thanks for checking, @hassanbeydoun. @amametjanov - given that this might be an outlier case that pushes the physics too far toward an extreme, it may be worth checking whether other parameter sets run more successfully before we pour more analysis into this case. If I remember correctly, even in the short couple-day PPE tests with SCREAM ne1024, some of the PPE members crashed.

crterai · Nov 04 '25 22:11

Tried an ensemble run of 32 members:

name YYYYMMDD SYPD_or_error-msg
2025-EACB.ne256.NN_64.nyr_1.006f595846ba 20180426 Vertical remap: Negative (or nan) layer thickness detected, aborting!
2025-EACB.ne256.NN_64.nyr_1.00d7d8b8bf2e 20180106 DIRK Newton loop np1 dphi nan
2025-EACB.ne256.NN_64.nyr_1.013df5972706 20180130 Vertical remap: Negative (or nan) layer thickness detected, aborting!
2025-EACB.ne256.NN_64.nyr_1.028d726bb46a 20180130 Vertical remap: Negative (or nan) layer thickness detected, aborting!
2025-EACB.ne256.NN_64.nyr_1.0291e5ca5eb4 20180105 no error msg
2025-EACB.ne256.NN_64.nyr_1.029750c8f925 20180105 no error msg
2025-EACB.ne256.NN_64.nyr_1.0418ddf9b14e 20180114 core-dump
2025-EACB.ne256.NN_64.nyr_1.064136e5fd7d 20180131 Vertical remap: Negative (or nan) layer thickness detected, aborting!
2025-EACB.ne256.NN_64.nyr_1.077c3d311340 20180512 DIRK Newton loop np1 dphi nan, CaarFunctorImpl::run TagPreExchange 
2025-EACB.ne256.NN_64.nyr_1.08901fd69437 20180105 DIRK Newton loop np1 dphi nan
2025-EACB.ne256.NN_64.nyr_1.096a8f627f33 20180201 Vertical remap: Negative (or nan) layer thickness detected, aborting! 
2025-EACB.ne256.NN_64.nyr_1.097239327373 20180603 no error msg
2025-EACB.ne256.NN_64.nyr_1.0a87e09343a9 20180122 core-dump
2025-EACB.ne256.NN_64.nyr_1.0ae27d3953ae 20180220 CaarFunctorImpl::run TagPreExchange bad dphi
2025-EACB.ne256.NN_64.nyr_1.0afeeaf5f3bc 20180113 Vertical remap: Negative (or nan) layer thickness detected, aborting!
2025-EACB.ne256.NN_64.nyr_1.0b03a9edfc12 20180202 CaarFunctorImpl::run TagPreExchange bad dphi
2025-EACB.ne256.NN_64.nyr_1.0b5c3dbcd13c 20180205 CaarFunctorImpl::run TagPreExchange bad dphi
2025-EACB.ne256.NN_64.nyr_1.0dbb60f76219 20180211 core-dump
2025-EACB.ne256.NN_64.nyr_1.0fe3f0d5e80d 20180106 Vertical remap: Negative (or nan) layer thickness detected, aborting!
2025-EACB.ne256.NN_64.nyr_1.11111947b6cd 20180508 Vertical remap: Negative (or nan) layer thickness detected, aborting!
2025-EACB.ne256.NN_64.nyr_1.11dac763512f 20180104 DIRK Newton loop np1 dphi nan 
2025-EACB.ne256.NN_64.nyr_1.137b7b4d9f81 20180225 Vertical remap: Negative (or nan) layer thickness detected, aborting!
2025-EACB.ne256.NN_64.nyr_1.139c2b35115b 20180526 Vertical remap: Negative (or nan) layer thickness detected, aborting!
2025-EACB.ne256.NN_64.nyr_1.15d1bda56c9b 20180103 Vertical remap: Negative (or nan) layer thickness detected, aborting!
2025-EACB.ne256.NN_64.nyr_1.177ed6265d8e 20180221 Vertical remap: Negative (or nan) layer thickness detected, aborting!
2025-EACB.ne256.NN_64.nyr_1.182cbb0a4d6a 20180321 core-dump
2025-EACB.ne256.NN_64.nyr_1.19c096a053d1 20180104 Vertical remap: Negative (or nan) layer thickness detected, aborting!
2025-EACB.ne256.NN_64.nyr_1.1bf03b5b6a2e 20171228 core-dump
2025-EACB.ne256.NN_64.nyr_1.0c66edae2dc6 20180628 1.275 SYPD
2025-EACB.ne256.NN_64.nyr_1.0da27fa6fad6 20180628 1.274 SYPD
2025-EACB.ne256.NN_64.nyr_1.12143e64c853 20180628 1.264 SYPD
2025-EACB.ne256.NN_64.nyr_1.153fee23cda2 20180628 1.266 SYPD
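
A sketch of how a first-failure summary like the table above can be collected, assuming each member's run directory sits at casedirs/<member>/run under the ppe-20251106 tree (hypothetical layout; adjust the glob and paths to the actual production tree):

    for d in /lus/flare/projects/E3SM_Dec/prod/ppe-20251106/casedirs/2025-EACB.ne256.NN_64.nyr_1.*; do
      name=$(basename "$d")
      # first fatal message, if any, from the member's atmosphere log
      msg=$(grep -h -E "Bad dphi|Negative \(or nan\) layer thickness" "$d"/run/e3sm.log* 2>/dev/null | head -1)
      printf '%-45s %s\n' "$name" "${msg:-no error msg}"
    done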

An ensemble member that ran to completion:

azamatm@aurora-uan-0011:/lus/flare/projects/E3SM_Dec/prod/ppe-20251106
> grep "SCREAM_ATMCHANGE_BUFFER" casedirs/2025-EACB.ne256.NN_64.nyr_1.0c66edae2dc6/case_scripts/replay.sh 
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=shoc::thl2tune=8.142363893597501-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=shoc::qw2tune=0.7203055946442998-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=shoc::length_fac=9.454610443074653-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=shoc::c_diag_3rd_mom=0.35187920684421303-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=shoc::coeff_kh=0.41983399792517645-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=shoc::coeff_km=0.10354116412257443-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=shoc::lambda_low=0.05999469854943132-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=shoc::lambda_high=0.0662622411232151-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=p3::spa_ccn_to_nc_factor=3416.3319421827678-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=p3::cldliq_to_ice_collection_factor=0.20880521705344768-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=p3::rain_to_ice_collection_factor=0.5566389049553738-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=p3::accretion_prefactor=20.253128542391128-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=p3::deposition_nucleation_exponent=0.2747254103168517-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=p3::max_total_ni=1579626.4391110286-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=p3::ice_sedimentation_factor=0.14299476941837053-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=p3::rain_selfcollection_breakup_diameter=0.00041542564146721943-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=p3::autoconversion_qc_exponent=97.23123790976979-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=p3::autoconversion_prefactor=3.4420577372416474-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=p3::autoconversion_radius=3.4570150321516136e-05-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=scorpio::output_yaml_files=1ma_ne30pg2.yaml\,51hi.yaml\,3ha_ne30pg2.yaml-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=initial_conditions::filename=/lus/flare/projects/E3SM_Dec/whannah/HICCUP/HICCUP.atm_era5.2017-12-27.ne256np4.L128.nc-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=physics::atm_procs_list=mac_aero_mic\,rrtmgp\,cosp-ATMCHANGE_SEP-

This member aborted at 20180426 with Vertical remap: Negative (or nan) layer thickness detected:

azamatm@aurora-uan-0011:/lus/flare/projects/E3SM_Dec/prod/ppe-20251106
> grep "SCREAM_ATMCHANGE_BUFFER" casedirs/2025-EACB.ne256.NN_64.nyr_1.028d726bb46a/case_scripts/replay.sh 
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=shoc::thl2tune=4.046055426059314-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=shoc::qw2tune=8.05206237506538-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=shoc::length_fac=3.3319987910219147-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=shoc::c_diag_3rd_mom=0.43936624048421874-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=shoc::coeff_kh=0.20349265208905393-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=shoc::coeff_km=0.03603674218305022-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=shoc::lambda_low=0.05923623056445737-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=shoc::lambda_high=0.06401434388801919-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=p3::spa_ccn_to_nc_factor=454.27228520763305-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=p3::cldliq_to_ice_collection_factor=0.563601407608335-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=p3::rain_to_ice_collection_factor=0.15555185430789775-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=p3::accretion_prefactor=70.72628710173795-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=p3::deposition_nucleation_exponent=0.26550892164367196-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=p3::max_total_ni=3217857.5530263325-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=p3::ice_sedimentation_factor=1.4636018970894915-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=p3::rain_selfcollection_breakup_diameter=0.0003728306166824208-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=p3::autoconversion_qc_exponent=68.97932022922188-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=p3::autoconversion_prefactor=3.717543569624827-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=p3::autoconversion_radius=4.10939778337642e-05-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=scorpio::output_yaml_files=1ma_ne30pg2.yaml\,51hi.yaml\,3ha_ne30pg2.yaml-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=initial_conditions::filename=/lus/flare/projects/E3SM_Dec/whannah/HICCUP/HICCUP.atm_era5.2017-12-27.ne256np4.L128.nc-ATMCHANGE_SEP-
./xmlchange --append SCREAM_ATMCHANGE_BUFFER=physics::atm_procs_list=mac_aero_mic\,rrtmgp\,cosp-ATMCHANGE_SEP-
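
For comparing the two members, a minimal shell sketch that extracts the perturbed name=value pairs from each replay.sh (paths as in the two listings above) and diffs them:

    good=casedirs/2025-EACB.ne256.NN_64.nyr_1.0c66edae2dc6/case_scripts/replay.sh
    bad=casedirs/2025-EACB.ne256.NN_64.nyr_1.028d726bb46a/case_scripts/replay.sh

    extract () {
      grep "SCREAM_ATMCHANGE_BUFFER" "$1" \
        | sed -e 's/.*SCREAM_ATMCHANGE_BUFFER=//' -e 's/-ATMCHANGE_SEP-.*//' \
        | grep -E '^(shoc|p3)::' \
        | sort
    }

    # side-by-side differences in the shoc:: and p3:: parameters
    diff <(extract "$good") <(extract "$bad")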

amametjanov · Nov 11 '25 21:11

Layer intersection problem: the first error is at the top of the model (TOM), the second near the surface (layer 122). I don't recall ever seeing this kind of failure before. Could it be a timestep (CFL) issue?

If it's easy, could you run it in hydrostatic mode (theta_hydrostatic=.true., tstep_type=5) and see if it also crashes? Whether or not that run crashes will help isolate where to look for CFL issues.
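
For reference, a minimal sketch of how that test might be applied through the same SCREAM_ATMCHANGE_BUFFER mechanism used earlier in this thread. The group and key names below are assumptions (in stand-alone HOMME the ctl_nl entries are theta_hydrostatic_mode and tstep_type); verify them against the case's namelist defaults before use:

    # assumed keys -- confirm the names the EAMxx case actually exposes
    ./xmlchange --append SCREAM_ATMCHANGE_BUFFER=ctl_nl::theta_hydrostatic_mode=true-ATMCHANGE_SEP-
    ./xmlchange --append SCREAM_ATMCHANGE_BUFFER=ctl_nl::tstep_type=5-ATMCHANGE_SEP-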

mt5555 · Nov 11 '25 23:11