cudaErrorIllegalAddress with pm-gpu cdash tests (with `scream-output-preset-5` and 6 -- both have vertical remap yaml outputs)
These two tests have been failing for many days:
ERS_Ln90.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-gpu_gnugpu.scream-small_kernels--scream-output-preset-5
and
ERP_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-gpu_gnugpu.scream-bfbhash--scream-output-preset-6
Trying to narrow this down, scream-output-preset-5 looks like the likely culprit, and the error possibly also shows up with scream-output-preset-6. Tests with presets 1, 2, 3, and 4 have not hit the error.
Additionally, the failure is intermittent: not every attempt hits the error, so any given run with this testmod has some chance of getting the CUDA error.
I verified that I get the same behavior with just SMS (i.e., all of the following also fail in most attempts):
SMS_Ln90.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-gpu_gnugpu.scream-output-preset-5
as well as a DEBUG test:
SMS_D_Ln90.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-gpu_gnugpu.scream-output-preset-5
and with only 1 thread:
SMS_PMx1_D_Ln90.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-gpu_gnugpu.scream-output-preset-5
The CUDA error does not occur at the same timestep from run to run, either.
perlmutter-login06% pwd
/global/cfs/cdirs/e3sm/ndk/repos/se00-apr23/components/eamxx/cime_config/testdefs/testmods_dirs/scream/output/preset
perlmutter-login06% grep hremap_to_ne4 */*
3/shell_commands:. $SCRIPTS_DIR/hremap_to_ne4/shell_commands
4/shell_commands:. $SCRIPTS_DIR/hremap_to_ne4/shell_commands
6/shell_commands:. $SCRIPTS_DIR/hremap_to_ne4/shell_commands
perlmutter-login06% grep vremap */*
5/shell_commands:. $SCRIPTS_DIR/vremap/shell_commands
6/shell_commands:. $SCRIPTS_DIR/vremap/shell_commands
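This points to the issue being in vremap: presets 5 and 6 are the only ones that source the vremap shell_commands, while hremap_to_ne4 is also sourced by presets 3 and 4, which have not hit the error.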
YAML_FILES=$(ls -1 | grep 'eamxx_.*_output.yaml')
for fname in ${YAML_FILES}; do
  $YAML_EDIT_SCRIPT -f $fname --vertical-remap-file \${DIN_LOC_ROOT}/atm/scream/maps/vrt_remapping_p_levs_20230926.nc
done
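The grep comparison above can be sketched as a small script that lists which presets source each remap script. This is only an illustration: it builds a throwaway mock of the preset layout (the files for presets 1 and 2 are empty placeholders, not their real contents), and the helper name presets_using is hypothetical. On a real checkout, PRESET_DIR would instead point at the testmods preset directory shown in the pwd output.

```shell
#!/usr/bin/env bash
# Sketch: tabulate which preset directories source a given remap script.
# The layout below is a MOCK that mirrors the grep output in this report;
# presets 1 and 2 get empty placeholder files (their real contents are not shown here).
set -euo pipefail

PRESET_DIR=$(mktemp -d)
trap 'rm -rf "$PRESET_DIR"' EXIT

for n in 1 2 3 4 5 6; do
  mkdir -p "$PRESET_DIR/$n"
  : > "$PRESET_DIR/$n/shell_commands"
done
for n in 3 4 6; do
  echo '. $SCRIPTS_DIR/hremap_to_ne4/shell_commands' >> "$PRESET_DIR/$n/shell_commands"
done
for n in 5 6; do
  echo '. $SCRIPTS_DIR/vremap/shell_commands' >> "$PRESET_DIR/$n/shell_commands"
done

# presets_using NAME: print the preset numbers whose shell_commands
# source NAME/shell_commands, space separated (hypothetical helper).
presets_using() {
  (cd "$PRESET_DIR" \
    && grep -l "/$1/shell_commands" */shell_commands \
    | cut -d/ -f1 | sort -n | tr '\n' ' ' | sed 's/ $//')
}

echo "hremap_to_ne4: $(presets_using hremap_to_ne4)"
echo "vremap: $(presets_using vremap)"
```

With the mock layout this reports presets 3 4 6 for hremap_to_ne4 and 5 6 for vremap, so the only presets using vremap are the two that show the error.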
For the other conus test, I can reproduce with something as simple as:
SMS_P8x1_D_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-gpu_gnugpu.scream-output-preset-5
i.e., SMS with DEBUG, using only 2 nodes (the default is 8) and no threading.
Directory where I made many attempts: /pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se00-apr23
The tests seem to pass on pm-cpu (i.e., I tried the ones that fail on pm-gpu, though not as extensively as above).
Both of the failing tests pass on @bartgol's branch bartgol/eamxx/use-only-scorpio-clib