cudaErrorIllegalAddress with pm-gpu cdash tests (with `scream-output-preset-5` and 6 -- both have vertical remap yaml outputs)

Open ndkeen opened this issue 10 months ago • 1 comments

We've seen these two tests fail for many days: ERS_Ln90.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-gpu_gnugpu.scream-small_kernels--scream-output-preset-5 and ERP_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-gpu_gnugpu.scream-bfbhash--scream-output-preset-6

Trying to narrow this down, scream-output-preset-5 looks like the likely culprit, and the error possibly also shows up with scream-output-preset-6. Tests with presets 1, 2, 3, and 4 have not hit the error.

Additionally, not every attempt hits this error, so the CUDA error is intermittent with this testmod.

I verified that I get the same behavior with just SMS (i.e. all of these also fail in most attempts): SMS_Ln90.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-gpu_gnugpu.scream-output-preset-5

as well as a DEBUG test: SMS_D_Ln90.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-gpu_gnugpu.scream-output-preset-5

and with only 1 thread: SMS_PMx1_D_Ln90.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-gpu_gnugpu.scream-output-preset-5

The CUDA error does not occur at the same timestep from run to run either.
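Since the failure is intermittent, it can help to put a number on how often it happens. A minimal sketch, assuming a hypothetical helper `count_failures` (not part of CIME or EAMxx) that reruns any command a fixed number of times and reports how many runs exited non-zero; the command used to rerun a given test case is left to the caller:

```shell
# count_failures N CMD [ARGS...]
# Hypothetical helper: run CMD N times, discard its output, and print
# "<failed>/<N>" based on the exit status of each run.
count_failures() {
  local n=$1 fails=0 i=0
  shift
  while [ "$i" -lt "$n" ]; do
    "$@" >/dev/null 2>&1 || fails=$((fails + 1))
    i=$((i + 1))
  done
  echo "$fails/$n"
}
```

For example, `count_failures 10 ./rerun_case.sh` would print something like `7/10`, where `rerun_case.sh` is a placeholder for whatever script resubmits the case and returns non-zero on failure.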

perlmutter-login06% pwd
/global/cfs/cdirs/e3sm/ndk/repos/se00-apr23/components/eamxx/cime_config/testdefs/testmods_dirs/scream/output/preset
perlmutter-login06% grep hremap_to_ne4 */*
3/shell_commands:. $SCRIPTS_DIR/hremap_to_ne4/shell_commands
4/shell_commands:. $SCRIPTS_DIR/hremap_to_ne4/shell_commands
6/shell_commands:. $SCRIPTS_DIR/hremap_to_ne4/shell_commands
perlmutter-login06% grep vremap */*
5/shell_commands:. $SCRIPTS_DIR/vremap/shell_commands
6/shell_commands:. $SCRIPTS_DIR/vremap/shell_commands

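This points to the issue being in vremap: presets 5 and 6 are the only ones that source the vremap shell_commands, while presets 3 and 4 use hremap_to_ne4 and have not failed.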
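The two grep runs above can be folded into one per-preset summary. This is only a sketch: `list_remaps` is a hypothetical helper name, and it assumes the layout shown above, where each preset directory holds a `shell_commands` file that sources `hremap_to_ne4/shell_commands` and/or `vremap/shell_commands`.

```shell
# list_remaps DIR
# Hypothetical helper: for each preset subdirectory of DIR that has a
# shell_commands file, print which remap scripts that file sources.
list_remaps() {
  local root=$1 dir preset tags
  for dir in "$root"/*/; do
    [ -f "${dir}shell_commands" ] || continue
    preset=$(basename "$dir")
    tags=""
    grep -q 'hremap_to_ne4/shell_commands' "${dir}shell_commands" && tags="$tags hremap"
    grep -q 'vremap/shell_commands' "${dir}shell_commands" && tags="$tags vremap"
    # Presets that source neither script are reported as "none".
    echo "preset ${preset}:${tags:- none}"
  done
}
```

Running `list_remaps .` from the preset directory gives one line per preset, which makes it easy to see which remap scripts the failing presets have in common.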

YAML_FILES=$(ls -1 | grep 'eamxx_.*_output.yaml')
for fname in ${YAML_FILES}; do
  $YAML_EDIT_SCRIPT -f $fname --vertical-remap-file \${DIN_LOC_ROOT}/atm/scream/maps/vrt_remapping_p_levs_20230926.nc
done

For the other (conus) test, I can reproduce with something as simple as: SMS_P8x1_D_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-gpu_gnugpu.scream-output-preset-5, i.e. use SMS, DEBUG, and only 2 nodes (default is 8) without threading.

Directory where I made many attempts: /pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se00-apr23

The tests seem to pass on pm-cpu (i.e. I tried the ones that fail on pm-gpu, but not as extensively as above).

ndkeen, Apr 25 '24 16:04

Both of the failing tests pass on @bartgol's branch bartgol/eamxx/use-only-scorpio-clib.

ndkeen, May 04 '24 18:05