FATES PIO issue for `f19_g16` resolution `ERP` tests
In the fates test list we have two debug mode ERP tests using the f19_g16 resolution for the default set of fates run modes. The difference between the tests is that one runs with the gnu compiler and the other with intel. Both of these tests are failing with a PIO error while accessing the restart file:
64: PIO: FATAL ERROR: Aborting... An error occured, Waiting on pending requests on file (./ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.G.20240329_095917_p8h3b3.elm.r.0001-01-03-00000.nc, ncid=56) failed (Number of pending requests on file = 129, Number of variables with pending requests = 129, Number of request blocks = 2, Current block being waited on = 0, Number of requests in current block = 92).. Size of I/O request exceeds INT_MAX (err=-237). Aborting since the error handler was set to PIO_INTERNAL_ERROR... (/global/u1/g/glemieux/E3SM-project/e3sm/externals/scorpio/src/clib/pio_darray_int.c: 2087)
64: Obtained 10 stack frames.
64: /pscratch/sd/g/glemieux/e3sm-tests/pr6279-fates-basegen.fates.pm-cpu..E7792c63c19-F698a8df8/ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.G.20240329_095917_p8h3b3/bld/e3sm.exe() [0x5a40be8]
64: /pscratch/sd/g/glemieux/e3sm-tests/pr6279-fates-basegen.fates.pm-cpu..E7792c63c19-F698a8df8/ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.G.20240329_095917_p8h3b3/bld/e3sm.exe() [0x5a3fc95]
64: /pscratch/sd/g/glemieux/e3sm-tests/pr6279-fates-basegen.fates.pm-cpu..E7792c63c19-F698a8df8/ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.G.20240329_095917_p8h3b3/bld/e3sm.exe() [0x5a823bf]
64: /pscratch/sd/g/glemieux/e3sm-tests/pr6279-fates-basegen.fates.pm-cpu..E7792c63c19-F698a8df8/ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.G.20240329_095917_p8h3b3/bld/e3sm.exe() [0x5a73b7c]
64: /pscratch/sd/g/glemieux/e3sm-tests/pr6279-fates-basegen.fates.pm-cpu..E7792c63c19-F698a8df8/ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.G.20240329_095917_p8h3b3/bld/e3sm.exe() [0x5a827d8]
64: /pscratch/sd/g/glemieux/e3sm-tests/pr6279-fates-basegen.fates.pm-cpu..E7792c63c19-F698a8df8/ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.G.20240329_095917_p8h3b3/bld/e3sm.exe() [0x5a74cc3]
64: /pscratch/sd/g/glemieux/e3sm-tests/pr6279-fates-basegen.fates.pm-cpu..E7792c63c19-F698a8df8/ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.G.20240329_095917_p8h3b3/bld/e3sm.exe() [0x5a303f4]
64: /pscratch/sd/g/glemieux/e3sm-tests/pr6279-fates-basegen.fates.pm-cpu..E7792c63c19-F698a8df8/ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.G.20240329_095917_p8h3b3/bld/e3sm.exe() [0x4616652]
64: /pscratch/sd/g/glemieux/e3sm-tests/pr6279-fates-basegen.fates.pm-cpu..E7792c63c19-F698a8df8/ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.G.20240329_095917_p8h3b3/bld/e3sm.exe() [0x467faf3]
64: /pscratch/sd/g/glemieux/e3sm-tests/pr6279-fates-basegen.fates.pm-cpu..E7792c63c19-F698a8df8/ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold.G.20240329_095917_p8h3b3/bld/e3sm.exe() [0xb81e4e]
64: MPICH ERROR [Rank 64] [job id 23646208.0] [Fri Mar 29 12:37:52 2024] [nid006735] - Abort(-1) (rank 64 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 64
64:
64: aborting job:
64: application called MPI_Abort(MPI_COMM_WORLD, -1) - process 64
srun: error: nid006735: task 64: Exited with exit code 255
We have other similar fates ERP tests that run on ne4pg2_ne4pg2 and f09_g16 that don't seem to hit this issue, although those are not being run in debug mode.
- Is this issue occurring with the latest E3SM master?
- Do you see this issue on other machines (apart from pm)?
- Is the test using PnetCDF or NetCDF for writes (xmlquery for PIO_TYPENAME)?
- How many MPI processes is the test using on PM?
- Is this issue occurring with the latest E3SM master?
Yes, nearly the latest master. This was discovered when generating new fates test list baselines using E3SM v3.0.0-104-g7792c63c19 (commit from 4 days ago) and fates tag sci.1.70.0_api.32.0.0_tools.1.1.0.
- Do you see this issue on other machines (apart from pm)?
To be determined.
- Is the test using PnetCDF or NetCDF for writes (xmlquery for PIO_TYPENAME)?
Looks like land is using PnetCDF:
PIO_TYPENAME: ['CPL:pnetcdf', 'ATM:netcdf', 'LND:pnetcdf', 'ICE:pnetcdf', 'OCN:pnetcdf', 'ROF:pnetcdf', 'GLC:pnetcdf', 'WAV:pnetcdf', 'IAC:pnetcdf', 'ESP:pnetcdf']
- How many MPI processes is the test using on PM?
128 tasks. Here's the preview_run output:
CASE INFO:
nodes: 1
total tasks: 128
tasks per node: 128
thread count: 1
ngpus per node: 0
BATCH INFO:
FOR JOB: case.test
ENV:
Setting Environment ADIOS2_ROOT=/global/cfs/cdirs/e3sm/3rdparty/adios2/2.9.1/cray-mpich-8.1.25/gcc-11.2.0
Setting Environment Albany_ROOT=/global/common/software/e3sm/mali_tpls/albany-e3sm-serial-release-gcc
Setting Environment BLA_VENDOR=Generic
Setting Environment FI_CXI_RX_MATCH_MODE=software
Setting Environment GATOR_INITIAL_MB=4000MB
Setting Environment HDF5_USE_FILE_LOCKING=FALSE
Setting Environment MPICH_COLL_SYNC=MPI_Bcast
Setting Environment MPICH_ENV_DISPLAY=1
Setting Environment MPICH_VERSION_DISPLAY=1
Setting Environment NETCDF_PATH=/opt/cray/pe/netcdf-hdf5parallel/4.9.0.3/gnu/9.1
Setting Environment OMP_NUM_THREADS=1
Setting Environment OMP_PLACES=threads
Setting Environment OMP_PROC_BIND=spread
Setting Environment OMP_STACKSIZE=128M
Setting Environment PERL5LIB=/global/cfs/cdirs/e3sm/perl/lib/perl5-only-switch
Setting Environment PNETCDF_PATH=/opt/cray/pe/parallel-netcdf/1.12.3.3/gnu/9.1
Setting Environment Trilinos_ROOT=/global/common/software/e3sm/mali_tpls/trilinos-e3sm-serial-release-gcc
SUBMIT CMD:
sbatch --time 00:31:40 -q regular --account m2420 .case.test
MPIRUN (job=case.test):
srun --label -n 128 -N 1 -c 2 --cpu_bind=cores -m plane=128 /pscratch/sd/g/glemieux/e3sm-tests/pr6279-fates-basegen.fates.pm-cpu..E7792c63c19-F698a8df8/ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_gnu.elm-fates_cold.G.20240329_093430_nc2qos/bld/e3sm.exe >> e3sm.log.$LID 2>&1
I've confirmed this fails in non-debug mode as well.
Thanks, can you also print out the PIO_BUFFER_SIZE_LIMIT (./xmlquery PIO_BUFFER_SIZE_LIMIT) for the test?
We might be able to overcome this limit by increasing the number of I/O tasks too (setting PIO_NUMTASKS to say 8)
Try adding a testmod (like SMS_Ly2_P1x1.1x1_smallvilleIA.IELMCNCROP.anlgce_gnu.elm-force_netcdf_pio uses -- ./components/elm/cime_config/testdefs/testmods_dirs/elm/force_netcdf_pio) to set the number of I/O tasks for the test to 8 (./xmlchange PIO_NUMTASKS=8; ./xmlchange PIO_STRIDE=-99) and see if it works.
Thanks, can you also print out the PIO_BUFFER_SIZE_LIMIT (./xmlquery PIO_BUFFER_SIZE_LIMIT) for the test?
We might be able to overcome this limit by increasing the number of I/O tasks too (setting PIO_NUMTASKS to say 8)
PIO_BUFFER_SIZE_LIMIT: -1
I'm sorry, I don't quite understand what you're suggesting here. Do you want me to modify the failing f19_g16 test to use the force_netcdf_pio testmod shell script, adding the ./xmlchange commands you noted to it as well?
No, just add a testmod for the failing ERP test so that you can set the PIO_NUMTASKS to 8 and PIO_STRIDE to -99 (I mentioned the *elm-force_netcdf_pio test to use as a reference on how to add/set testmods for CIME tests).
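For concreteness, the shell_commands for such a testmod would amount to something like the sketch below (the testmod directory name is hypothetical, and in practice the two xmlchange calls would sit alongside the existing fates_cold settings):
#!/bin/bash
# Hypothetical testmod, e.g. components/elm/cime_config/testdefs/testmods_dirs/elm/fates_cold_pio8/shell_commands
# Cap PIO at 8 I/O tasks so each write request stays below INT_MAX;
# PIO_STRIDE=-99 lets CIME recompute the stride across the MPI ranks.
./xmlchange PIO_NUMTASKS=8
./xmlchange PIO_STRIDE=-99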
:tada: That did the trick. The test passes using the above PIO_NUMTASKS and PIO_STRIDE values you suggested @jayeshkrishna. What are the next steps for addressing this?
Can you also check if PIO_NUMTASKS=4 works? The solution for this issue would be to set the number of I/O tasks (8 or 4) permanently in a testmod for this test (add the above xmlchange commands to the testmod associated with this test). The value should get reset by E3SM (share utils) if the test is run with less than 8/4 procs.
PIO_NUMTASKS=4 works as well.
This particular testmod, fates_cold, is used pretty widely across a number of resolutions and is also the basis for other testmods. I can create a resolution-specific testmod for this one test at this resolution, but I'm wondering if there are other options for updating the PIO settings without having to tie a testmod to a given resolution.
ok, I will try to recreate the issue and find a fix for it in SCORPIO. Meanwhile, you can add the testmod to get the test working on PM.
Thanks for all your help @jayeshkrishna
You can put "if" statements in the shell_commands file and only take action if it's a certain resolution. See this example for the "noio" testmod in eam:
(base) jacob@Roberts-MacAirM2 noio % more shell_commands
#!/bin/bash
./xmlchange --append CAM_CONFIG_OPTS='-cosp'
# save benchmark timing info for provenance
./xmlchange SAVE_TIMING=TRUE
# on KNLs, run hyper-threaded with 64x2
if [ `./xmlquery --value MACH` == theta ]||[ `./xmlquery --value MACH` == cori-knl ]; then
./xmlchange MAX_MPITASKS_PER_NODE=64
./xmlchange MAX_TASKS_PER_NODE=128
./xmlchange NTHRDS=2
# avoid over-decomposing LND beyond 7688 clumps (grid cells)
if [ `./xmlquery --value NTASKS_LND` -gt 3844 ]; then ./xmlchange NTHRDS_LND=1; fi
else
./xmlchange NTHRDS=1
fi
Thanks for the suggestion @rljacob. I forgot I could xmlquery LND_GRID.
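Putting the two suggestions together, a resolution-guarded version of the PIO change inside the existing fates_cold shell_commands might look like the sketch below (it assumes LND_GRID reports 1.9x2.5 for the f19_g16 cases; worth verifying with ./xmlquery --value LND_GRID first):
# Only adjust the I/O decomposition on the 1.9x2.5 (f19) land grid,
# where the restart write requests overflow INT_MAX with the default PIO layout.
if [ `./xmlquery --value LND_GRID` == 1.9x2.5 ]; then
  ./xmlchange PIO_NUMTASKS=8
  ./xmlchange PIO_STRIDE=-99
fi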
I should note for reference that this test was working as of 67abd00. It stopped working sometime between then and 069c226.
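If it would help narrow down the regression, a git bisect between those two commits could automate the search. This is only a sketch: the create_test invocation is illustrative, assumes it is run on a machine where the test can build and run, and relies on create_test --wait returning a nonzero exit status when the test fails.
# Mark 069c226 as bad and 67abd00 as good, then let bisect drive the test.
git bisect start 069c226 67abd00
git bisect run bash -c '
  git submodule update --init --recursive &&
  ./cime/scripts/create_test ERP_D_Ld3.f19_g16.IELMFATES.pm-cpu_intel.elm-fates_cold --wait
'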