Read failure with OpenMPI lock files generated on GPFS file system of Chrysalis
Recently, some tests run on Chrysalis failed when reading a specific NetCDF input file, and lock files were left behind in the input directory. For example, the E3SM test PET_Ln9_PS.ne30pg2_EC30to60E2r2.WCYCL1850.chrysalis_intel.allactive-mach-pet failed with the following error message:
256: PIO: FATAL ERROR: Aborting... An error occured, Reading variable (hyai, varid=5) from file (/lcrc/group/e3sm/data/inputdata/atm/cam/inic/homme/cami_mam3_Linoz_ne30np4_L72_c160214.nc, ncid=16) failed. Starting independent (across processes) access failed on the file. err=-204. Aborting since the error handler was set to PIO_INTERNAL_ERROR... (/gpfs/fs1/home/ac.brhillman/codes/e3sm/branches/next/externals/scorpio/src/clib/pio_getput_int.c: 958)
It seems to be a race condition as this issue is not always reproducible. If you can figure out what input file the model was trying to read, check the inputdata dir for .lock files. There may also be some in your run directory.
-rw-r--r-- 1 ac.jgfouca E3SM 8 Feb 17 23:27 /lcrc/group/e3sm/data/inputdata/atm/cam/inic/homme/cami_mam3_Linoz_ne30np4_L72_c160214.nc-1115488256-2337493.loc
-rw-r--r-- 1 ac.ndkeen E3SM 8 Feb 17 23:30 /lcrc/group/e3sm/data/inputdata/atm/cam/inic/homme/cami_mam3_Linoz_ne30np4_L72_c160214.nc-1117061120-2338509.loc
-rw-r--r-- 1 ac.onguba E3SM 8 Feb 17 23:32 /lcrc/group/e3sm/data/inputdata/atm/cam/inic/homme/cami_mam3_Linoz_ne30np4_L72_c160214.nc-1117257728-2339568.loc
-rw-r--r-- 1 jayesh E3SM 8 Feb 17 23:35 /lcrc/group/e3sm/data/inputdata/atm/cam/inic/homme/cami_mam3_Linoz_ne30np4_L72_c160214.nc-1117323264-2340199.loc
-rw-r--r-- 1 ac.brhillman E3SM 8 Feb 17 23:37 /lcrc/group/e3sm/data/inputdata/atm/cam/inic/homme/cami_mam3_Linoz_ne30np4_L72_c160214.nc-1117454336-2340833.loc
-rw-r--r-- 1 wuda E3SM 8 Feb 17 23:39 /lcrc/group/e3sm/data/inputdata/atm/cam/inic/homme/cami_mam3_Linoz_ne30np4_L72_c160214.nc-1118240768-2341335.loc
These tests were run on 28 nodes (1792 MPI tasks, 64 tasks per node). One of them is a simple SCORPIO test program which just opens cami_mam3_Linoz_ne30np4_L72_c160214.nc and tries to read a specific variable. The file is in a directory that is group-writable. To reproduce, there need to be a few other jobs running that are all trying to read that file at the same time, but it's not always reproducible.
The failure is from the MPI-IO API MPI_File_open(), which is called by the PnetCDF API ncmpi_begin_indep_data(): MPI error (MPI_File_open) : MPI_ERR_OTHER: known error not in list
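For illustration, here is a minimal sketch of that call path using the PnetCDF C API directly. This is not the actual SCORPIO test program; the file path and the hyai variable come from the error message above, everything else is illustrative.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <pnetcdf.h>

/* Abort with a readable message on any PnetCDF error. */
#define CHECK(err) do { \
    if ((err) != NC_NOERR) { \
        fprintf(stderr, "PnetCDF error: %s\n", ncmpi_strerror(err)); \
        MPI_Abort(MPI_COMM_WORLD, 1); \
    } \
} while (0)

int main(int argc, char **argv)
{
    int ncid, varid, dimid, err;
    MPI_Offset len;
    double *hyai;
    const char *path = (argc > 1) ? argv[1]
        : "/lcrc/group/e3sm/data/inputdata/atm/cam/inic/homme/cami_mam3_Linoz_ne30np4_L72_c160214.nc";

    MPI_Init(&argc, &argv);

    /* Collective open of the NetCDF file. */
    err = ncmpi_open(MPI_COMM_WORLD, path, NC_NOWRITE, MPI_INFO_NULL, &ncid);
    CHECK(err);

    err = ncmpi_inq_varid(ncid, "hyai", &varid);    CHECK(err);
    err = ncmpi_inq_vardimid(ncid, varid, &dimid);  CHECK(err);
    err = ncmpi_inq_dimlen(ncid, dimid, &len);      CHECK(err);
    hyai = (double *)malloc(len * sizeof(double));

    /* Switching to independent data mode makes each process re-open the
     * file with MPI_File_open(); that is the call failing above. */
    err = ncmpi_begin_indep_data(ncid);             CHECK(err);
    err = ncmpi_get_var_double(ncid, varid, hyai);  CHECK(err);
    err = ncmpi_end_indep_data(ncid);               CHECK(err);

    free(hyai);
    err = ncmpi_close(ncid);                        CHECK(err);
    MPI_Finalize();
    return 0;
}

Running something like this from several jobs against the same input file at the same time mimics the contention described above.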
It has been confirmed that these lock files are created by the OpenMPI library. Making inic/homme read-only has just moved the problem: .loc files are appearing again, this time when trying to read /lcrc/group/e3sm/data/inputdata/atm/cam/topo/USGS-gtopo30_ne30np4pg2_16xdel2.c20200108.nc
A related issue has been created for the OpenMPI developers, see https://github.com/open-mpi/ompi/issues/10053. According to the feedback, this issue might have been fixed in OpenMPI 4.1.2.
A module for the latest OpenMPI 4.1.2 might be built on Chrysalis for testing.
module spider openmpi
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
openmpi:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Versions:
openmpi/2.1.5-tu6snxn
openmpi/4.0.4-hpcx-cy5n3ft
openmpi/4.0.4-hpcx-derf3y3
openmpi/4.0.4-hpcx-hghvhj5
openmpi/4.1.1a1-hpcx-2.8.0-u4oxaa4
openmpi/4.1.1-qiqkjbu
openmpi/4.1.1-73gbwq4
There were a couple of failure modes. In the first mode, the model would hang until it ran out of time and slurm killed it. This message was in the log file:
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
0: slurmstepd: error: *** STEP 146936.0 ON chr-0121 CANCELLED AT 2022-02-14T02:29:15 DUE TO TIME LIMIT ***
srun: got SIGCONT
In the second mode, it errors immediately with this message:
256: PIO: FATAL ERROR: Aborting... An error occured, Reading variable (hyai, varid=5) from file (/lcrc/group/e3sm/data/inputdata/atm/cam/inic/homme/cami_mam3_Linoz_ne30np4_L72_c160214.nc, ncid=16) failed. Starting independent (across processes) access failed on the file. err=-204. Aborting since the error handler was set to PIO_INTERNAL_ERROR... (/gpfs/fs1/home/ac.brhillman/codes/e3sm/branches/next/externals/scorpio/src/clib/pio_getput_int.c: 958)
On this test: PET_Ln9_PS.ne30pg2_EC30to60E2r2.WCYCL1850.chrysalis_intel.allactive-mach-pet.
Were the .loc files generated in the first mode?
Yes.
Even after PR #4823 updated the OpenMPI library (from 4.1.1 to 4.1.2), some cases are still seeing .loc files (cc: @tangq). I understand @jayeshkrishna is fixing SCORPIO to avoid this, but does this mean OpenMPI still has a bug?
Yes, there is still a bug in OpenMPI causing this issue
The OpenMPI devs just mentioned that OpenMPI 4.1.3 (to be released this week) has some fixes that might be relevant to this issue
@dqwu is working on avoiding calls in SCORPIO that generate (cause OpenMPI to generate) these lock files - https://github.com/E3SM-Project/scorpio/pull/472
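The general idea behind that kind of change, sketched here with the PnetCDF C API: do the read with the collective interface so the file never has to be switched into independent data mode. This is only an illustration of the approach, not the contents of the PR; the function name and shapes are made up.

#include <mpi.h>
#include <pnetcdf.h>

/* Illustrative sketch: read with the collective API so the file is never
 * switched into independent data mode. It is ncmpi_begin_indep_data()
 * that makes each process re-open the file (and OpenMPI drop lock files). */
static int read_hyai_collective(int ncid, int rank, MPI_Offset nlev, double *buf)
{
    int varid, err;
    MPI_Offset start[1] = {0}, count[1] = {0};

    err = ncmpi_inq_varid(ncid, "hyai", &varid);
    if (err != NC_NOERR) return err;

    /* Only rank 0 actually reads data; the other ranks still take part in
     * the collective call, but with a zero-sized request. */
    if (rank == 0)
        count[0] = nlev;

    return ncmpi_get_vara_double_all(ncid, varid, start, count, buf);
}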
The read failure seems to be fixed now with the latest SCORPIO release. For writes, even with OpenMPI 4.1.3, we can still reproduce similar issues with SCORPIO's replay tool (using the SUBSET rearranger). A related issue has been created for the OpenMPI developers, see https://github.com/open-mpi/ompi/issues/10297
However, E3SM runs should hopefully not be affected by the OpenMPI 4.1.3 bug above, since all E3SM runs use the BOX rearranger in SCORPIO; the rearranger is chosen when the I/O system is initialized, as sketched below.
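A minimal sketch of where that choice is made, assuming the standard PIOc_Init_Intracomm interface of SCORPIO/PIO; the task counts are illustrative, and E3SM cases normally control this through their case settings rather than calling this directly.

#include <mpi.h>
#include <pio.h>

/* Illustrative only: the rearranger is fixed at I/O-system init time.
 * E3SM runs use PIO_REARR_BOX; the replay-tool reproducer above used
 * PIO_REARR_SUBSET, which is where the remaining write issue shows up. */
int init_iosystem(MPI_Comm comm, int *iosysid)
{
    int num_iotasks = 4;   /* illustrative values */
    int stride      = 16;  /* place one I/O task every 16 MPI tasks */
    int base        = 0;

    return PIOc_Init_Intracomm(comm, num_iotasks, stride, base,
                               PIO_REARR_BOX,  /* vs. PIO_REARR_SUBSET */
                               iosysid);
}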