
Read failure with OpenMPI lock files generated on GPFS file system of Chrysalis

dqwu opened this issue 2 years ago · 9 comments

Recently, some tests run on Chrysalis failed when reading a specific NetCDF input file, with some lock files generated in the input directory. The E3SM PET_Ln9_PS.ne30pg2_EC30to60E2r2.WCYCL1850.chrysalis_intel.allactive-mach-pet test, for example, failed with the following error message:

256: PIO: FATAL ERROR: Aborting... An error occured, Reading variable (hyai, varid=5) from file (/lcrc/group/e3sm/data/inputdata/atm/cam/inic/homme/cami_mam3_Linoz_ne30np4_L72_c160214.nc, ncid=16) failed. Starting independent (across processes) access failed on the file. err=-204. Aborting since the error handler was set to PIO_INTERNAL_ERROR... (/gpfs/fs1/home/ac.brhillman/codes/e3sm/branches/next/externals/scorpio/src/clib/pio_getput_int.c: 958)

It seems to be a race condition, as this issue is not always reproducible. If you can figure out which input file the model was trying to read, check the inputdata directory for .lock files. There may also be some in your run directory.

-rw-r--r-- 1 ac.jgfouca    E3SM 8 Feb 17 23:27 /lcrc/group/e3sm/data/inputdata/atm/cam/inic/homme/cami_mam3_Linoz_ne30np4_L72_c160214.nc-1115488256-2337493.loc
-rw-r--r-- 1 ac.ndkeen     E3SM 8 Feb 17 23:30 /lcrc/group/e3sm/data/inputdata/atm/cam/inic/homme/cami_mam3_Linoz_ne30np4_L72_c160214.nc-1117061120-2338509.loc
-rw-r--r-- 1 ac.onguba     E3SM 8 Feb 17 23:32 /lcrc/group/e3sm/data/inputdata/atm/cam/inic/homme/cami_mam3_Linoz_ne30np4_L72_c160214.nc-1117257728-2339568.loc
-rw-r--r-- 1 jayesh        E3SM 8 Feb 17 23:35 /lcrc/group/e3sm/data/inputdata/atm/cam/inic/homme/cami_mam3_Linoz_ne30np4_L72_c160214.nc-1117323264-2340199.loc
-rw-r--r-- 1 ac.brhillman  E3SM 8 Feb 17 23:37 /lcrc/group/e3sm/data/inputdata/atm/cam/inic/homme/cami_mam3_Linoz_ne30np4_L72_c160214.nc-1117454336-2340833.loc
-rw-r--r-- 1 wuda          E3SM 8 Feb 17 23:39 /lcrc/group/e3sm/data/inputdata/atm/cam/inic/homme/cami_mam3_Linoz_ne30np4_L72_c160214.nc-1118240768-2341335.loc

These tests were run on 28 nodes (1792 MPI tasks, 64 tasks per node). One of them is a simple SCORPIO test program that just opens cami_mam3_Linoz_ne30np4_L72_c160214.nc and tries to read a specific variable. The file is in a directory that is group-writable. To reproduce the failure, a few other jobs need to be running, all trying to read that file, but it is not always reproducible.
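
For context, a minimal sketch of what such a test program looks like. This is not the actual reproducer: the I/O task count, stride, rearranger choice, and buffer size are assumptions, and error checks are omitted for brevity.

#include <mpi.h>
#include <pio.h>

int main(int argc, char **argv)
{
    int rank, nprocs, iosysid, ncid, varid;
    int iotype = PIO_IOTYPE_PNETCDF;
    double hyai[73];   /* assumption: hyai has 73 interface levels for an L72 file */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* One I/O system across all ranks; 4 I/O tasks is an arbitrary choice here. */
    PIOc_Init_Intracomm(MPI_COMM_WORLD, 4, nprocs / 4, 0, PIO_REARR_SUBSET, &iosysid);

    /* Every job running this opens the same shared input file... */
    PIOc_openfile(iosysid, &ncid, &iotype,
                  "/lcrc/group/e3sm/data/inputdata/atm/cam/inic/homme/cami_mam3_Linoz_ne30np4_L72_c160214.nc",
                  PIO_NOWRITE);

    /* ...and reads the variable that triggered the error. */
    PIOc_inq_varid(ncid, "hyai", &varid);
    PIOc_get_var_double(ncid, varid, hyai);

    PIOc_closefile(ncid);
    PIOc_finalize(iosysid);
    MPI_Finalize();
    return 0;
}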

The failure is in the MPI-IO API MPI_File_open(), which is called by the PnetCDF API ncmpi_begin_indep_data(): MPI error (MPI_File_open) : MPI_ERR_OTHER: known error not in list
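
To make that call chain concrete, here is a sketch of the PnetCDF sequence underneath the read. The function names are the real PnetCDF C API; the surrounding structure is illustrative.

#include <mpi.h>
#include <pnetcdf.h>

/* Illustrative only: the sequence driven when a PnetCDF file is
 * switched into independent data mode to read one variable.
 * Error checks omitted for brevity. */
void read_hyai_indep(MPI_Comm comm, const char *path, double *buf)
{
    int ncid, varid;

    /* Collective open of the shared input file. */
    ncmpi_open(comm, path, NC_NOWRITE, MPI_INFO_NULL, &ncid);
    ncmpi_inq_varid(ncid, "hyai", &varid);

    /* This is where the err=-204 failure surfaces: entering independent
     * data mode makes PnetCDF call MPI_File_open again under the hood,
     * and that open is what fails (and, with OpenMPI on GPFS, is
     * presumably where the lock files come from). */
    ncmpi_begin_indep_data(ncid);
    ncmpi_get_var_double(ncid, varid, buf);   /* independent-mode read */
    ncmpi_end_indep_data(ncid);

    ncmpi_close(ncid);
}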

It has been confirmed that these lock files are created by the OpenMPI library. Making inic/homme read-only has just moved the problem: .loc files are now appearing in relation to attempts to read /lcrc/group/e3sm/data/inputdata/atm/cam/topo/USGS-gtopo30_ne30np4pg2_16xdel2.c20200108.nc

A related issue has been filed with the OpenMPI developers; see https://github.com/open-mpi/ompi/issues/10053. According to the feedback there, this issue might have been fixed in OpenMPI 4.1.2.

Chrysalis might build the latest OpenMPI 4.1.2 module for testing. The currently installed versions are:

module spider openmpi

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  openmpi:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Versions:
        openmpi/2.1.5-tu6snxn
        openmpi/4.0.4-hpcx-cy5n3ft
        openmpi/4.0.4-hpcx-derf3y3
        openmpi/4.0.4-hpcx-hghvhj5
        openmpi/4.1.1a1-hpcx-2.8.0-u4oxaa4
        openmpi/4.1.1-qiqkjbu
        openmpi/4.1.1-73gbwq4

dqwu commented Feb 28 '22 17:02

There were a couple of failure modes. In the first mode, the model would hang until it ran out of time and Slurm killed it; this message was in the log file:

srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
   0: slurmstepd: error: *** STEP 146936.0 ON chr-0121 CANCELLED AT 2022-02-14T02:29:15 DUE TO TIME LIMIT ***
srun: got SIGCONT

In the second mode, it errors immediately with this message:

256: PIO: FATAL ERROR: Aborting... An error occured, Reading variable (hyai, varid=5) from file (/lcrc/group/e3sm/data/inputdata/atm/cam/inic/homme/cami_mam3_Linoz_ne30np4_L72_c160214.nc, ncid=16) failed. Starting independent (across processes) access failed on the file. err=-204. Aborting since the error handler was set to PIO_INTERNAL_ERROR... (/gpfs/fs1/home/ac.brhillman/codes/e3sm/branches/next/externals/scorpio/src/clib/pio_getput_int.c: 958)

On this test: PET_Ln9_PS.ne30pg2_EC30to60E2r2.WCYCL1850.chrysalis_intel.allactive-mach-pet.

rljacob commented Feb 28 '22 21:02

> In the first mode, the model would hang until it ran out of time and Slurm killed it.

Were the .loc files generated in the first mode?

dqwu commented Feb 28 '22 21:02

Yes.

rljacob commented Feb 28 '22 21:02

Even after PR #4823 updated the OpenMPI library (from 4.1.1 to 4.1.2), some cases are still seeing .loc files (cc: @tangq). I understand @jayeshkrishna is fixing SCORPIO to avoid this, but does this mean OpenMPI still has a bug?

rljacob commented Mar 21 '22 05:03

Yes, there is still a bug in OpenMPI causing this issue.

jayeshkrishna commented Mar 21 '22 13:03

The OpenMPI devs just mentioned that OpenMPI 4.1.3 (to be released this week) has some fixes that might be relevant to this issue.

jayeshkrishna commented Mar 21 '22 17:03

@dqwu is working on avoiding the calls in SCORPIO that cause OpenMPI to generate these lock files: https://github.com/E3SM-Project/scorpio/pull/472
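
For reference, a hedged sketch of the kind of change that avoids the lock files, assuming (as the error message suggests) that the trigger is independent data mode and that the workaround stays in collective mode instead; this is not necessarily what the PR above does:

#include <mpi.h>
#include <pnetcdf.h>

/* Collective-mode read of the same variable.  The "_all" PnetCDF calls
 * stay on the communicator-wide file handle, so there is no extra
 * per-process MPI_File_open and nothing for OpenMPI to lock. */
void read_hyai_coll(MPI_Comm comm, const char *path, double *buf)
{
    int ncid, varid;

    ncmpi_open(comm, path, NC_NOWRITE, MPI_INFO_NULL, &ncid);
    ncmpi_inq_varid(ncid, "hyai", &varid);

    /* No ncmpi_begin_indep_data() / ncmpi_end_indep_data() pair needed. */
    ncmpi_get_var_double_all(ncid, varid, buf);

    ncmpi_close(ncid);
}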

jayeshkrishna commented Mar 31 '22 20:03

The read failure seems to be fixed now with the latest SCORPIO release. For writes, even with OpenMPI 4.1.3, we can still reproduce similar issues with SCORPIO's replay tool (using the SUBSET rearranger). A related issue has been filed with the OpenMPI developers; see https://github.com/open-mpi/ompi/issues/10297

dqwu commented Apr 19 '22 21:04

However, E3SM runs should hopefully not be affected by the OpenMPI 4.1.3 bug above, since all E3SM runs use the BOX rearranger in SCORPIO.
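
For completeness, the rearranger is fixed when the SCORPIO I/O system is created, so whether a run is exposed to the SUBSET-path bug is decided at initialization. A short illustrative snippet (task counts are placeholders):

/* E3SM runs initialize SCORPIO with the BOX rearranger; the replay-tool
 * reproducer above used PIO_REARR_SUBSET instead. */
int iosysid;
PIOc_Init_Intracomm(MPI_COMM_WORLD, /* num_iotasks */ 4, /* stride */ 16,
                    /* base */ 0, PIO_REARR_BOX, &iosysid);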

jayeshkrishna commented Apr 19 '22 21:04