
Fix sporadic h5diff_172 test failure w/ NVHPC

Open • derobins opened this issue 8 months ago • 1 comment

We are seeing sporadic test failures in the NVHPC CI action. The one I usually see is h5diff_172 (the parallel MPI_TEST_H5DIFF variant), though there may be others. The failures appear to be caused by a failing mkdir call during OpenMPI startup, which appears to be a known problem with OpenMPI.

See here:

https://github.com/open-mpi/ompi/issues/8510

We're currently testing with a pretty elderly version of NVHPC (23.9.0), since newer versions have problems with some long double conversions. This version of NVHPC appears to use an older version of OpenMPI (3.1.5; see the docs: https://docs.nvidia.com/hpc-sdk/archive/23.9/hpc-sdk-release-notes/index.html). The linked OpenMPI issue claims this is fixed in recent versions of OpenMPI, and I don't see it on my VMs, where I build with OpenMPI 4.1.5. We don't see this in other parallel test actions, since those usually only configure and build for parallel in GitHub CI and do not run the tests.

We probably have a few options:

  1. Disable NVHPC in GitHub CI and rely on CDash reporting, like we do for every other compiler w/ parallel HDF5
  2. Add --mca orte_tmpdir_base <dir> to OpenMPI's mpiexec options
  3. Fix the issues w/ long double so #4171 can go in, bumping NVHPC to 24.5, which should give us OpenMPI 4.1.x via HPC-X (I think - this is unclear from casually perusing the docs)
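
For option 2, a minimal sketch of what the extra mpiexec option could look like. This assumes the failure comes from the session directory being created under the shared /tmp/ompi.<host>.<uid> tree, and that giving each run its own pre-created base directory avoids it; neither is confirmed here. OMPI_TMPBASE and MPI_CMD are hypothetical names used only for illustration, and the ph5diff arguments are copied from the failing test.

```shell
#!/bin/sh
# Sketch only: OMPI_TMPBASE and MPI_CMD are illustrative names,
# not variables used by the HDF5 build or CI.

# Create a private base directory up front, so the OpenMPI session
# directory is not placed under the shared /tmp/ompi.<host>.<uid> tree.
OMPI_TMPBASE="$(mktemp -d "${TMPDIR:-/tmp}/ompi-base.XXXXXX")"

# The ph5diff invocation from the failing test, with the extra MCA option added.
MPI_CMD="mpiexec -n 2 --mca opal_warn_on_missing_libcuda 0 --mca orte_tmpdir_base $OMPI_TMPBASE ph5diff -v h5diff_basic1.h5 h5diff_basic1.h5 /g1/fp20 /g1/fp20_COPY"

# Print the command rather than running it; a CI step would execute it instead.
echo "$MPI_CMD"
```

OpenMPI also reads MCA parameters from environment variables named OMPI_MCA_<param>, so exporting OMPI_MCA_orte_tmpdir_base in the CI job might be a less invasive way to try this than changing the mpiexec flags in the build configuration. Either way, the temporary directory should be removed once the tests finish.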

The test failures look like this:

160: Test command: /usr/local/bin/cmake "-D" "TEST_EMULATOR=" "-D" "TEST_PROGRAM=/home/runner/work/hdf5/build/bin/h5diff" "-D" "TEST_ARGS:STRING=-v;h5diff_basic1.h5;h5diff_basic1.h5;/g1/fp20;/g1/fp20_COPY" "-D" "TEST_FOLDER=/home/runner/work/hdf5/build/tools/test/h5diff/testfiles" "-D" "TEST_OUTPUT=h5diff_172.out" "-D" "TEST_EXPECT=0" "-D" "TEST_REFERENCE=h5diff_172.txt" "-D" "TEST_APPEND=EXIT CODE:" "-P" "/home/runner/work/hdf5/hdf5/config/cmake/runTest.cmake"
160: Working Directory: /home/runner/work/hdf5/build/tools/test/h5diff/testfiles
160: Test timeout computed to be: 1200
160: -- Require TEST_EXPECT to be defined
160: -- COMMAND:  /home/runner/work/hdf5/build/bin/h5diff -v;h5diff_basic1.h5;h5diff_basic1.h5;/g1/fp20;/g1/fp20_COPY
160: -- COMMAND Result: 0
160: -- COMMAND Error: 
160: -- COMPARE Result: 0
160: -- /home/runner/work/hdf5/build/bin/h5diff Passed
1632/2920 Test  #160: H5DIFF-h5diff_172 ..........................................................   Passed    0.02 sec
test 161
          Start  161: MPI_TEST_H5DIFF-h5diff_172

161: Test command: /usr/local/bin/cmake "-D" "TEST_PROGRAM=/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/comm_libs/openmpi4/bin/mpiexec" "-D" "TEST_ARGS:STRING=-n;2;--mca;opal_warn_on_missing_libcuda;0;/home/runner/work/hdf5/build/bin/ph5diff;;-v;h5diff_basic1.h5;h5diff_basic1.h5;/g1/fp20;/g1/fp20_COPY" "-D" "TEST_FOLDER=/home/runner/work/hdf5/build/tools/test/h5diff/PAR/testfiles" "-D" "TEST_OUTPUT=h5diff_172.out" "-D" "TEST_EXPECT=0" "-D" "TEST_REFERENCE=h5diff_172.txt" "-D" "TEST_APPEND=EXIT CODE:" "-D" "TEST_REF_APPEND=EXIT CODE: [0-9]" "-D" "TEST_REF_FILTER=EXIT CODE: 0" "-D" "TEST_SORT_COMPARE=TRUE" "-P" "/home/runner/work/hdf5/hdf5/config/cmake/runTest.cmake"
161: Working Directory: /home/runner/work/hdf5/build/tools/test/h5diff/PAR/testfiles
161: Test timeout computed to be: 1200
161: -- Require TEST_EXPECT to be defined
161: -- COMMAND:  /opt/nvidia/hpc_sdk/Linux_x86_64/23.9/comm_libs/openmpi4/bin/mpiexec -n;2;--mca;opal_warn_on_missing_libcuda;0;/home/runner/work/hdf5/build/bin/ph5diff;;-v;h5diff_basic1.h5;h5diff_basic1.h5;/g1/fp20;/g1/fp20_COPY
161: -- COMMAND Result: 1
161: -- Output :
161: EXIT CODE: 1
161: 
161: -- Error Output :
161: --------------------------------------------------------------------------
161: A call to mkdir was unable to create the desired directory:
161: 
161:   Directory: /tmp/ompi.fv-az651-831.1001/pid.37803
161:   Error:     No such file or directory
161: 
161: Please check to ensure you have adequate permissions to perform
161: the desired operation.
161: --------------------------------------------------------------------------
161: [fv-az651-831:37803] [[33398,0],0] ORTE_ERROR_LOG: Error in file ../../orte/util/session_dir.c at line 107
161: [fv-az651-831:37803] [[33398,0],0] ORTE_ERROR_LOG: Error in file ../../orte/util/session_dir.c at line 346
161: --------------------------------------------------------------------------
161: It looks like orte_init failed for some reason; your parallel process is
161: likely to abort.  There are many reasons that a parallel process can
161: fail during orte_init; some of which are due to configuration or
161: environment problems.  This failure appears to be an internal failure;
161: here's some additional information (which may only be relevant to an
161: Open MPI developer):
161: 
161:   orte_session_dir failed
161:   --> Returned value Error (-1) instead of ORTE_SUCCESS
161: --------------------------------------------------------------------------
161: 
161: CMake Error at /home/runner/work/hdf5/hdf5/config/cmake/runTest.cmake:130 (message):
161:   Failed: Test program
161:   /opt/nvidia/hpc_sdk/Linux_x86_64/23.9/comm_libs/openmpi4/bin/mpiexec exited
161:   != 0.
161: 
161: 
161: 
1633/2920 Test  #161: MPI_TEST_H5DIFF-h5diff_172 .................................................***Failed    0.03 sec

derobins, Jun 15 '24 04:06