Ginkgo 1.7.0 tests capture stderr and fail due to different number of mpirun warnings

Open • lahwaacz opened this issue 5 months ago • 1 comment

Hi,

I'm creating a stable ginkgo-hpc package for Arch Linux and I'm running into some issues. Besides #1564, #1566 and #1143, some tests fail with the following error:

281/285 Test #283: benchmark_multi_vector_distributed .......................***Failed    1.27 sec
TEST: '/usr/bin/mpiexec' '-n' '3' '/build/ginkgo-hpc/src/build/benchmark/blas/distributed/multi_vector_distributed' '-input' '[{"n": 100}]'
FAIL: stderr differs
---

+++

@@ -1,3 +1,6 @@

+[arch-nspawn-268570:99043] No HIP capabale device found. Disabling component.
+[arch-nspawn-268570:99045] No HIP capabale device found. Disabling component.
+[arch-nspawn-268570:99044] No HIP capabale device found. Disabling component.
 This is Ginkgo 1.7.0 (master)
     running with core module 1.7.0 (master)
 Running on reference(0)

282/285 Test #284: benchmark_spmv_distributed ...............................***Failed    1.27 sec
TEST: '/usr/bin/mpiexec' '-n' '3' '/build/ginkgo-hpc/src/build/benchmark/spmv/distributed/spmv_distributed' '-input' '[{"size": 100, "stencil": "7pt", "comm_pattern": "stencil"}]'
FAIL: stderr differs
---

+++

@@ -1,3 +1,6 @@

+[arch-nspawn-268570:99066] No HIP capabale device found. Disabling component.
+[arch-nspawn-268570:99065] No HIP capabale device found. Disabling component.
+[arch-nspawn-268570:99064] No HIP capabale device found. Disabling component.
 This is Ginkgo 1.7.0 (master)
     running with core module 1.7.0 (master)
 Running on reference(0)

283/285 Test #285: benchmark_solver_distributed .............................***Failed    1.21 sec
TEST: '/build/ginkgo-hpc/src/build/benchmark/solver/distributed/solver_distributed' '-input' '[{"size": 100, "stencil": "7pt", "comm_pattern": "stencil", "optimal": {"spmv": "csr-csr"}}]'
FAIL: stderr differs
---

+++

@@ -1,3 +1,4 @@

+[arch-nspawn-268570:99060] No HIP capabale device found. Disabling component.
 This is Ginkgo 1.7.0 (master)
     running with core module 1.7.0 (master)
 Running on reference(0)

The build system has no GPU, but ROCm/HIP is installed for building the -hip variant of the package. These tests, however, are built with -DGINKGO_BUILD_HIP=OFF (I know it is pointless to run HIP tests without a GPU).

Arch Linux ships a ROCm-aware OpenMPI 5.0, and it is the MPI library that prints the "No HIP capabale device found. Disabling component." message from each rank. Hence, if you compare the stderr of a serial run with that of a run through mpirun, there will necessarily be a difference. The tests should be designed more robustly: assuming that the MPI library itself prints nothing is rather naive.
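
To illustrate what a more tolerant comparison could look like, here is a minimal Python sketch. It is not Ginkgo's actual test code: the function names and the regular expression are hypothetical, and it assumes that MPI runtime diagnostics always carry a "[hostname:pid] " prefix like the lines in the log above. It drops such lines from the captured stderr before diffing against the expected output.

```python
import difflib
import re

# Assumption: MPI runtime diagnostics look like "[hostname:12345] message",
# as in the log above. This pattern is illustrative, not an official format.
MPI_RUNTIME_LINE = re.compile(r"^\[[^\]\s]+:\d+\] ")


def strip_mpi_runtime_lines(stderr_text):
    """Return the stderr lines that do not look like MPI runtime diagnostics."""
    return [
        line
        for line in stderr_text.splitlines()
        if not MPI_RUNTIME_LINE.match(line)
    ]


def stderr_diff(expected, actual):
    """Unified diff between expected stderr and the filtered actual stderr."""
    return list(
        difflib.unified_diff(
            expected.splitlines(),
            strip_mpi_runtime_lines(actual),
            lineterm="",
        )
    )


expected = (
    "This is Ginkgo 1.7.0 (master)\n"
    "    running with core module 1.7.0 (master)\n"
    "Running on reference(0)\n"
)
# Simulate three ranks each printing the OpenMPI warning before the banner.
actual = "".join(
    f"[arch-nspawn-268570:{pid}] No HIP capabale device found. Disabling component.\n"
    for pid in (99043, 99045, 99044)
) + expected

# With the runtime lines filtered out, the two outputs no longer differ.
assert stderr_diff(expected, actual) == []
```

The obvious trade-off is that such a filter would also hide any message from the benchmark itself that happens to use the same prefix, so the pattern would have to be kept as narrow as possible.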

lahwaacz, Mar 09 '24 16:03