scalapack
xshseqr and xdhseqr fail with FPE if run in parallel
In current master, two tests fail if run in parallel:
69/70 Testing: xshseqr
69/70 Test: xshseqr
Command: "/sw/env/gcc-10.3.0/openmpi/4.1.1/bin/mpiexec" "-n" "2" "./xshseqr"
Directory: /home/rrztest/src/scalapack/TESTING
"xshseqr" start time: Jul 25 20:04 CEST
Output:
----------------------------------------------------------
ScaLAPACK Test for PSHSEQR
epsilon = 5.96046448E-08
threshold = 30.0000000
Residual and Orthogonality Residual computed by:
Residual = || T - Q^T*A*Q ||_F / ( ||A||_F * eps * sqrt(N) )
Orthogonality = MAX( || I - Q^T*Q ||_F, || I - Q*Q^T ||_F ) / (eps * N)
Test passes if both residuals are less then threshold
N NB P Q QR Time CHECK
----- --- ---- ---- -------- ------
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
#0 0x151fa27c93ff in ???
#1 0x151fa455124f in pstrord_
at /home/rrztest/src/scalapack/SRC/pstrord.f:1087
#2 0x151fa457a300 in pslaqr3_
at /home/rrztest/src/scalapack/SRC/pslaqr3.f:880
#3 0x151fa4565178 in pslaqr0_
at /home/rrztest/src/scalapack/SRC/pslaqr0.f:598
#4 0x151fa456209d in pshseqr_
at /home/rrztest/src/scalapack/SRC/pshseqr.f:441
#5 0x4036cf in pshseqrdriver
at /home/rrztest/src/scalapack/TESTING/EIG/pshseqrdriver.f:413
#6 0x404427 in main
at /home/rrztest/src/scalapack/TESTING/EIG/pshseqrdriver.f:565
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 0 on node node002 exited on signal 8 (Floating point exception).
--------------------------------------------------------------------------
<end of output>
Test time = 2.91 sec
----------------------------------------------------------
Test Failed.
"xshseqr" end time: Jul 25 20:04 CEST
"xshseqr" time elapsed: 00:00:02
----------------------------------------------------------
70/70 Testing: xdhseqr
70/70 Test: xdhseqr
Command: "/sw/env/gcc-10.3.0/openmpi/4.1.1/bin/mpiexec" "-n" "2" "./xdhseqr"
Directory: /home/rrztest/src/scalapack/TESTING
"xdhseqr" start time: Jul 25 20:04 CEST
Output:
----------------------------------------------------------
ScaLAPACK Test for PDHSEQR
epsilon = 1.1102230246251565E-016
threshold = 30.000000000000000
Residual and Orthogonality Residual computed by:
Residual = || T - Q^T*A*Q ||_F / ( ||A||_F * eps * sqrt(N) )
Orthogonality = MAX( || I - Q^T*Q ||_F, || I - Q*Q^T ||_F ) / (eps * N)
Test passes if both residuals are less then threshold
N NB P Q QR Time CHECK
----- --- ---- ---- -------- ------
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
#0 0x1488be0113ff in ???
#1 0x1488bff4ebae in pdtrord_
at /home/rrztest/src/scalapack/SRC/pdtrord.f:1087
#2 0x1488bff77f2f in pdlaqr3_
at /home/rrztest/src/scalapack/SRC/pdlaqr3.f:878
#3 0x1488bff62d2b in pdlaqr0_
at /home/rrztest/src/scalapack/SRC/pdlaqr0.f:598
#4 0x1488bff5fc1d in pdhseqr_
at /home/rrztest/src/scalapack/SRC/pdhseqr.f:441
#5 0x4036e2 in pdhseqrdriver
at /home/rrztest/src/scalapack/TESTING/EIG/pdhseqrdriver.f:412
#6 0x404445 in main
at /home/rrztest/src/scalapack/TESTING/EIG/pdhseqrdriver.f:564
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 0 on node node002 exited on signal 8 (Floating point exception).
--------------------------------------------------------------------------
<end of output>
Test time = 2.70 sec
----------------------------------------------------------
Test Failed.
"xdhseqr" end time: Jul 25 20:04 CEST
"xdhseqr" time elapsed: 00:00:02
----------------------------------------------------------
End testing: Jul 25 20:04 CEST
Both tests pass fine with -n 1. I tested on two machines with differing compilers and MPI versions (4.1.1 and 1.10.7).
I observe weirdly long runtimes (hundreds of seconds) for some 2.2.0 tests when run inside the pkgsrc build framework, but they do succeed eventually. These FPEs are more definite.
Thanks for the bug report! More details about this bug:
- It was not detected because some tests were disabled in the Github Actions. Now they are enabled, see 782e739f8eb0e7f4d51ad7dd23fc1d03dc99d240. (My bad, I shouldn't commit directly to the repository. To avoid that, I have just enabled the rule "Require a pull request before merging".)
- The code breaks at
https://github.com/Reference-ScaLAPACK/scalapack/blob/de3919e49ab7c47b76248a9123ed448305ee84a6/SRC/pstrord.f#L1087-L1088
- It breaks because NWIN = LIHI - I + 1 takes the value 0 during the execution of the test.
- On my Linux machine, I added the following prints for debugging purposes:
*
      WRITE(*,*) "NWIN = ", NWIN, ", ILO = ", ILO, ", LIHI = ",
     $           LIHI, ", I = ", I
*
      IF( FLOPS.NE.0 .AND.
     $    ( FLOPS*100 ) / ( 2*NWIN*NWIN ) .GE. MMULT ) THEN
*
Result of the tests:
$ mpiexec -n 2 xshseqr
[...]
NWIN = 4 , ILO = 73 , LIHI = 76 , I = 73
NWIN = 4 , ILO = 73 , LIHI = 76 , I = 73
NWIN = 4 , ILO = 73 , LIHI = 76 , I = 73
NWIN = 4 , ILO = 73 , LIHI = 74 , I = 73
NWIN = 19 , ILO = 31 , LIHI = 49 , I = 31
NWIN = 19 , ILO = 31 , LIHI = 49 , I = 31
NWIN = 19 , ILO = 31 , LIHI = 49 , I = 31
NWIN = 19 , ILO = 31 , LIHI = 44 , I = 31
NWIN = 0 , ILO = 45 , LIHI = 51 , I = 52
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
#0 0x7fd360623d21 in ???
#1 0x7fd360622ef5 in ???
#2 0x7fd36045408f in ???
at /build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
#3 0x7fd362e1ecb6 in pstrord_
at ../SRC/pstrord.f:1090
#4 0x7fd362e4acc5 in pslaqr3_
at ../SRC/pslaqr3.f:880
#5 0x7fd362e340da in pslaqr0_
at ../SRC/pslaqr0.f:598
#6 0x7fd362e30dbf in pshseqr_
at ../SRC/pshseqr.f:441
#7 0x558c3509e9ac in pshseqrdriver
at ../TESTING/EIG/pshseqrdriver.f:413
#8 0x558c3509f8bd in main
at ../TESTING/EIG/pshseqrdriver.f:565
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 0 on node weslleyp-XPS-15-9510 exited on signal 8 (Floating point exception).
--------------------------------------------------------------------------
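The last debug line before the crash has I = 52 and LIHI = 51, so NWIN = LIHI - I + 1 = 0 and the denominator 2*NWIN*NWIN in the IF test is zero, which presumably is what raises the SIGFPE. The following is only a minimal standalone sketch of one possible guard, not the fix adopted in the repository. The program name NWINGUARD, the flag USEMM, and the values FLOPS = 100 and MMULT = 50 are made up for illustration; only the NWIN formula and the shape of the IF test come from the snippet above.

```fortran
      PROGRAM NWINGUARD
*     Sketch only. LIHI and I mimic the last debug line before the
*     crash; FLOPS = 100 and MMULT = 50 are made-up values.
      INTEGER            FLOPS, NWIN, LIHI, I, MMULT
      LOGICAL            USEMM
      PARAMETER          ( MMULT = 50 )
*
      FLOPS = 100
      LIHI = 51
      I = 52
      NWIN = LIHI - I + 1
*
*     Test NWIN in an outer IF before dividing. Fortran does not
*     guarantee short-circuit evaluation of .AND., so adding
*     NWIN.GT.0 to a single combined condition could still let
*     the division by zero be evaluated.
      USEMM = .FALSE.
      IF( FLOPS.NE.0 .AND. NWIN.GT.0 ) THEN
         IF( ( FLOPS*100 ) / ( 2*NWIN*NWIN ) .GE. MMULT )
     $      USEMM = .TRUE.
      END IF
*
      WRITE(*,*) 'NWIN = ', NWIN, ', USEMM = ', USEMM
      END
```

With NWIN = 0 the inner test is skipped and USEMM stays false. Whether an empty window should skip the update entirely, or whether I exceeding LIHI points to an earlier indexing problem, is a separate question that this sketch does not answer.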
Minor update to my last message: All tests still pass in the Github Actions, see https://github.com/Reference-ScaLAPACK/scalapack/actions/runs/2735265869.
Test xshseqr is still failing on my personal machine.
So Github Actions are not actually using multiple CPU cores?
I think they are. #71 enforces mapping by cores, and I think this information (https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources) is accurate, i.e., we have 2 cores available per runner.