
xshseqr and xdhseqr fail with FPE if run in parallel

Open drhpc opened this issue 1 year ago • 4 comments

In current master, two tests fail if run in parallel:

69/70 Testing: xshseqr
69/70 Test: xshseqr
Command: "/sw/env/gcc-10.3.0/openmpi/4.1.1/bin/mpiexec" "-n" "2" "./xshseqr"
Directory: /home/rrztest/src/scalapack/TESTING
"xshseqr" start time: Jul 25 20:04 CEST
Output:
----------------------------------------------------------

 ScaLAPACK Test for PSHSEQR

 epsilon   =    5.96046448E-08
 threshold =    30.0000000    

 Residual and Orthogonality Residual computed by:

 Residual      =  || T - Q^T*A*Q ||_F / ( ||A||_F * eps * sqrt(N) )

 Orthogonality =  MAX( || I - Q^T*Q ||_F, || I - Q*Q^T ||_F ) /  (eps * N)

 Test passes if both residuals are less then threshold

    N  NB    P    Q  QR Time  CHECK
----- --- ---- ---- -------- ------

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x151fa27c93ff in ???
#1  0x151fa455124f in pstrord_
        at /home/rrztest/src/scalapack/SRC/pstrord.f:1087
#2  0x151fa457a300 in pslaqr3_
        at /home/rrztest/src/scalapack/SRC/pslaqr3.f:880
#3  0x151fa4565178 in pslaqr0_
        at /home/rrztest/src/scalapack/SRC/pslaqr0.f:598
#4  0x151fa456209d in pshseqr_
        at /home/rrztest/src/scalapack/SRC/pshseqr.f:441
#5  0x4036cf in pshseqrdriver
        at /home/rrztest/src/scalapack/TESTING/EIG/pshseqrdriver.f:413
#6  0x404427 in main
        at /home/rrztest/src/scalapack/TESTING/EIG/pshseqrdriver.f:565
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 0 on node node002 exited on signal 8 (Floating point exception).
--------------------------------------------------------------------------
<end of output>
Test time =   2.91 sec
----------------------------------------------------------
Test Failed.
"xshseqr" end time: Jul 25 20:04 CEST
"xshseqr" time elapsed: 00:00:02
----------------------------------------------------------

70/70 Testing: xdhseqr
70/70 Test: xdhseqr
Command: "/sw/env/gcc-10.3.0/openmpi/4.1.1/bin/mpiexec" "-n" "2" "./xdhseqr"
Directory: /home/rrztest/src/scalapack/TESTING
"xdhseqr" start time: Jul 25 20:04 CEST
Output:
----------------------------------------------------------

 ScaLAPACK Test for PDHSEQR

 epsilon   =    1.1102230246251565E-016
 threshold =    30.000000000000000     

 Residual and Orthogonality Residual computed by:

 Residual      =  || T - Q^T*A*Q ||_F / ( ||A||_F * eps * sqrt(N) )

 Orthogonality =  MAX( || I - Q^T*Q ||_F, || I - Q*Q^T ||_F ) /  (eps * N)

 Test passes if both residuals are less then threshold

    N  NB    P    Q  QR Time  CHECK
----- --- ---- ---- -------- ------

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x1488be0113ff in ???
#1  0x1488bff4ebae in pdtrord_
        at /home/rrztest/src/scalapack/SRC/pdtrord.f:1087
#2  0x1488bff77f2f in pdlaqr3_
        at /home/rrztest/src/scalapack/SRC/pdlaqr3.f:878
#3  0x1488bff62d2b in pdlaqr0_
        at /home/rrztest/src/scalapack/SRC/pdlaqr0.f:598
#4  0x1488bff5fc1d in pdhseqr_
        at /home/rrztest/src/scalapack/SRC/pdhseqr.f:441
#5  0x4036e2 in pdhseqrdriver
        at /home/rrztest/src/scalapack/TESTING/EIG/pdhseqrdriver.f:412
#6  0x404445 in main
        at /home/rrztest/src/scalapack/TESTING/EIG/pdhseqrdriver.f:564
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 0 on node node002 exited on signal 8 (Floating point exception).
--------------------------------------------------------------------------
<end of output>
Test time =   2.70 sec
----------------------------------------------------------
Test Failed.
"xdhseqr" end time: Jul 25 20:04 CEST
"xdhseqr" time elapsed: 00:00:02
----------------------------------------------------------

End testing: Jul 25 20:04 CEST

Both tests pass fine with -n 1. I tested on two machines with differing compilers and MPI versions (4.1.1 and 1.10.7).

I also see unusually long runtimes (hundreds of seconds) for some 2.2.0 tests when they run inside the pkgsrc build framework, but those tests do succeed eventually. The FPEs, by contrast, are hard failures.

drhpc, Jul 25 '22 18:07

Thanks for the bug report! More details about this bug:

  • It was not detected because some tests were disabled in the GitHub Actions workflow. They are now enabled, see 782e739f8eb0e7f4d51ad7dd23fc1d03dc99d240. (My bad, I shouldn't have committed directly to the repository. To avoid that in the future, I have just enabled the rule "Require a pull request before merging".)
  • The code breaks at

https://github.com/Reference-ScaLAPACK/scalapack/blob/de3919e49ab7c47b76248a9123ed448305ee84a6/SRC/pstrord.f#L1087-L1088

  • It breaks because NWIN = LIHI - I + 1 takes the value 0 during the execution of the test, so the integer division by 2*NWIN*NWIN on that line is a division by zero.

  • On my Linux machine, I added the following prints for debugging purposes:

*
               WRITE(*,*) "NWIN = ", NWIN, ", ILO = ", ILO, ", LIHI = ",
     $            LIHI, ", I = ", I
*
               IF( FLOPS.NE.0 .AND.
     $              ( FLOPS*100 ) / ( 2*NWIN*NWIN ) .GE. MMULT ) THEN
*

Result of the tests:

$ mpiexec -n 2 xshseqr
[...]
 NWIN =            4 , ILO =           73 , LIHI =           76 , I =           73
 NWIN =            4 , ILO =           73 , LIHI =           76 , I =           73
 NWIN =            4 , ILO =           73 , LIHI =           76 , I =           73
 NWIN =            4 , ILO =           73 , LIHI =           74 , I =           73
 NWIN =           19 , ILO =           31 , LIHI =           49 , I =           31
 NWIN =           19 , ILO =           31 , LIHI =           49 , I =           31
 NWIN =           19 , ILO =           31 , LIHI =           49 , I =           31
 NWIN =           19 , ILO =           31 , LIHI =           44 , I =           31
 NWIN =            0 , ILO =           45 , LIHI =           51 , I =           52

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x7fd360623d21 in ???
#1  0x7fd360622ef5 in ???
#2  0x7fd36045408f in ???
        at /build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
#3  0x7fd362e1ecb6 in pstrord_
        at ../SRC/pstrord.f:1090
#4  0x7fd362e4acc5 in pslaqr3_
        at ../SRC/pslaqr3.f:880
#5  0x7fd362e340da in pslaqr0_
        at ../SRC/pslaqr0.f:598
#6  0x7fd362e30dbf in pshseqr_
        at ../SRC/pshseqr.f:441
#7  0x558c3509e9ac in pshseqrdriver
        at ../TESTING/EIG/pshseqrdriver.f:413
#8  0x558c3509f8bd in main
        at ../TESTING/EIG/pshseqrdriver.f:565
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 0 on node weslleyp-XPS-15-9510 exited on signal 8 (Floating point exception).
--------------------------------------------------------------------------
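
For reference, the arithmetic behind the last debug line can be checked by hand. The sketch below is Python rather than Fortran. It takes LIHI and I from the output above, while the `use_matmul` helper, its guard on `nwin`, and the FLOPS and MMULT values in the usage lines are hypothetical illustrations, not the upstream fix.

```python
# Values copied from the last debug line printed before the SIGFPE.
LIHI, I = 51, 52

# Window size as computed in pstrord.f: NWIN = LIHI - I + 1.
NWIN = LIHI - I + 1


def use_matmul(flops, nwin, mmult):
    """Hypothetical guarded form of the test at pstrord.f:1087-1088.

    Mirrors  FLOPS.NE.0 .AND. (FLOPS*100)/(2*NWIN*NWIN) .GE. MMULT
    but returns False for an empty window instead of dividing by zero.
    Python's // matches Fortran integer division for non-negative operands.
    """
    if flops == 0 or nwin <= 0:
        return False
    return (flops * 100) // (2 * nwin * nwin) >= mmult


print("NWIN =", NWIN)
# Made-up FLOPS and MMULT values, only to exercise both branches.
print(use_matmul(100, NWIN, 50))  # empty window: guard skips the division
print(use_matmul(100, 4, 50))     # non-empty window: ratio is evaluated
```

Note that Fortran does not guarantee short-circuit evaluation of .AND., so the FLOPS.NE.0 check in the original condition cannot be relied on to protect the division; a guard on NWIN would likely need to be a separate, enclosing IF.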

weslleyspereira, Jul 25 '22 21:07

Minor update to my last message: all tests still pass in GitHub Actions, see https://github.com/Reference-ScaLAPACK/scalapack/actions/runs/2735265869.

Test xshseqr is still failing on my personal machine.

weslleyspereira, Jul 25 '22 21:07

So GitHub Actions are not actually using multiple CPU cores?

drhpc, Jul 26 '22 12:07

> So GitHub Actions are not actually using multiple CPU cores?

I think they are. #71 enforces mapping by cores, and I think this information (https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources) is accurate, i.e., we have 2 cores available per runner.

weslleyspereira, Jul 27 '22 14:07