lapack icon indicating copy to clipboard operation
lapack copied to clipboard

BUG: 4x4 test example for sy/hevr with INFO=2

Open ilayn opened this issue 4 years ago • 15 comments
trafficstars

Dear all,

Based on a user bug report to SciPy https://github.com/scipy/scipy/issues/13220, I have been scratching my head with this very basic example which causes sy/heevr fail (possibly during internal dstebz call) when eigenvectors are not requested. With eigenvector request it goes through.

This is my Cython calling code

cimport cython
cimport numpy as cnp
import numpy as np
from scipy.linalg.cython_lapack cimport dsyevr

cdef double[::1, :] A = np.array([[1.0, 0.0, 0.0, 0.0],
                                  [0.0, 0.8100000000000002753, 0., 0.],
                                  [0.0, 0.1800000000000001876, 0.04000000000000007022, 0.],
                                  [-0.3999999999999999112, 0., 0., 0.1599999999999999201]],
                                order='F')
cdef int LDA = 4
cdef char* JOBZ = 'N'
cdef char* RANGE = 'I'
cdef char* UPLO = 'L'
cdef int N = 4
cdef double VL = 0.0
cdef double VU = 1.0
cdef int IL = 4
cdef int IU = 4
cdef double ABSTOL = 0.0
cdef int M
cdef double[::1] W = np.empty(4, dtype=float)
cdef double[::1, :] Z = np.empty([4, 4], order='F')
cdef int LDZ = 4
cdef int[::1] ISUPPZ = np.empty(4, dtype=int)
cdef double[::1] WORK = np.empty(132)
cdef int LWORK = 132
cdef int[::1] IWORK = np.empty(40, dtype=int)
cdef int LIWORK = 40
cdef int INFO

dsyevr(JOBZ, RANGE, UPLO, &N, &A[0, 0], &LDA, &VL, &VU, &IL, &IU, &ABSTOL,
       &M, &W[0], &Z[0, 0], &LDZ, &ISUPPZ[0], &WORK[0], &LWORK, &IWORK[0], &LIWORK, &INFO)
print("info : ",INFO)
print("eigval array is : ",np.array(W))
print("Eigvec array is : ",np.array(Z))
print("Number of eigs found : ", M)

and this is the output I get

info :  2
eigval array is :  [1.16000000e+000 2.12162195e+161 1.94819817e+233 5.28562263e+180]
Eigvec array is :  
[[6.23042070e-307 1.86918699e-306 1.69121096e-306 1.06810879e-306]
 [1.42409073e-306 1.78019082e-306 1.42419938e-306 7.56603881e-307]
 [8.45603441e-307 1.42417221e-306 1.60219306e-306 6.23059726e-307]
 [1.06811422e-306 1.42417221e-306 1.37961641e-306 1.20955411e-312]]
Number of eigs found :  0

As you can see Z array is not referenced as expected and the correct eigenvalue is written to the W array first element however somehow the return is with M=0 and info set to 2. I don't have the toolchain to perform a full analysis however could you please point me to a direction why this might be happening?

Pinging @martin-frbg since this is both happening on OpenBLAS just for heads-up and also the user was on IntelMKL.

ilayn avatar Dec 13 '20 23:12 ilayn

A is not symmetric?

zerothi avatar Dec 14 '20 05:12 zerothi

Only the lower part is referenced through UPLO flag

ilayn avatar Dec 14 '20 10:12 ilayn

Indeed there seems to be a problem here.

langou avatar Dec 14 '20 15:12 langou

What I also noticed is that if I transpose and use the uplo='U' it doesn't have a problem.

ilayn avatar Dec 14 '20 16:12 ilayn

As one can see, the script calls DSYEVR with JOBZ='N', RANGE='I' , and UPLO='L'. This indeed leads to a problem in DSTEBZ (bisection), a situation that I have not observed before. Given the input matrix (UPLO='L') we get the tridiagonal (D and E)

 1.00000000E+00    -4.00000000E-01
 1.60000000E-01     0.00000000E+00
 4.00000000E-02     1.80000000E-01
 8.10000000E-01     6.95311974-310

which obviously splits, with eigenvalues (close to) 0.0 and 1.16 in the top part, and (close to) 0.0 and 0.85 in the bottom part. DSTEBZ is called with RANGE=I and IL=IU=4 on that tridiagonal and then returns INFO=2. The comments for INFO in DSTEBZ say:

222 *> =2 or 3: RANGE='I' only: Not all of the eigenvalues 223 *> IL:IU were found. 224 *> Effect: M < IU+1-IL 225 *> Cause: non-monotonic arithmetic, causing the 226 *> Sturm sequence to be non-monotonic. 227 *> Cure: recalculate, using RANGE='A', and pick 228 *> out eigenvalues IL:IU. In some cases, 229 *> increasing the PARAMETER "FUDGE" may 230 *> make things work.

IL=IU=4 corresponds to an eigenvalue in the top part of the matrix. It remains to be investigated whether the problem in DSTEBZ is caused by a combination of the splitting and those eigenvalues close to 0.0 or something else. (If we set UPLO='U' then the tridiagonal is defined by the diagonal of the matrix and the eigenvalues are straightforward.)

If we call DSYEVR with JOBZ='V', RANGE='A' , and UPLO='L', then another algorithm is used: DSTEMR (MRRR) is invoked and we get

2.2204E-16 1.1102E-15 8.5000E-01 1.1600E+00

for the eigenvalues, as expected.

Osni

oamarques avatar Dec 15 '20 16:12 oamarques

Ah yes, sorry for the terse comment; what I meant is if you transpose the original matrix so that the interested part is on the upper triangle and then use UPLO='U' then surprisingly dstebz goes through.

By the way this also applies to the he family of evr

ilayn avatar Dec 15 '20 16:12 ilayn

I apologize before hand, but both Julia and R are giving errors when calculating eigenvalues. Is this related? Here is the core dump:


PID: 3209 (R)
           UID: 1000 (danielc)
           GID: 1000 (danielc)
        Signal: 4 (ILL)
     Timestamp: Thu 2020-12-17 17:48:18 -03 (7min ago)
  Command Line: /usr/lib64/R/bin/exec/R
    Executable: /usr/lib/R/bin/exec/R
 Control Group: /user.slice/user-1000.slice/session-2.scope
          Unit: session-2.scope
         Slice: user-1000.slice
       Session: 2
     Owner UID: 1000 (danielc)
      Hostname: daniel-vostro3560
       Storage: /var/lib/systemd/coredump/core.R.1000.5a08490df9c74153aa26c65ab6554092.3209.1608238>
       Message: Process 3209 (R) of user 1000 dumped core.
                
                Stack trace of thread 3209:
                #0  0x00007f0f5395cf76 dsymv_thread_L (libblas.so.3 + 0xccf76)
                #1  0x00007f0f5392f8e7 dsymv_ (libblas.so.3 + 0x9f8e7)
                #2  0x00007f0f2fd05b7f dsytd2_ (liblapack.so.3 + 0x237b7f)
                #3  0x00007f0f2fd0720c dsytrd_ (liblapack.so.3 + 0x23920c)
                #4  0x00007f0f2fcffcd1 dsyevr_ (liblapack.so.3 + 0x231cd1)
                #5  0x00007f0f54d85b50 n/a (lapack.so + 0x3b50)
                #6  0x00007f0f54ecfa26 n/a (libR.so + 0x136a26)
                #7  0x00007f0f54ee9a80 Rf_eval (libR.so + 0x150a80)
                #8  0x00007f0f54eeb7b7 n/a (libR.so + 0x1527b7)
                #9  0x00007f0f54eec614 Rf_applyClosure (libR.so + 0x153614)
                #10 0x00007f0f54ee9cb1 Rf_eval (libR.so + 0x150cb1)
                #11 0x00007f0f54f1e046 Rf_ReplIteration (libR.so + 0x185046)
                #12 0x00007f0f54f1e3e1 n/a (libR.so + 0x1853e1)
                #13 0x00007f0f54f1e499 run_Rmainloop (libR.so + 0x185499)
                #14 0x000055c0bb8ea03d main (R + 0x103d)
                #15 0x00007f0f54bc2152 __libc_start_main (libc.so.6 + 0x28152)
                #16 0x000055c0bb8ea07e _start (R + 0x107e)

Here is the code that generates the error, in R:


aa <- matrix(rnorm(50*10),ncol=10)
 B <- t(aa)%*%aa
eigen(B)

I will be glad to provide more details or open a new issue if it is necessary

danmrc avatar Dec 17 '20 21:12 danmrc

@danmrc dsymv_thread_L in your backtrace makes it llikely that your libblas.so.3 is OpenBLAS rather than the reference implementation. Can you tell which version it is ? (Does not completely exclude the possibilty that your problem is related, but wrong eigenvalues should not lead to an array overflow)

martin-frbg avatar Dec 17 '20 21:12 martin-frbg

@danmrc dsymv_thread_L in your backtrace makes it llikely that your libblas.so.3 is OpenBLAS rather than the reference implementation. Can you tell which version it is ? (Does not completely exclude the possibilty that your problem is related, but wrong eigenvalues should not lead to an array overflow)

I was not sure if it was an openblas issue and I was not sure if I opened an issue here or in openblas repo.

The OpenBLAS version is 0.3.13 and the LAPACK version is 3.9.0-3.

I tried uninstalled OpenBLAS and moved to BLAS and the error is gone, so this clears the issue, I guess? Thanks for the help

danmrc avatar Dec 17 '20 22:12 danmrc

A similar to the original problem namely dsyevr failure is reported again to SciPy


a = np.array([[0.9349689745783974, 0.0225235320981566, -0.05953165551175491, 0.014890259948252545, -0.12836839955836318, -0.0480224626614684, -0.06886406875564752, 0.04046562569985317, 0.03714156517414783, -0.018402704526518737, 0.07197525245559412],
              [0.02252353209815662, 0.9230010962223161, -0.1044639804366201, 0.06368964318517553, 0.08081190940668663, 0.03276043172122358, 0.07485362171880014, -0.027502246383933115, -0.10909283641841855, -0.03176972346464801, 0.001635024881184706],
              [-0.05953165551175491, -0.10446398043662013, 0.43689356064293466, -0.04106527517776532, 0.02664014770883955, 0.09124845509262787, 0.025622907523148167, -0.011849459121165726, -0.09351301483316642, -0.16978156327404034, 0.10678633516502224],
              [0.014890259948252547, 0.06368964318517553, -0.04106527517776534, 0.7915128078107512, 0.07906392851745978, 0.05470555765353876, -0.008992022336606922, 0.06682515975862707, 0.06991049787262885, -0.09114607417272744, -0.052552901004017695],
              [-0.12836839955836318, 0.08081190940668662, 0.026640147708839546, 0.07906392851745979, 0.513176133859719, 0.1336991193812062, -0.2107867463135503, -0.0027064628912624126, 0.04143438188034744, 0.19613412836894376, 0.26763418534029276],
              [-0.04802246266146838, 0.032760431721223615, 0.09124845509262787, 0.054705557653538746, 0.13369911938120618, 0.44124749836332827, -0.04337796788306535, 0.018302593355378014, 0.2851922695999016, -0.11805040819010776, -0.19472560800690397],
              [-0.0688640687556475, 0.07485362171880014, 0.025622907523148167, -0.008992022336606904, -0.21078674631355024, -0.043377967883065346, 0.8546191070536638, -0.04265722515539972, 0.16331372758759188, 0.10808523737411606, 0.06733876708529098],
              [0.04046562569985316, -0.02750224638393315, -0.011849459121165733, 0.06682515975862709, -0.0027064628912624425, 0.018302593355378, -0.04265722515539969, 0.6987197080750596, 0.14911240832667116, 0.25689421976827004, -0.04791037128627005],
              [0.037141565174147836, -0.10909283641841856, -0.09351301483316642, 0.06991049787262886, 0.04143438188034744, 0.2851922695999017, 0.1633137275875919, 0.14911240832667114, 0.5916376238705497, -0.12495295168958077, 0.12285074933595315],
              [-0.018402704526518758, -0.031769723464648084, -0.16978156327404037, -0.09114607417272744, 0.19613412836894378, -0.11805040819010776, 0.10808523737411603, 0.25689421976827004, -0.12495295168958075, 0.6403839116892462, -0.045363110885673706],
              [0.07197525245559414, 0.0016350248811847071, 0.10678633516502221, -0.052552901004017674, 0.26763418534029265, -0.19472560800690397, 0.06733876708529103, -0.04791037128627001, 0.12285074933595318, -0.0453631108856737, 0.785416519944824]])

This time it fails when the largest two eigenvalues are requested instead leads to the failure. N,LDA = 11 and IL, IU= 10,11

Just like the original case, when UPLO switch is changed the problem goes through without error( the array is tiny bit different in the lower part around eps value) but reports empty array. So I'm not sure what to make out of this.

ilayn avatar Apr 21 '21 18:04 ilayn

I am unable to reproduce the problem for this matrix of dimension 11, with gfortran and LAPACK 3.9.0. I would have to know the version of LAPACK that is being used and how it was compiled. This is what I get in four different scenarios, where D and E defines the symmetric tridiagonal matrix DSYEVR deals with:

DSYEVR: JOBZ=N,RANGE=I,UPLO=L ================== Calling DSTEBZ, RANGE=I, VLL=-6.9169-323, VUU= 9.8813-324, IL= 10, IU= 11 D and E: 9.34968975E-01 -1.90619961E-01 3.98659736E-01 -1.89051587E-01 1.27591122E-01 -1.11307662E-01 6.17567169E-01 -7.19276978E-02 4.47898019E-01 -1.88601707E-01 8.48919219E-02 -4.99831090E-13 1.00000000E+00 7.71553582E-16 1.00000000E+00 1.67136486E-15 1.00000000E+00 1.03555279E-15 1.00000000E+00 -2.31898783E-16 1.00000000E+00 0.00000000E+00 DSYEVR normal exit, m = 2, w(1:m) = 1.00000000E+00 1.00000000E+00 DSYEVR: JOBZ=N,RANGE=I,UPLO=U ================== Calling DSTEBZ, RANGE=I, VLL= 0.0000E+00, VUU= 0.0000E+00, IL= 10, IU= 11 D and E: 1.00000000E+00 -2.56644189E-16 1.00000000E+00 8.12228593E-16 1.00000000E+00 -7.08603844E-16 1.00000000E+00 -7.19286148E-16 1.00000000E+00 5.11871794E-13 1.75439639E-01 -2.34607279E-01 3.49849638E-01 7.82007891E-02 6.04150656E-01 9.53012421E-02 4.50974046E-01 -1.48255688E-01 2.45746444E-01 3.90991548E-01 7.85416520E-01 0.00000000E+00 DSYEVR normal exit, m = 2, w(1:m) = 1.00000000E+00 1.00000000E+00 DSYEVR: JOBZ=V,RANGE=A,UPLO=L ================== Calling DSTEMR DSYEVR normal exit, m = 11, w(1:m) = 0.00000000E+00 9.35710688E-03 4.22925331E-01 5.03963181E-01 6.75331324E-01 1.00000000E+00 1.00000000E+00 1.00000000E+00 1.00000000E+00 1.00000000E+00 1.00000000E+00 DSYEVR: JOBZ=V,RANGE=A,UPLO=U ================== Calling DSTEMR DSYEVR normal exit, m = 11, w(1:m) = 4.44089210E-16 9.35710688E-03 4.22925331E-01 5.03963181E-01 6.75331324E-01 1.00000000E+00 1.00000000E+00 1.00000000E+00 1.00000000E+00 1.00000000E+00 1.00000000E+00

oamarques avatar Apr 22 '21 16:04 oamarques

I'll cook up an example like the original post as soon as I switch to my other computer.

ilayn avatar Apr 22 '21 16:04 ilayn

Here is an example that leads to info=2 in DSYEVR call on my machine

cimport cython
cimport numpy as cnp
import numpy as np
from scipy.linalg.cython_lapack cimport dsyevr

cdef double[::1, :] A = np.array([[0.9349689745783974, 0.02252353209815662, -0.05953165551175491, 0.014890259948252547, -0.12836839955836318, -0.04802246266146838, -0.0688640687556475, 0.04046562569985316, 0.037141565174147836, -0.018402704526518758, 0.07197525245559414],
              [0.0225235320981566, 0.9230010962223161, -0.10446398043662013, 0.06368964318517553, 0.08081190940668662, 0.032760431721223615, 0.07485362171880014, -0.02750224638393315, -0.10909283641841856, -0.031769723464648084, 0.0016350248811847071],
              [-0.05953165551175491, -0.1044639804366201, 0.43689356064293466, -0.04106527517776534, 0.026640147708839546, 0.09124845509262787, 0.025622907523148167, -0.011849459121165733, -0.09351301483316642, -0.16978156327404037, 0.10678633516502221],
              [0.014890259948252545, 0.06368964318517553, -0.04106527517776532, 0.7915128078107512, 0.07906392851745979, 0.054705557653538746, -0.008992022336606904, 0.06682515975862709, 0.06991049787262886, -0.09114607417272744, -0.052552901004017674],
              [-0.12836839955836318, 0.08081190940668663, 0.02664014770883955, 0.07906392851745978, 0.513176133859719, 0.13369911938120618, -0.21078674631355024, -0.0027064628912624425, 0.04143438188034744, 0.19613412836894378, 0.26763418534029265],
              [-0.0480224626614684, 0.03276043172122358, 0.09124845509262787, 0.05470555765353876, 0.1336991193812062, 0.44124749836332827, -0.043377967883065346, 0.018302593355378, 0.2851922695999017, -0.11805040819010776, -0.19472560800690397],
              [-0.06886406875564752, 0.07485362171880014, 0.025622907523148167, -0.008992022336606922, -0.2107867463135503, -0.04337796788306535, 0.8546191070536638, -0.04265722515539969, 0.1633137275875919, 0.10808523737411603, 0.06733876708529103],
              [0.04046562569985317, -0.027502246383933115, -0.011849459121165726, 0.06682515975862707, -0.0027064628912624126, 0.018302593355378014, -0.04265722515539972, 0.6987197080750596, 0.14911240832667114, 0.25689421976827004, -0.04791037128627001],
              [0.03714156517414783, -0.10909283641841855, -0.09351301483316642, 0.06991049787262885, 0.04143438188034744, 0.2851922695999016, 0.16331372758759188, 0.14911240832667116, 0.5916376238705497, -0.12495295168958075, 0.12285074933595318],
              [-0.018402704526518737, -0.03176972346464801, -0.16978156327404034, -0.09114607417272744, 0.19613412836894376, -0.11805040819010776, 0.10808523737411606, 0.25689421976827004, -0.12495295168958077, 0.6403839116892462, -0.0453631108856737],
              [0.07197525245559412, 0.001635024881184706, 0.10678633516502224, -0.052552901004017695, 0.26763418534029276, -0.19472560800690397, 0.06733876708529098, -0.04791037128627005, 0.12285074933595315, -0.045363110885673706, 0.785416519944824]],
              order='F')

cdef int LDA = 11
cdef char* JOBZ = 'N'
cdef char* RANGE = 'I'
cdef char* UPLO = 'L'
cdef int N = 11
cdef double VL = 0.0
cdef double VU = 1.0
cdef int IL = 10
cdef int IU = 11
cdef double ABSTOL = 0.0
cdef int M
cdef double[::1] W = np.empty(11, dtype=float)
cdef double[::1, :] Z = np.empty([11, 11], order='F')
cdef int LDZ = 11
cdef int[::1] ISUPPZ = np.empty(11, dtype=int)
cdef double[::1] WORK = np.empty(363)
cdef int LWORK = 363
cdef int[::1] IWORK = np.empty(110, dtype=int)
cdef int LIWORK = 110
cdef int INFO

dsyevr(JOBZ, RANGE, UPLO, &N, &A[0, 0], &LDA, &VL, &VU, &IL, &IU, &ABSTOL,
       &M, &W[0], &Z[0, 0], &LDZ, &ISUPPZ[0], &WORK[0], &LWORK, &IWORK[0], &LIWORK, &INFO)
print("info : ",INFO)
print("eigval array is : ",np.array(W))
# print("Eigvec array is : ",np.array(Z)) # Not requested
print("Number of eigs found : ", M)

I receive

info :  2
eigval array is :  [1.  1.  0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. ]
Number of eigs found :  0

What is fascinating to me is that max(abs(A - A.T)) is 1.11e-16 hence this almost perfectly symmetric however as we discovered in these cases playing around with UPLO clears the error. Hence there is probably a branching that is not so robust somewhere.

ilayn avatar Apr 25 '21 07:04 ilayn

@ilayn again is this with a numpy/scipy that uses reference BLAS or OpenBLAS ? (In the latter case there might be FMA-induced rounding differences to what @oamarques would see with plain gfortran and reference BLAS (though the same problem might pop up with a sufficiently aggressive optimizing compiler)

martin-frbg avatar Apr 25 '21 11:04 martin-frbg

This is the same machine as the original post (with OpenBLAS 3.11) the users are using MKL and OpenBLAS too. However @oamarques could also reproduce this with his analysis above and I think he is on the reference implementation. I am guessing he will be able to do it again since this much of sensitivity is not expected from these routines as they are generally backed by proper error analysis.

ilayn avatar Apr 25 '21 12:04 ilayn