
DGELSD fails to converge for some matrices

Open MaartenBaert opened this issue 6 months ago • 7 comments

DGELSD fails to converge in some rare cases. I have attached a set of matrices (as binary files) that reproduce the problem. The LHS matrix is 1602x1602, the RHS matrix is 1602x1. I am only able to reliably reproduce the problem with OMP_NUM_THREADS=1.

matrices.zip

Here's a minimal Fortran program that reproduces the problem:

program solve_least_squares
    implicit none
    integer, parameter :: m = 1602, n = 1602, nrhs = 1
    double precision, dimension(m, n) :: lhs
    double precision, dimension(m, nrhs) :: rhs
    double precision, dimension(min(m, n)) :: s
    double precision :: rcond
    integer :: rank, info, lwork, liwork
    double precision, allocatable :: work(:)
    integer, allocatable :: iwork(:)

    ! Declare the LAPACK routine
    interface
        subroutine DGELSD(m, n, nrhs, a, lda, b, ldb, s, rcond, rank, work, lwork, iwork, info)
            integer, intent(in) :: m, n, nrhs, lda, ldb, lwork
            double precision, intent(inout) :: a(lda, *), b(ldb, *), s(*), work(*)
            double precision, intent(in) :: rcond
            integer, intent(out) :: rank, iwork(*), info
        end subroutine DGELSD
    end interface

    ! Use machine precision as the default
    rcond = -1.0d0

    ! Open and read lhs.bin
    print *, "Reading lhs.bin"
    open(unit=10, file="lhs.bin", form="unformatted", access="stream", status="old")
    read(10) lhs
    close(10)

    ! Open and read rhs.bin
    print *, "Reading rhs.bin"
    open(unit=11, file="rhs.bin", form="unformatted", access="stream", status="old")
    read(11) rhs
    close(11)

    ! Workspace query
    print *, "Querying workspace"
    lwork = -1
    allocate(work(1))
    allocate(iwork(1))

    ! Query optimal workspace size
    print *, "Calling DGELSD to get workspace size"
    call DGELSD(m, n, nrhs, lhs, m, rhs, m, s, rcond, rank, work, lwork, iwork, info)
    if (info /= 0) then
        print *, "Error during workspace query: DGELSD failed with info =", info
        stop
    end if

    ! Allocate optimal workspace
    lwork = int(work(1))
    liwork = iwork(1)
    deallocate(work, iwork)
    allocate(work(lwork), iwork(liwork))

    ! Solve the least-squares problem
    print *, "Calling DGELSD to solve the least-squares problem"
    call DGELSD(m, n, nrhs, lhs, m, rhs, m, s, rcond, rank, work, lwork, iwork, info)
    if (info /= 0) then
        print *, "Error: DGELSD failed with info =", info
        stop
    end if

    ! Write the result to res.bin
    print *, "Writing res.bin"
    open(unit=12, file="res.bin", form="unformatted", access="stream", status="replace")
    write(12) rhs(1:n, 1)  ! Write only the solution vector
    close(12)

    ! Deallocate workspace
    deallocate(work, iwork)

    print *, "Solution written to res.bin"
end program solve_least_squares

When the problem occurs, it produces the following error:

 Error: DGELSD failed with info =           1
Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL

... which indicates that the SVD failed to converge.
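For anyone coming here from the SciPy side, a rough Python equivalent of the Fortran reproducer can be sketched with NumPy, whose `lstsq` also drives DGELSD internally and raises `LinAlgError` when it returns a nonzero `info`. The filenames and the raw column-major `float64` layout are assumptions carried over from the Fortran program above; the `__main__` block uses a small synthetic system so the sketch runs standalone.

```python
import numpy as np

def load_fortran_matrix(path, rows, cols):
    """Read a raw stream of float64 values written by Fortran (column-major)."""
    data = np.fromfile(path, dtype=np.float64, count=rows * cols)
    return data.reshape((rows, cols), order="F")

def solve_least_squares(lhs, rhs):
    """Solve min ||lhs @ x - rhs|| via np.linalg.lstsq (DGELSD underneath)."""
    try:
        x, residuals, rank, s = np.linalg.lstsq(lhs, rhs, rcond=None)
        return x
    except np.linalg.LinAlgError as e:
        # DGELSD returned info > 0: the SVD failed to converge
        print("DGELSD failed:", e)
        return None

if __name__ == "__main__":
    # Standalone demo on a small well-behaved system; to attempt the actual
    # reproduction, load the attached binaries instead:
    #   lhs = load_fortran_matrix("lhs.bin", 1602, 1602)
    #   rhs = load_fortran_matrix("rhs.bin", 1602, 1)
    rng = np.random.default_rng(0)
    lhs = rng.standard_normal((6, 6))
    rhs = rng.standard_normal((6, 1))
    x = solve_least_squares(lhs, rhs)
    print(np.allclose(lhs @ x, rhs))
```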

I can reproduce the problem with OpenBLAS versions v0.3.18 through v0.3.30. I bisected it down to this commit:

47ba85f314808476c8254779389607f9af60231f is the first bad commit
commit 47ba85f314808476c8254779389607f9af60231f
Author: Martin Kroeker <[email protected]>
Date:   Thu Jul 22 17:24:15 2021 +0200

    Fix regex to match kernels suffixed with cpuname too

 cmake/utils.cmake | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

... which had the effect of enabling FMA instructions for many files.

This problem seems to be highly sensitive to floating point rounding errors. Even tiny changes to the input matrix make the problem go away, as does setting OMP_NUM_THREADS to anything other than 1. I don't know whether the introduction of FMA instructions actually caused this problem, or merely changed the rounding errors slightly such that my example matrix no longer works to trigger the problem.

I have compiled OpenBLAS with make TARGET=HASWELL on Red Hat Enterprise Linux release 8.10 (Ootpa). The pre-compiled OpenBLAS library included with Scipy shares the same issue. The corresponding Scipy issue can be found here: https://github.com/scipy/scipy/issues/23220

MaartenBaert avatar Jun 23 '25 15:06 MaartenBaert

On closer inspection, it looks like the first commit that caused the problem was actually this one:

49bbf330ca592f439a07f24f137e61af1cc9c616 is the first bad commit
commit 49bbf330ca592f439a07f24f137e61af1cc9c616
Author: Martin Kroeker <[email protected]>
Date:   Sun Jul 18 22:19:19 2021 +0200

    Empirical workaround for numpy SVD NaN problem from issue 3318

 kernel/Makefile.L2 | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

So perhaps this problem is related to issue #3318?

It looks like the only file affected by this change is kernel/x86_64/dgemv_t_4.c. When this file is compiled with -mfma, the SVD fails to converge. When compiled without -mfma, it works. Unfortunately this is the exact opposite of issue #3318.

MaartenBaert avatar Jun 23 '25 16:06 MaartenBaert

Just to clarify - are you building with CMake or plain make? The file from your first bisect is only needed in one, and the file from your second only in the other...

martin-frbg avatar Jun 23 '25 16:06 martin-frbg

I am using plain make. I have also tried manually recompiling just the one file kernel/x86_64/dgemv_t_4.c with and without -mfma and confirmed that this is enough to make the problem appear or disappear.

I may have had issues with incomplete rebuilds the first time I ran the bisect. The second time I completely cleaned the repository each time, so it should be accurate.

MaartenBaert avatar Jun 23 '25 17:06 MaartenBaert

Apparently the FMA flag is a red herring - I managed to reproduce the problem with or without FMA enabled by writing a bidiagonalized matrix generated by an affected version to a file, and then using that matrix as the input of either DGELSD or DLALSD in other (previously unaffected) versions of OpenBLAS, which reproduces the problem more reliably. It also makes reproducing the problem independent of OMP_NUM_THREADS.

Here's the bidiagonal matrix that triggers the problem (D = main diagonal, E = super-diagonal): bidiag_de.zip
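To experiment with a D/E pair like this in Python, one can assemble the upper-bidiagonal matrix and feed it to an SVD. This is only an approximation of the failing path (NumPy's `svd` goes through DGESDD rather than DLALSD), but both routes end up in the same DLASDx divide-and-conquer family on the bidiagonal form. The tiny D/E values below are synthetic placeholders, not the attached data.

```python
import numpy as np

def bidiagonal_from_de(d, e):
    """Assemble the upper-bidiagonal matrix with main diagonal d and super-diagonal e."""
    return np.diag(d) + np.diag(e, k=1)

if __name__ == "__main__":
    # Synthetic example; for the actual reproduction, load the D and E arrays
    # from the attached bidiag_de.zip instead.
    d = np.array([2.0, 1.5, 1.0, 0.5])
    e = np.array([0.1, 0.1, 0.1])
    B = bidiagonal_from_de(d, e)
    s = np.linalg.svd(B, compute_uv=False)
    print(np.all(np.diff(s) <= 0))  # singular values come back sorted descending
```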

MaartenBaert avatar Jun 23 '25 18:06 MaartenBaert

Agreed that it has nothing to do with the FMA flag; the original "effect" is not reproducible at -O1 -mfma instead of the default -O2 (or the conventional -O3 for CMake "Release" builds). I'm currently trying to find the exact GCC option switched on at -O2 that triggers it, but it might be equally enlightening to see whether the problem occurs with the original Reference-LAPACK (and its BLAS) as well - this could mean that either the matrix is ill-conditioned, or the DGELSD algorithm is ill-equipped to handle it.

martin-frbg avatar Jun 23 '25 19:06 martin-frbg

Blind shot: in Reference-LAPACK, the algorithm swap of NRM2 changed the convergence behavior of some matrices (in both directions: some that previously converged started to diverge, and some that previously diverged started to converge). If the matrix is sensitive to small roundoff errors, you can try swapping out NRM2.
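As a minimal illustration of why the choice of NRM2 algorithm matters at all (this is just the textbook naive formulation versus the classic scaled one, not the actual reference code):

```python
import math

def nrm2_naive(x):
    """Textbook two-norm: can overflow or underflow for extreme inputs."""
    return math.sqrt(sum(v * v for v in x))

def nrm2_scaled(x):
    """Scaled two-norm in the spirit of the classic LAPACK DNRM2:
    factor out the largest magnitude before squaring."""
    scale = max(abs(v) for v in x)
    if scale == 0.0:
        return 0.0
    return scale * math.sqrt(sum((v / scale) ** 2 for v in x))

# The two agree for benign inputs...
print(nrm2_naive([3.0, 4.0]), nrm2_scaled([3.0, 4.0]))  # 5.0 5.0
# ...but the naive version overflows where the scaled one is fine,
# and in general the two round differently in the last ulp:
print(nrm2_naive([1e200, 1e200]))   # inf
print(nrm2_scaled([1e200, 1e200]))  # ~1.4142e200
```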

Also, I second Martin, it would be good to know if the problem reproduces with reference-LAPACK + reference-BLAS. If so, is the 1602x1602 problem the smallest size (the title mentions "some matrices")?

angsch avatar Jun 23 '25 19:06 angsch

It turns out that with the bidiagonal form of the matrix I can reproduce the problem with the latest reference LAPACK+BLAS as well.

I have not yet found a smaller test case that triggers the problem, only larger ones. Also, I found that only the LHS is relevant; the RHS can be zero and the problem will still trigger.

I haven't been able to reproduce the problem with MKL so far.

MaartenBaert avatar Jun 23 '25 19:06 MaartenBaert

It appears that the decisive difference in compiler options between the default -O2 and a build at -O1 that passes the test is -fexpensive-optimizations, but I have not yet checked whether this matters specifically for the compilation of GEMV or for one of the other functions in the call graph (like NRM2, as @angsch mentioned). I think we've seen this effect before, somewhere in Reference-LAPACK code.

martin-frbg avatar Jun 24 '25 07:06 martin-frbg

hmm.. the -fexpensive-optimizations-related problem in Reference-LAPACK was with NRM2 on Apple M1 (issue 1073 there), despite BLAS being Fortran code there as well, and also using a revised algorithm for NRM2...

martin-frbg avatar Jun 24 '25 09:06 martin-frbg

@martin-frbg Did you test this with the original example, or with the bidiagonal matrix I shared later? For reference, here's the updated example that uses the bidiagonal matrix, which triggers the problem reliably for me: example.zip This example fails for me with reference LAPACK with or without optimization.

MaartenBaert avatar Jun 24 '25 10:06 MaartenBaert

That was still with the original testcase. The bidiagonal case seems to fail no matter what variations in compiler options or NRM2 code I throw at it. (If I replace the Haswell DNRM2 kernel with a simple C version, the testcase produces an additional IEEE_DIVIDE_BY_ZERO exception raised by "ETA = B / A" in line 350 of LAPACK's DLAED6.f, but that may be an unrelated oddity. Though it could perhaps explain why DGELSD seems to be vulnerable to DNRM2's precision: the next line after this division checks whether A is less than or equal to zero...)

martin-frbg avatar Jun 24 '25 13:06 martin-frbg

(Fixing that division in DLAED6 doesn't help, and the IEEE_DENORMAL message doesn't provide a lead either - that is just DLAMCH computing the smallest "safely" usable number. The IEEE_UNDERFLOW comes from scaling a 1.e-298 matrix element by 1.e-14 in DLASR line 273.)

martin-frbg avatar Jun 24 '25 14:06 martin-frbg

I have narrowed down the problem to a convergence failure in DLASD4, which is a root-finding algorithm for calculating updated singular values. It is failing for a subproblem with N=233 on the 4th singular value. The reason for the failure seems to be that the caller (DLASD6) has violated the precondition that the original singular values D must be in sorted order. The first 10 values of D are:

0.0
0.9999999970963765
0.9999999970963985
0.99999999709642
0.9999999970964196
0.999999997096431
0.9999999970964446
0.9999999970964584
0.9999999970964952
0.9999999970965138

Note that the 5th entry is smaller than the 4th, which is not allowed. These values come directly from DLASD7, which is responsible for merging and sorting singular values. So I suspect that's the real source of the problem. The documentation states that the output of DLASD7 will be in sorted order, but in this case it isn't.
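A trivial way to spot this kind of precondition violation (this is just the sanity check I bolted onto the caller, not LAPACK code), using the values above:

```python
def first_order_violation(d):
    """Return the first index i where d[i] < d[i-1], or None if nondecreasing."""
    for i in range(1, len(d)):
        if d[i] < d[i - 1]:
            return i
    return None

d = [0.0,
     0.9999999970963765,
     0.9999999970963985,
     0.99999999709642,
     0.9999999970964196,  # smaller than its predecessor: precondition violated
     0.999999997096431,
     0.9999999970964446]
print(first_order_violation(d))  # 4 (0-based), i.e. the 5th entry
```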

MaartenBaert avatar Jun 24 '25 18:06 MaartenBaert

That then would seem to be a failure of DLAMRG (which is supposed to generate the index array to turn D into a sorted list, if I understand the comments correctly?). At least this function does not make use of any BLAS calls...

martin-frbg avatar Jun 24 '25 18:06 martin-frbg

Indeed the output of DLAMRG is out of order; however, I see that one of the two input arrays given to DLAMRG was already out of order. I added checks to DLAMRG to flag out-of-order inputs, and I now see that there were out-of-order singular values throughout much of the SVD algorithm; it just didn't lead to immediate convergence failure.

I traced the first instance of out-of-order singular values back to the final part of DLASD6, where it first calls DLASD8 to calculate updated singular values (stored in D), then rescales them with DLASCL, then sorts them with DLAMRG. The scaling does not affect the ordering; I checked that the order was already wrong before scaling. It seems that assumptions are made about the ordering of the values in D produced by DLASD8 which are incorrect in this case. Here, the inputs and outputs of DLAMRG are:

in:
N1=          19
N2=          30
A=   3.9183618637069977E-002   3.9183618637070504E-002   3.9183618637071337E-002   3.9183618637072072E-002   3.9183618637072183E-002   3.9183618637075382E-002   3.9183618637077214E-002   3.9183618637078386E-002   3.9183618637079594E-002   3.9183618637081974E-002   3.9183618637084576E-002   3.9183618637088136E-002   3.9183618637090051E-002   3.9183618637092646E-002   3.9183618637094769E-002   3.9183618637095977E-002   3.9183618637098537E-002   3.9183618637099779E-002   3.9183618637100404E-002   3.9183618637157844E-002   3.9183618637153889E-002   3.9183618637148108E-002   3.9183618637146839E-002   3.9183618637136652E-002   3.9183618637126709E-002   3.9183618637120672E-002   3.9183618637109729E-002   3.9183618637106919E-002   3.9183618637097718E-002   3.9183618637071635E-002   3.9183618637071101E-002   3.9183618637071156E-002   3.9183618637071156E-002   3.9183618637070830E-002   3.9183618637071031E-002   3.9183618637070670E-002   3.9183618637070733E-002   3.9183618637070455E-002   3.9183618637070435E-002   3.9183618637070407E-002   3.9183618637070379E-002   3.9183618637070282E-002   3.9183618637070268E-002   3.9183618637070108E-002   3.9183618637070150E-002   3.9183618637070039E-002   3.9183618637070025E-002   3.9183618637069921E-002   3.9183618637070004E-002
DTRD1=           1
DTRD2=          -1

out:
INDEX=           1          49          48          47          46          45          44          43          42          41          40          39          38           2          37          36          35          34          33          32          31           3          30           4           5           6           7           8           9          10          11          12          13          14          15          16          29          17          18          19          28          27          26          25          24          23          22          21          20

The out-of-order values are all extremely close to each other (near machine precision), so this looks like a subtle floating-point issue. The out-of-order values are in the second half of D, which looks like it was calculated by DLASD7 and not touched by DLASD8.

Curiously, the documentation for DLASD7 states:

*>          D is DOUBLE PRECISION array, dimension ( N )
*>         On entry D contains the singular values of the two submatrices
*>         to be combined. On exit D contains the trailing (N-K) updated
*>         singular values (those which were deflated) sorted into
*>         increasing order.

...which seems to be plain wrong: looking at the actual output values, it is clear that the trailing N-K values of D are sorted in decreasing order (except for those values affected by rounding errors). This is also illustrated by the fact that the DTRD2 argument of DLAMRG is -1 rather than 1, indicating that the second input array is expected to be in descending order.

DLASD7 output:

K=          19
N=          49
D=  0.99999999999775857       0.99999999999775591       0.99999999999775802       0.99999999999775857       0.99999999999775890       0.99999999999776068       0.99999999999776179       0.99999999999776190       0.99999999999776490       0.99999999999776512       0.99999999999776756       0.99999999999776834       0.99999999999776901       0.99999999999776967       0.99999999999777500       0.99999999999777667       0.99999999999777922       0.99999999999778422       0.99999999999778599       0.99999999999999989       0.99999999999989886       0.99999999999975131       0.99999999999971889       0.99999999999945899       0.99999999999920519       0.99999999999905109       0.99999999999877187       0.99999999999870026       0.99999999999846545       0.99999999999779976       0.99999999999778599       0.99999999999778755       0.99999999999778744       0.99999999999777922       0.99999999999778422       0.99999999999777500       0.99999999999777667       0.99999999999776967       0.99999999999776901       0.99999999999776834       0.99999999999776756       0.99999999999776512       0.99999999999776490       0.99999999999776068       0.99999999999776179       0.99999999999775890       0.99999999999775857       0.99999999999775591       0.99999999999775802     

The first 19 values of D should be ignored here since they will be overwritten by DLASD8, but the remaining 30 values are clearly in decreasing order except for those values affected by rounding errors.
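For reference, the DLAMRG contract as I read it can be sketched in Python: it merges two adjacent sublists of D, each sorted in the direction given by its stride argument (1 for ascending, -1 for descending), into a single permutation index that makes the whole list ascending. This is a hypothetical 0-based rendering, not the Fortran code; it assumes both halves really are sorted in their stated direction, which is exactly the precondition being violated here.

```python
def dlamrg(d, n1, n2, dtrd1, dtrd2):
    """Merge index for d[0:n1] (direction dtrd1) and d[n1:n1+n2] (direction
    dtrd2). Returns a permutation that would make d ascending. 0-based sketch
    of LAPACK's DLAMRG."""
    # Start each half at its smallest end and walk toward the largest.
    i = 0 if dtrd1 > 0 else n1 - 1
    j = n1 if dtrd2 > 0 else n1 + n2 - 1
    si = 1 if dtrd1 > 0 else -1
    sj = 1 if dtrd2 > 0 else -1
    out = []
    remaining1, remaining2 = n1, n2
    while remaining1 and remaining2:
        if d[i] <= d[j]:
            out.append(i); i += si; remaining1 -= 1
        else:
            out.append(j); j += sj; remaining2 -= 1
    while remaining1:
        out.append(i); i += si; remaining1 -= 1
    while remaining2:
        out.append(j); j += sj; remaining2 -= 1
    return out

d = [1.0, 3.0, 5.0,   6.0, 4.0, 2.0]  # first half ascending, second descending
idx = dlamrg(d, 3, 3, 1, -1)
print([d[k] for k in idx])  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

If the second half is not actually descending, the merge silently produces an out-of-order permutation, which is what the INDEX output above shows.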

MaartenBaert avatar Jun 25 '25 09:06 MaartenBaert

Doesn't look like either of the files has received any pertinent changes in modern history (i.e. since they have been under version control). Does it look to you like the sorting direction changes halfway through the input (or could the "increasing" simply be a documentation error from ~20 years ago)?

martin-frbg avatar Jun 25 '25 10:06 martin-frbg

I think there's both a documentation error and an error in the implementation of DLASD7. The code is difficult to read, but I suspect that under some circumstances (heavy deflation) it ends up outputting deflated values in the wrong order (i.e. the trailing part of D is not consistently in decreasing order, as DLASD6 assumes). This creates a cascading error: mis-ordered output from DLASD7 gets mis-merged by DLAMRG, which then produces more mis-ordered input for DLASD7 in the next layer of the divide-and-conquer tree.

Surprisingly, this doesn't seem to break the algorithm immediately, because most of the time the mis-ordered input of DLASD7 doesn't make it past the deflation stage (which removes near-duplicate singular values), so the input of DLASD8 ends up being correctly ordered. However, in my test case the misordering eventually gets severe enough for a mis-ordered value to slip past the deflation step in DLASD7 and make its way to DLASD8, which then causes the root finder DLASD4 to diverge when it is fed the incorrectly ordered singular value.

The crucial line that allows this to happen is this line in DLASD7:

IF( ABS( D( J )-D( JPREV ) ).LE.TOL ) THEN

The ABS() is actually redundant here, since D is supposed to be sorted, but it has the unfortunate side effect that mis-sorted values can slip through if their difference is greater than TOL. If I replace this line with IF( ( D( J )-D( JPREV ) ).LE.TOL ) THEN, ensuring that mis-sorted values always get deflated, the failure actually goes away! The result of DGELSD is reasonably accurate, though it deviates by about 1e-12 from the result of DGELSY and from the result I get when computing the SVD manually to solve the least-squares problem, so I suspect there is some loss of accuracy due to incorrect deflation. I think the original cause of the mis-ordering still has to be fixed to get an accurate SVD, but removing the ABS() seems like a net improvement, since it is pointless in the correctly-ordered case and harmful when there is mis-ordering.
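To make the effect of the ABS() concrete, here is the difference in toy form (hypothetical values, with the gap and tolerance exaggerated for clarity):

```python
TOL = 1e-8  # placeholder for DLASD7's deflation tolerance

def deflate_with_abs(d_prev, d_j):
    # Original DLASD7 test: the pair is "close" if |difference| <= TOL.
    return abs(d_j - d_prev) <= TOL

def deflate_signed(d_prev, d_j):
    # Proposed test: any mis-ordered pair (negative difference) also deflates.
    return (d_j - d_prev) <= TOL

# A mis-sorted pair whose gap is larger than TOL:
d_prev, d_j = 1.0, 1.0 - 5e-7
print(deflate_with_abs(d_prev, d_j))  # False: slips through to DLASD4
print(deflate_signed(d_prev, d_j))    # True: gets deflated
```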

Update: I think I found the particular DLASD7 invocation that created the initial mis-ordering. The deflation code is quite complicated and there are multiple paths by which deflation can happen. The algorithm tracks two indices, J (the current index being inspected) and JPREV (one of the previous indices). In some cases it deflates J and in other cases it deflates JPREV. This causes deflated values to be reordered sometimes. I traced how values are moved around in the failing instances:

-- DLASD7 deflated J    =           3 to IDXP          49
-- DLASD7 deflated JPREV=           2 to IDXP          48
-- DLASD7 deflated JPREV=           4 to IDXP          47
-- DLASD7 deflated JPREV=           5 to IDXP          46
-- DLASD7 deflated J    =           7 to IDXP          45
-- DLASD7 deflated JPREV=           6 to IDXP          44
-- DLASD7 deflated J    =           9 to IDXP          43
-- DLASD7 deflated J    =          10 to IDXP          42
-- DLASD7 deflated J    =          11 to IDXP          41
-- DLASD7 deflated J    =          12 to IDXP          40
-- DLASD7 deflated JPREV=          13 to IDXP          39
-- DLASD7 deflated JPREV=          14 to IDXP          38
-- DLASD7 deflated J    =          16 to IDXP          37
-- DLASD7 deflated JPREV=          15 to IDXP          36
-- DLASD7 deflated J    =          18 to IDXP          35
-- DLASD7 deflated JPREV=          17 to IDXP          34
-- DLASD7 deflated J    =          20 to IDXP          33
-- DLASD7 deflated J    =          21 to IDXP          32
-- DLASD7 deflated JPREV=          19 to IDXP          31
-- DLASD7 deflated J    =          23 to IDXP          30
-- DLASD7 deflated JPREV=          37 to IDXP          29
-- DLASD7 deflated J    =          41 to IDXP          28
-- DLASD7 deflated J    =          42 to IDXP          27
-- DLASD7 deflated J    =          43 to IDXP          26
-- DLASD7 deflated J    =          44 to IDXP          25
-- DLASD7 deflated J    =          45 to IDXP          24
-- DLASD7 deflated J    =          46 to IDXP          23
-- DLASD7 deflated J    =          47 to IDXP          22
-- DLASD7 deflated J    =          48 to IDXP          21
-- DLASD7 deflated J    =          49 to IDXP          20

There is mis-ordering in the following places:

  • input 3,2 -> output 49,48
  • input 7,6 -> output 45,44
  • input 16,15 -> output 37,36
  • input 18,17 -> output 35,34
  • input 20,21,19 -> output 33,32,31

When these mis-ordered values are eventually passed to DLAMRG, the additional checks I added report the following:

 -- DLAMRG input2 out-of-order at I=          12,13
 -- DLAMRG input2 out-of-order at I=          15,16
 -- DLAMRG input2 out-of-order at I=          17,18
 -- DLAMRG input2 out-of-order at I=          25,26
 -- DLAMRG input2 out-of-order at I=          29,30

These are indices into the second part of D so you have to add 19 (because N1=19, N2=30) to get the indices for the full output array produced by DLASD7. These match exactly with the locations where DLASD7 did its strange out-of-order deflation.

MaartenBaert avatar Jun 25 '25 10:06 MaartenBaert

That's a very interesting result. Both the documentation error and the ABS() are already in LAPACK 3.1.0 from 2006, so it would probably be hard to find out why they are there, e.g. whether the ABS() was just "good practice" or put there to fix some other bug. (~~I've found a site that purports to store versions down to 1.0.0, but they are java archives so I'm not sure yet if they are actually original sources or a java project loosely based on them~~ nvm, that was just a package of java wrappers for lapack functions). It certainly looks like you've solved a 20- (if not 30-) year-old mystery.

martin-frbg avatar Jun 25 '25 12:06 martin-frbg

I've created a pull request for Reference-LAPACK; would you like me to do the same for OpenBLAS, or do you prefer to wait for Reference-LAPACK to merge it first?

MaartenBaert avatar Jun 25 '25 12:06 MaartenBaert

As you like - I've copied your fix locally but will probably not have any more spectacular ideas than running make lapack-test too. Currently looking through open LAPACK issues to see if this might solve something else as well - I think there was one weird SVD bug two or three years ago.

martin-frbg avatar Jun 25 '25 13:06 martin-frbg