Failing tests for truncated QR routine in coverage build

Open ACSimon33 opened this issue 2 years ago • 10 comments

Description

Our coverage build broke after the upgrade to 3.12.0, which led me to the bug in the lapack_testing.py script (see #954). Fixing that bug revealed that some tests fail only in the coverage build:

                        -->   LAPACK TESTING SUMMARY  <--
                Processing LAPACK Testing output found in the TESTING directory
SUMMARY                 nb test run     numerical error         other error  
================        ===========     =================       ================  
REAL                    1328283         36885   (2.777%)        0       (0.000%)
DOUBLE PRECISION        1329105         36885   (2.775%)        0       (0.000%)
COMPLEX                 788035          36885   (4.681%)        0       (0.000%)
COMPLEX16               1029705         1       (0.000%)        0       (0.000%)

--> ALL PRECISIONS      4475128         110656  (2.473%)        0       (0.000%)

These tests are all related to the truncated QR routines:

testing_results.txt: SQK:  36885 out of 241365 tests failed to pass the threshold
testing_results.txt: DQK:  36885 out of 241365 tests failed to pass the threshold
testing_results.txt: CQK:  36885 out of 241695 tests failed to pass the threshold
Test ratios:
    1: 2-norm(svd(A) - svd(R)) / ( max(M,N) * 2-norm(svd(R)) * EPS )
    2: 1-norm( A*P - Q*R ) / ( max(M,N) * 1-norm(A) * EPS )
    3: 1-norm( I - Q'*Q ) / ( M * EPS )
    4: Returns 1.0D+100, if abs(R(K+1,K+1)) > abs(R(K,K)), where K=1:KFACT-1
    5: 1-norm(Q**T * B - Q**T * B ) / ( M * EPS )
 Messages:
 DGEQP3RK M =    2, N =    2, NRHS =    1, KMAX =    2, ABSTOL = -1.0000    , RELTOL = -1.0000    , NB =   1, NX =   1, type  2, test  4, ratio = 0.15179E+73
 DGEQP3RK M =    2, N =    2, NRHS =    1, KMAX =    3, ABSTOL = -1.0000    , RELTOL = -1.0000    , NB =   1, NX =   1, type  2, test  4, ratio = 0.15179E+73
 DGEQP3RK M =    2, N =    2, NRHS =    1, KMAX =    2, ABSTOL = -1.0000    , RELTOL = -1.0000    , NB =   3, NX =   0, type  2, test  4, ratio = 0.15179E+73
 DGEQP3RK M =    2, N =    2, NRHS =    1, KMAX =    3, ABSTOL = -1.0000    , RELTOL = -1.0000    , NB =   3, NX =   0, type  2, test  4, ratio = 0.15179E+73
 DGEQP3RK M =    2, N =    2, NRHS =    1, KMAX =    2, ABSTOL = -1.0000    , RELTOL = -1.0000    , NB =   3, NX =   5, type  2, test  4, ratio = 0.15179E+73

It is always the 4th test that fails, for all kinds of matrices. Oddly, the COMPLEX16 routines don't have this issue, and if I build without LAPACKE the COMPLEX tests are also fine. To reproduce the issue, build with -DCMAKE_BUILD_TYPE=coverage.


Hi @scr2016, I guess you know the most about these routines. Do you have any ideas about what might go wrong here?

ACSimon33 avatar Dec 01 '23 14:12 ACSimon33

@ACSimon33 In my environment I have reproduced these failures even without a coverage build.

The root cause looks like the uninitialized variable RESULT( 4 ) inside the test routine, for example in TESTING/LIN/dchkqp3rk.f. It initially contains garbage and can stay uninitialized, since the following condition is false

              IF( DTEMP.LT.ZERO ) THEN
                  RESULT( 4 ) = BIGNUM
              END IF

in the normal case.

That is why the final check against the threshold

IF( RESULT( 4 ).GE.THRESH ) THEN

is always true, which makes every test fail.

Somewhere above we should set

RESULT( 4 ) = ZERO
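
The effect of the missing initialization can be illustrated with a small Python sketch. It is not the LAPACK test code: `run_test_4`, the `stale` argument, and the values chosen for `BIGNUM` and `THRESH` are assumptions made for illustration, and the leftover memory contents are simulated by passing in an arbitrary stale value.

```python
BIGNUM = 1.0e100   # assumed stand-in for the value documented for test 4
THRESH = 30.0      # assumed threshold, chosen only for this sketch


def run_test_4(dtemps, stale, initialize):
    """Mimic the RESULT(4) logic for one test case.

    dtemps     -- scaled differences of consecutive diagonal magnitudes
    stale      -- simulated leftover contents of RESULT(4)
    initialize -- apply the proposed fix (RESULT(4) = ZERO) first
    """
    result4 = 0.0 if initialize else stale
    for dtemp in dtemps:
        if dtemp < 0.0:          # a diagonal magnitude increased
            result4 = BIGNUM
    return result4


def fails(result4):
    """Final threshold check, as in IF( RESULT( 4 ).GE.THRESH )."""
    return result4 >= THRESH


good = [0.5, 0.2, 0.0]   # non-increasing diagonal: nothing is ever assigned
garbage = 0.15179e73     # ratio reported in the failure messages above

print(fails(run_test_4(good, garbage, initialize=False)))    # spurious failure
print(fails(run_test_4(good, garbage, initialize=True)))     # passes
print(fails(run_test_4([0.5, -0.1], 0.0, initialize=True)))  # real violation
```

Without the initialization the outcome depends entirely on whatever value happens to be in the slot, which would also be consistent with the failures showing up only for some builds.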

dklyuchinskiy avatar Dec 14 '23 06:12 dklyuchinskiy

@dklyuchinskiy Nice catch! Should I create a MR or do you want to do that?

ACSimon33 avatar Dec 15 '23 08:12 ACSimon33

@ACSimon33 I would be glad if you created the MR and checked the fix with the coverage build; I have not worked with it before.

Also, I am confused by some other places in the test.

  1. According to the documentation, condition 4 is
Returns 1.0D+100 if abs(R(K+1,K+1)) > abs(R(K,K)),  K=1:KFACT-1
The elements on the diagonal of R should be non-increasing.

But after that we check the condition

                        DTEMP = (( ABS( A( (J-1)*M+J ) ) -
     $                          ABS( A( (J)*M+J+1 ) ) ) /
     $                          ABS( A(1) ) )

The indices seem to point to sub-diagonal elements of A (or R). Does that match the documentation?

  2. In the formula above we should use LDA instead of M, I guess.

Please correct me if I am wrong.

dklyuchinskiy avatar Dec 15 '23 09:12 dklyuchinskiy

@dklyuchinskiy I think the indices are actually pointing to the diagonal since Fortran is 1-indexed. So, for example if M=10 and J=1 it will be (A(1) - A(12))/A(1), which is the first diagonal element minus the second one scaled by the first. So, the test itself is correct.

I agree that we should use LDA even if it doesn't make a difference for the test (LDA=max(1,M)) because the test is only executed if the matrix rank is greater than 2.
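
The index arithmetic can be checked with a short Python sketch of column-major storage. The helper names `linear_index` and `diagonal_ratios` are hypothetical and the matrix values are made up; only the two index expressions come from the test code quoted above.

```python
def linear_index(i, j, lda):
    """1-based linear index of element (i, j) in column-major storage."""
    return (j - 1) * lda + i


def diagonal_ratios(a, lda, kfact):
    """Evaluate DTEMP for J = 1 .. KFACT-1 on a flat column-major list.

    `a` is 0-based in Python, so every 1-based Fortran index is shifted by 1.
    """
    ratios = []
    for j in range(1, kfact):
        d_j = abs(a[linear_index(j, j, lda) - 1])
        d_next = abs(a[linear_index(j + 1, j + 1, lda) - 1])
        ratios.append((d_j - d_next) / abs(a[0]))
    return ratios


# The expressions from the test, with M as the stride, hit the diagonal:
M = 10
for J in range(1, M):
    assert (J - 1) * M + J == linear_index(J, J, M)
    assert J * M + J + 1 == linear_index(J + 1, J + 1, M)

print(linear_index(1, 1, M), linear_index(2, 2, M))  # 1 12

# A 3x3 upper-triangular R stored column by column with LDA = 3.
r = [4.0, 0.0, 0.0,
     1.0, -2.0, 0.0,
     5.0, 3.0, 1.0]
print(diagonal_ratios(r, lda=3, kfact=3))  # [0.5, 0.25]
```

If LDA were larger than M, a stride of M would no longer land on the diagonal, which is why LDA is the safer choice even though the two coincide in this test.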

ACSimon33 avatar Dec 15 '23 18:12 ACSimon33

This bug is currently being fixed. It concerns test number 4 and does not currently affect the correctness of the routine's code or results. The test should check (with some care) whether the absolute values of the diagonal elements are non-increasing.

@ACSimon33 could you please provide:

  1. the information about your system environment;
  2. if the failing-test report in your original bug report is not complete (i.e. truncated), the full output.

Thank you.

scr2016 avatar Dec 15 '23 20:12 scr2016

Hi @scr2016, please have a look at the PR which is linked in this issue. The problem was just an uninitialized RESULT vector as far as I can tell. At least it fixed the issue on my side and all tests are passing now.

I can reproduce the old errors tomorrow if you think that it’s still necessary.

ACSimon33 avatar Dec 15 '23 20:12 ACSimon33

@ACSimon33. The complete test error output and the environment information would help to check the issue thoroughly.

Thank you in advance.

scr2016 avatar Dec 15 '23 21:12 scr2016

@ACSimon33 Yep, thank you for the explanation! You are right! My fault :)

dklyuchinskiy avatar Dec 18 '23 07:12 dklyuchinskiy

@scr2016 Here are the complete test results: LAPACK_test_results.txt

I compiled with GCC 13.2 on CentOS Linux 7 (Core). The issues only appear in the coverage build for me:

mkdir build && cd build
cmake -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DCMAKE_Fortran_COMPILER=gfortran -DCMAKE_BUILD_TYPE=coverage ..
make -j8
ctest -j8

ACSimon33 avatar Dec 18 '23 12:12 ACSimon33

@scr2016 I tried several more GCC versions (4.8.5, 5.5.0, 6.5.0, 7.5.0, 8.4.0, 9.3.0, 10.3.0, 12.2.0, 13.2.0). The issue only exists for GCC >= 7.5.0.

ACSimon33 avatar Dec 18 '23 12:12 ACSimon33