Failing tests for truncated QR routine in coverage build
Description
Our coverage build broke after the upgrade to 3.12.0, which led me to the bug in the lapack_testing.py script (see #954). After fixing that bug, it turned out that some tests fail only in the coverage build:
--> LAPACK TESTING SUMMARY <--
Processing LAPACK Testing output found in the TESTING directory
SUMMARY              nb test run    numerical error      other error
================     ===========    =================    ================
REAL                 1328283        36885 (2.777%)       0 (0.000%)
DOUBLE PRECISION     1329105        36885 (2.775%)       0 (0.000%)
COMPLEX              788035         36885 (4.681%)       0 (0.000%)
COMPLEX16            1029705        1 (0.000%)           0 (0.000%)
--> ALL PRECISIONS   4475128        110656 (2.473%)      0 (0.000%)
These tests are all related to the truncated QR routines:
testing_results.txt: SQK: 36885 out of 241365 tests failed to pass the threshold
testing_results.txt: DQK: 36885 out of 241365 tests failed to pass the threshold
testing_results.txt: CQK: 36885 out of 241695 tests failed to pass the threshold
Test ratios:
1: 2-norm(svd(A) - svd(R)) / ( max(M,N) * 2-norm(svd(R)) * EPS )
2: 1-norm( A*P - Q*R ) / ( max(M,N) * 1-norm(A) * EPS )
3: 1-norm( I - Q'*Q ) / ( M * EPS )
4: Returns 1.0D+100, if abs(R(K+1,K+1)) > abs(R(K,K)), where K=1:KFACT-1
5: 1-norm(Q**T * B - Q**T * B ) / ( M * EPS )
Messages:
DGEQP3RK M = 2, N = 2, NRHS = 1, KMAX = 2, ABSTOL = -1.0000 , RELTOL = -1.0000 , NB = 1, NX = 1, type 2, test 4, ratio = 0.15179E+73
DGEQP3RK M = 2, N = 2, NRHS = 1, KMAX = 3, ABSTOL = -1.0000 , RELTOL = -1.0000 , NB = 1, NX = 1, type 2, test 4, ratio = 0.15179E+73
DGEQP3RK M = 2, N = 2, NRHS = 1, KMAX = 2, ABSTOL = -1.0000 , RELTOL = -1.0000 , NB = 3, NX = 0, type 2, test 4, ratio = 0.15179E+73
DGEQP3RK M = 2, N = 2, NRHS = 1, KMAX = 3, ABSTOL = -1.0000 , RELTOL = -1.0000 , NB = 3, NX = 0, type 2, test 4, ratio = 0.15179E+73
DGEQP3RK M = 2, N = 2, NRHS = 1, KMAX = 2, ABSTOL = -1.0000 , RELTOL = -1.0000 , NB = 3, NX = 5, type 2, test 4, ratio = 0.15179E+73
It is always the 4th test which fails for all kinds of matrices. Weirdly, the COMPLEX16 routines don't have that issue and if I build without LAPACKE the COMPLEX tests are also fine. To reproduce this issue just build with -DCMAKE_BUILD_TYPE=coverage.
Hi @scr2016, I guess you know the most about these routines. Do you have any ideas about what might go wrong here?
@ACSimon33 In my environment, I reproduced these failures even in a non-coverage build.
It looks like the root cause is an uninitialized variable, RESULT( 4 ), inside the test routine, for example in TESTING/LIN/dchkqp3rk.f. It initially contains garbage, because nothing assigns it when the following condition is false
IF( DTEMP.LT.ZERO ) THEN
RESULT( 4 ) = BIGNUM
END IF
in the normal case.
That is why the final threshold check
IF( RESULT( 4 ).GE.THRESH ) THEN
is always true, which leads to every test failing.
Somewhere above we should set
RESULT( 4 ) = ZERO
@dklyuchinskiy Nice catch! Should I create an MR or do you want to do that?
@ACSimon33 I would be glad if you created the MR and checked the fix with the coverage build. I have not worked with it before.
Also, I am confused by a few other places in the test.
- According to the documentation, condition 4 is
Returns 1.0D+100 if abs(R(K+1,K+1)) > abs(R(K,K)), K=1:KFACT-1
The elements on the diagonal of R should be non-increasing.
But after that we check the condition
DTEMP = (( ABS( A( (J-1)*M+J ) ) -
$ ABS( A( (J)*M+J+1 ) ) ) /
$ ABS( A(1) ) )
The indices seem to point to sub-diagonal elements of A (or R). Does that match the documentation?
- In the formula above we should use LDA instead of M, I guess.
Please correct me, if I am wrong.
@dklyuchinskiy I think the indices are actually pointing to the diagonal since Fortran is 1-indexed. So, for example if M=10 and J=1 it will be (A(1) - A(12))/A(1), which is the first diagonal element minus the second one scaled by the first. So, the test itself is correct.
I agree that we should use LDA even if it doesn't make a difference for the test (LDA=max(1,M)) because the test is only executed if the matrix rank is greater than 2.
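The index arithmetic can be checked with a short Python sketch of 1-based, column-major addressing. The helper names are hypothetical, and the last lines use a made-up case with LDA larger than M to show where the two formulas would diverge.

```python
def linear_index(i, j, lda):
    """1-based linear index of element (i, j) in a column-major array
    stored with leading dimension lda."""
    return (j - 1) * lda + i


def diag_index(j, lda):
    """Index used in the test for the j-th diagonal element: (J-1)*LDA+J."""
    return (j - 1) * lda + j


def next_diag_index(j, lda):
    """Index used in the test for the (j+1)-th diagonal element: J*LDA+J+1."""
    return j * lda + j + 1


# The example from the discussion: M = 10, J = 1 gives A(1) and A(12).
first = diag_index(1, 10)
second = next_diag_index(1, 10)

# Hypothetical case with LDA > M: using M as the stride misses the diagonal.
m, lda = 3, 5
with_lda = next_diag_index(1, lda)  # element (2, 2) when the stride is LDA
with_m = next_diag_index(1, m)      # lands in column 1 of the padded array
```

Both expressions address diagonal elements as long as the stride matches the storage, which is why M and LDA are interchangeable only when LDA equals M.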
This bug is currently being fixed. It concerns test number 4, which does not affect the correctness of the routine's code or results. The test should check (with some care) whether the absolute values of the diagonal elements are non-increasing.
@ACSimon33 could you please provide:
- the information about your system environment;
- if the failing-test report in your original bug report is not complete (i.e. truncated), please provide the full output.
Thank you.
Hi @scr2016, please have a look at the PR which is linked in this issue. The problem was just an uninitialized RESULT vector as far as I can tell. At least it fixed the issue on my side and all tests are passing now.
I can reproduce the old errors tomorrow if you think that it’s still necessary.
@ACSimon33. The complete test error output and the environment information would help to check the issue thoroughly.
Thank you in advance.
On Fri, Dec 15, 2023 at 12:59 PM Simon Lukas Märtens < @.***> wrote:
Hi @scr2016, please have a look at the PR which is linked in this issue. The problem was just an uninitialized RESULT vector as far as I can tell. At least it fixed the issue on my side and all tests are passing now.
I can reproduce the old errors tomorrow if you think that it’s still necessary.
@dklyuchinskiy I think the indices are actually pointing to the diagonal since Fortran is 1-indexed. So, for example if M=10 and J=1 it will be (A(1) - A(12))/A(1), which is the first diagonal element minus the second one scaled by the first. So, the test itself is correct.
@ACSimon33 Yep, thank you for the explanation! You are right! My fault :)
@ACSimon33. The complete test error output and the environment information would help to check the issue thoroughly. Thank you in advance.
@scr2016 Here are the complete test results: LAPACK_test_results.txt
I compiled with GCC 13.2 on CentOS Linux 7 (Core). The issues only appear in the coverage build for me:
mkdir build && cd build
cmake -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DCMAKE_Fortran_COMPILER=gfortran -DCMAKE_BUILD_TYPE=coverage ..
make -j8
ctest -j8
@scr2016 I tried several more GCC versions (4.8.5, 5.5.0, 6.5.0, 7.5.0, 8.4.0, 9.3.0, 10.3.0, 12.2.0, 13.2.0). The issue only exists for GCC >= 7.5.0.