
Nightly test failures with cusolver tpl enabled, Cuda.svd_* unit tests

Open ndellingwood opened this issue 2 years ago • 6 comments

Nightly test failures are occurring in the SVD unit tests when the cuSOLVER TPL is enabled, with errors of the form "CUSOLVER does not support SVD for matrices with more columns than rows...":

...
[ RUN      ] Cuda.svd_float
Running impl_test_svd with sizes: 0x0
Running impl_test_svd with sizes: 1x1
Running impl_test_svd with sizes: 15x15
Running impl_test_svd with sizes: 100x100
Running impl_test_svd with sizes: 100x70
Running impl_test_svd with sizes: 70x100
unknown file: Failure
C++ exception with description "CUSOLVER does not support SVD for matrices with more columns than rows, you can transpose you matrix first then compute SVD of that transpose: At=VSUt, and swap the output U and Vt and transpose them to recover the desired SVD." thrown in the test body.
[  FAILED  ] Cuda.svd_float (215 ms)
[ RUN      ] Cuda.svd_double
Running impl_test_svd with sizes: 0x0
Running impl_test_svd with sizes: 1x1
Running impl_test_svd with sizes: 15x15
Running impl_test_svd with sizes: 100x100
Running impl_test_svd with sizes: 100x70
Running impl_test_svd with sizes: 70x100
unknown file: Failure
C++ exception with description "CUSOLVER does not support SVD for matrices with more columns than rows, you can transpose you matrix first then compute SVD of that transpose: At=VSUt, and swap the output U and Vt and transpose them to recover the desired SVD." thrown in the test body.
[  FAILED  ] Cuda.svd_double (169 ms)
[ RUN      ] Cuda.svd_complex_float
Running impl_test_svd with sizes: 0x0
Running impl_test_svd with sizes: 1x1
Running impl_test_svd with sizes: 15x15
Running impl_test_svd with sizes: 100x100
Running impl_test_svd with sizes: 100x70
Running impl_test_svd with sizes: 70x100
unknown file: Failure
C++ exception with description "CUSOLVER does not support SVD for matrices with more columns than rows, you can transpose you matrix first then compute SVD of that transpose: At=VSUt, and swap the output U and Vt and transpose them to recover the desired SVD." thrown in the test body.
[  FAILED  ] Cuda.svd_complex_float (181 ms)
[----------] 12 tests from Cuda (8252 ms total)

[----------] Global test environment tear-down
[==========] 12 tests from 1 test case ran. (8252 ms total)
[  PASSED  ] 9 tests.
[  FAILED  ] 3 tests, listed below:
[  FAILED  ] Cuda.svd_float
[  FAILED  ] Cuda.svd_double
[  FAILED  ] Cuda.svd_complex_float
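
For reference, the workaround described in the exception message (factor the transpose, then swap and transpose the outputs) can be sketched as below. This is only an illustration for real-valued matrices under stated assumptions: `Matrix`, `SvdFactors`, `svd_from_transpose_factors`, and `reconstruct` are hypothetical helpers written for this note, not part of Kokkos Kernels or cuSOLVER, and the factors of the transpose are assumed to come from some SVD routine that is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal row-major dense matrix, for illustration only (not a Kokkos View).
struct Matrix {
  std::size_t rows = 0, cols = 0;
  std::vector<double> data;
  Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
  double& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
  double operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

Matrix transpose(const Matrix& A) {
  Matrix At(A.cols, A.rows);
  for (std::size_t i = 0; i < A.rows; ++i)
    for (std::size_t j = 0; j < A.cols; ++j) At(j, i) = A(i, j);
  return At;
}

// Thin SVD factors of some matrix M, with M = U * diag(S) * Vt.
struct SvdFactors {
  Matrix U;
  std::vector<double> S;
  Matrix Vt;
};

// Given the factors of A^T (A^T = P * diag(S) * Qt), return the factors of A.
// Since A = (A^T)^T = Qt^T * diag(S) * P^T, we take U = Qt^T and Vt = P^T.
SvdFactors svd_from_transpose_factors(const SvdFactors& of_At) {
  return SvdFactors{transpose(of_At.Vt), of_At.S, transpose(of_At.U)};
}

// Rebuild U * diag(S) * Vt so the result can be compared against A.
Matrix reconstruct(const SvdFactors& f) {
  assert(f.U.cols == f.S.size() && f.Vt.rows == f.S.size());
  Matrix A(f.U.rows, f.Vt.cols);
  for (std::size_t i = 0; i < A.rows; ++i)
    for (std::size_t j = 0; j < A.cols; ++j)
      for (std::size_t k = 0; k < f.S.size(); ++k)
        A(i, j) += f.U(i, k) * f.S[k] * f.Vt(k, j);
  return A;
}
```

As a small worked case, take the 1x2 matrix A = [3 4]. Its transpose has the thin SVD P = [0.6; 0.8], S = [5], Qt = [1]. Swapping and transposing gives U = [1] and Vt = [0.6 0.8], and U * diag(S) * Vt recovers [3 4]. For complex scalars the transposes would need to be conjugate transposes.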

Adding @lucbv; cross-reference #2092.

Reproducer (weaver rhel8 queue):

source /projects/ppc64le-pwr9-rhel8/legacy-env.sh
module load cuda/11.2.2/gcc/8.3.1 cmake/3.23.1

${KOKKOSKERNELS_PATH}/cm_generate_makefile.bash --with-cuda --with-serial --compiler=${KOKKOS_PATH}/bin/nvcc_wrapper --arch=Volta70,Power9 --with-cuda-options=enable_lambda --kokkos-path=${KOKKOS_PATH} --kokkoskernels-path=${KOKKOSKERNELS_PATH}  --cxxflags=${CXXFLAGS} --with-scalars='float,complex_float' --with-ordinals=int --with-offsets=int,size_t --with-layouts=LayoutLeft --with-tpls=cusparse,cublas,cusolver --cxxstandard=17

ndellingwood avatar Feb 06 '24 18:02 ndellingwood

Hmm, this was tested with the auto-tester, but I guess there is at least one corner case in which we are calling the test when we really should not. I'll have a fix for that this week. Thanks for pinging me, @ndellingwood.

lucbv avatar Feb 06 '24 19:02 lucbv

@lucbv thanks for #2103, that resolved the "CUSOLVER does not support SVD for matrices with more columns than rows..." type messages, but I am still seeing tolerance-related failures in the cuda/11.2.2 build on Weaver:

07:04:28 [ RUN      ] Cuda.svd_float
07:04:28 Running impl_test_svd with sizes: 0x0
07:04:28 Running impl_test_svd with sizes: 1x1
07:04:28 Running impl_test_svd with sizes: 15x15
07:04:28 Running impl_test_svd with sizes: 100x100
07:04:28 Running impl_test_svd with sizes: 100x70
07:04:28 Running impl_test_svd with sizes: 70x100
07:04:28 /home/jenkins/weaver/workspace/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp:159: Failure
07:04:28 Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
07:04:28 /home/jenkins/weaver/workspace/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp:159: Failure
07:04:28 Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
...
07:04:31 [  FAILED  ] Cuda.svd_float
07:04:31 [  FAILED  ] Cuda.svd_double
07:04:31 [  FAILED  ] Cuda.svd_complex_float

ndellingwood avatar Feb 13 '24 00:02 ndellingwood

@ndellingwood this particular one seems to be gone although there is a new Cuda issue in the nightly that looks like this:

Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209

and also an issue with oneMKL that looks similar?

Intel MKL ERROR: Parameter 8 was incorrect on entry to DGEMM .
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14

Intel MKL ERROR: Parameter 10 was incorrect on entry to DGEMM .
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14

Intel MKL ERROR: Parameter 8 was incorrect on entry to DGEMM .
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14

Intel MKL ERROR: Parameter 10 was incorrect on entry to DGEMM .
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 3 vs 6.66134e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 4 vs 8.88178e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 5 vs 1.11022e-13

Intel MKL ERROR: Parameter 10 was incorrect on entry to DGEMM .
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
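
As context for the MKL messages above: in the standard BLAS argument order, `DGEMM(TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC)`, parameter 8 is `LDA` and parameter 10 is `LDB`, the leading dimensions of the two input matrices. The sketch below mirrors the argument checks of the reference BLAS implementation, assuming MKL numbers its parameters the same way. `dgemm_first_bad_argument` is a hypothetical helper written for this note. The log does not show which call in the test triggered the error.

```cpp
#include <algorithm>
#include <cassert>

// Returns the 1-based position of the first invalid DGEMM argument,
// or 0 if all dimension arguments are acceptable. The order of checks
// follows the reference BLAS DGEMM (column-major storage).
int dgemm_first_bad_argument(char transa, char transb, int m, int n, int k,
                             int lda, int ldb, int ldc) {
  auto is_no_trans = [](char t) { return t == 'N' || t == 'n'; };
  auto is_trans = [](char t) {
    return t == 'T' || t == 't' || t == 'C' || t == 'c';
  };
  assert(8 < 10);  // reminder: LDA (8) is checked before LDB (10)

  // Number of rows of A and B as stored in memory, before any transpose.
  const int nrowa = is_no_trans(transa) ? m : k;
  const int nrowb = is_no_trans(transb) ? k : n;

  if (!is_no_trans(transa) && !is_trans(transa)) return 1;
  if (!is_no_trans(transb) && !is_trans(transb)) return 2;
  if (m < 0) return 3;
  if (n < 0) return 4;
  if (k < 0) return 5;
  if (lda < std::max(1, nrowa)) return 8;   // "Parameter 8 was incorrect"
  if (ldb < std::max(1, nrowb)) return 10;  // "Parameter 10 was incorrect"
  if (ldc < std::max(1, m)) return 13;
  return 0;
}
```

For example, multiplying the transpose of a stored 100x70 matrix with `transa = 'T'`, `m = 70`, `k = 100` requires `lda >= 100`; passing `lda = 70` would be reported as parameter 8.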

lucbv avatar Feb 13 '24 15:02 lucbv

@lucbv I posted that as a follow-on to this issue. I'll open a separate issue for tracking.

ndellingwood avatar Feb 13 '24 17:02 ndellingwood

Okay, let me know if you want to close this one then. I think the issue with the non-square matrices on CUDA should be resolved, but the problem above is new, so I will need to investigate. The DGEMM complaint from MKL makes me think there is a problem in how I check the unitary matrices, or even the triple product USVt = A, so hopefully it should be an easy fix. It is a bit interesting that it only appears now and not in previous builds.
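
To make the kind of check being discussed concrete, here is a minimal sketch of an orthonormality test for real-valued factors, with a tolerance expressed as a multiple of machine epsilon. It is not the code in the Kokkos Kernels SVD test. `max_orthonormality_error` and `is_orthonormal` are hypothetical names, and the column-major layout with leading dimension `m` is an assumption. The tolerances printed in the logs are numerically equal to 1e4 times the `float` epsilon (0.00119209) and 100 times the `double` epsilon (2.22045e-14), which is why the factor is a parameter here.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Largest entry of |Q^T Q - I| for a column-major m x n matrix Q
// stored contiguously (leading dimension m). Real scalars only.
template <class Scalar>
Scalar max_orthonormality_error(const std::vector<Scalar>& Q, std::size_t m,
                                std::size_t n) {
  assert(Q.size() == m * n);
  Scalar worst = 0;
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      Scalar dot = 0;  // inner product of columns i and j
      for (std::size_t r = 0; r < m; ++r) dot += Q[r + i * m] * Q[r + j * m];
      const Scalar expected = (i == j) ? Scalar(1) : Scalar(0);
      worst = std::max(worst, std::abs(dot - expected));
    }
  }
  return worst;
}

// True if the columns of Q are orthonormal to within factor * epsilon.
template <class Scalar>
bool is_orthonormal(const std::vector<Scalar>& Q, std::size_t m, std::size_t n,
                    Scalar factor) {
  return max_orthonormality_error(Q, m, n) <=
         factor * std::numeric_limits<Scalar>::epsilon();
}
```

A 2x2 rotation such as `{0.6, 0.8, -0.8, 0.6}` passes with a factor of 100 in `double`, while an all-zero matrix yields an error of exactly 1. The triple product check USVt = A can be written the same way by comparing each reconstructed entry against A.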

lucbv avatar Feb 13 '24 17:02 lucbv

@lucbv correct, this issue is resolved so I'll open a new issue for the different types of failures. There had been some preexisting MKL failures and I hadn't noticed the new stuff come through, thanks for catching that!

ndellingwood avatar Feb 13 '24 17:02 ndellingwood