Nightly test failures with the cuSOLVER TPL enabled, Cuda.svd_* unit tests
Nightly test failures are occurring with cuSOLVER enabled in the SVD unit tests, of the form "CUSOLVER does not support SVD for matrices with more columns than rows...":
...
[ RUN ] Cuda.svd_float
Running impl_test_svd with sizes: 0x0
Running impl_test_svd with sizes: 1x1
Running impl_test_svd with sizes: 15x15
Running impl_test_svd with sizes: 100x100
Running impl_test_svd with sizes: 100x70
Running impl_test_svd with sizes: 70x100
unknown file: Failure
C++ exception with description "CUSOLVER does not support SVD for matrices with more columns than rows, you can transpose you matrix first then compute SVD of that transpose: At=VSUt, and swap the output U and Vt and transpose them to recover the desired SVD." thrown in the test body.
[ FAILED ] Cuda.svd_float (215 ms)
[ RUN ] Cuda.svd_double
Running impl_test_svd with sizes: 0x0
Running impl_test_svd with sizes: 1x1
Running impl_test_svd with sizes: 15x15
Running impl_test_svd with sizes: 100x100
Running impl_test_svd with sizes: 100x70
Running impl_test_svd with sizes: 70x100
unknown file: Failure
C++ exception with description "CUSOLVER does not support SVD for matrices with more columns than rows, you can transpose you matrix first then compute SVD of that transpose: At=VSUt, and swap the output U and Vt and transpose them to recover the desired SVD." thrown in the test body.
[ FAILED ] Cuda.svd_double (169 ms)
[ RUN ] Cuda.svd_complex_float
Running impl_test_svd with sizes: 0x0
Running impl_test_svd with sizes: 1x1
Running impl_test_svd with sizes: 15x15
Running impl_test_svd with sizes: 100x100
Running impl_test_svd with sizes: 100x70
Running impl_test_svd with sizes: 70x100
unknown file: Failure
C++ exception with description "CUSOLVER does not support SVD for matrices with more columns than rows, you can transpose you matrix first then compute SVD of that transpose: At=VSUt, and swap the output U and Vt and transpose them to recover the desired SVD." thrown in the test body.
[ FAILED ] Cuda.svd_complex_float (181 ms)
[----------] 12 tests from Cuda (8252 ms total)
[----------] Global test environment tear-down
[==========] 12 tests from 1 test case ran. (8252 ms total)
[ PASSED ] 9 tests.
[ FAILED ] 3 tests, listed below:
[ FAILED ] Cuda.svd_float
[ FAILED ] Cuda.svd_double
[ FAILED ] Cuda.svd_complex_float
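For reference, the workaround described in the exception message (compute the SVD of the transpose, then swap and transpose the factors) can be sketched in a few lines. This is a pure-Python illustration with hand-built factors for a tiny wide matrix; the helper names and the example matrix are illustrative, not the actual KokkosKernels code.

```python
# cuSOLVER's gesvd requires rows >= cols. For a wide matrix A, compute the
# SVD of A^T = U' S V'^T instead; then A = V' S U'^T, i.e. U = V', Vt = U'^T.

def transpose(M):
    return [list(row) for row in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# A is 2x3 (more columns than rows), so gesvd rejects it directly.
A = [[3.0, 0.0, 0.0],
     [0.0, 2.0, 0.0]]

# Hand-built SVD factors of A^T (3x2, rows >= cols): A^T = U' S V'^T
Up  = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # U' (3x2)
S   = [[3.0, 0.0], [0.0, 2.0]]              # singular values on the diagonal
Vpt = [[1.0, 0.0], [0.0, 1.0]]              # V'^T (2x2)

# Check the factors reproduce A^T ...
At = matmul(matmul(Up, S), Vpt)
assert At == transpose(A)

# ... then swap and transpose to recover the SVD of A: U = V', Vt = U'^T
U  = transpose(Vpt)
Vt = transpose(Up)
assert matmul(matmul(U, S), Vt) == A
```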
Adding @lucbv; cross-reference #2092.
Reproducer (weaver rhel8 queue):
source /projects/ppc64le-pwr9-rhel8/legacy-env.sh
module load cuda/11.2.2/gcc/8.3.1 cmake/3.23.1
${KOKKOSKERNELS_PATH}/cm_generate_makefile.bash --with-cuda --with-serial --compiler=${KOKKOS_PATH}/bin/nvcc_wrapper --arch=Volta70,Power9 --with-cuda-options=enable_lambda --kokkos-path=${KOKKOS_PATH} --kokkoskernels-path=${KOKKOSKERNELS_PATH} --cxxflags=${CXXFLAGS} --with-scalars='float,complex_float' --with-ordinals=int --with-offsets=int,size_t --with-layouts=LayoutLeft --with-tpls=cusparse,cublas,cusolver --cxxstandard=17
Hmm, this was tested with the auto-tester, but I guess there is at least one corner case in which we call the test when we really should not. I'll have a fix for that this week; thanks for pinging me @ndellingwood.
@lucbv thanks for #2103, that resolved the "CUSOLVER does not support SVD for matrices with more columns than rows..."-type messages, but I am still seeing tolerance-related failures in the cuda/11.2.2 build on Weaver:
07:04:28 [ RUN ] Cuda.svd_float
07:04:28 Running impl_test_svd with sizes: 0x0
07:04:28 Running impl_test_svd with sizes: 1x1
07:04:28 Running impl_test_svd with sizes: 15x15
07:04:28 Running impl_test_svd with sizes: 100x100
07:04:28 Running impl_test_svd with sizes: 100x70
07:04:28 Running impl_test_svd with sizes: 70x100
07:04:28 /home/jenkins/weaver/workspace/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp:159: Failure
07:04:28 Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
07:04:28 /home/jenkins/weaver/workspace/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp:159: Failure
07:04:28 Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
...
07:04:31 [ FAILED ] Cuda.svd_float
07:04:31 [ FAILED ] Cuda.svd_double
07:04:31 [ FAILED ] Cuda.svd_complex_float
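The failing line in KokkosKernels_TestUtils.hpp:159 is a plain near-equality check, `abs(val1 - val2) <= abs(tol)`. A small Python sketch of that check is below; the `1.0e4 * eps` scaling is an inference from the logged tolerance (float machine epsilon is ~1.19209e-07 and the log shows 0.00119209), not something confirmed from the test source.

```python
# Sketch of the near-equality comparison behind
# "Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol))".
def expect_near(val1, val2, tol):
    return abs(val1 - val2) <= abs(tol)

# float machine epsilon; the log's 0.00119209 looks like 1.0e4 * eps
eps_float = 2.0 ** -23
tol = 1.0e4 * eps_float
print(tol)                          # ~0.00119209, as in the failure message
print(expect_near(1.0, 0.0, tol))   # the failing case "1 vs 0.00119209"
```

The "actual: 1 vs ..." pairs mean an entry of the compared matrices is off by a full unit, so these look like genuinely wrong values rather than a slightly-too-tight tolerance.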
@ndellingwood this particular one seems to be gone, although there is a new Cuda issue in the nightly that looks like this:
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
and also an issue with oneMKL that looks similar:
Intel MKL ERROR: Parameter 8 was incorrect on entry to DGEMM .
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
Intel MKL ERROR: Parameter 10 was incorrect on entry to DGEMM .
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
Intel MKL ERROR: Parameter 8 was incorrect on entry to DGEMM .
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
Intel MKL ERROR: Parameter 10 was incorrect on entry to DGEMM .
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 3 vs 6.66134e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 4 vs 8.88178e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 5 vs 1.11022e-13
Intel MKL ERROR: Parameter 10 was incorrect on entry to DGEMM .
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
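A note on the MKL messages: in the Fortran `dgemm(transa, transb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc)` interface, parameter 8 is `lda` and parameter 10 is `ldb`, so "Parameter 8/10 was incorrect" points at leading dimensions, not at the matrix values. The checker below mirrors the documented BLAS requirement on `lda`; it is illustrative, not MKL's actual validation code.

```python
# BLAS requires lda >= max(1, m) when transa == 'N' (A is m x k),
# and lda >= max(1, k) when transa is 'T' or 'C' (A is k x m).
def dgemm_ld_ok(transa, m, k, lda):
    rows = m if transa == 'N' else k
    return lda >= max(1, rows)

print(dgemm_ld_ok('N', m=100, k=70, lda=100))  # True: valid call
print(dgemm_ld_ok('N', m=100, k=70, lda=70))   # False: "Parameter 8" error
```

A too-small leading dimension is exactly the kind of bug that shows up when a test starts exercising non-square (transposed) shapes.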
@lucbv I posted it as a follow-on to this issue. I'll open a separate issue for tracking.
Okay, let me know if you want to close this one then. I think the issue with the non-square matrices on CUDA should be resolved, but the problem above is new, so it will need investigation. The DGEMM complaint from MKL makes me think there is a problem in how I check the unitary matrices, or even the triple product U*S*Vt = A, so hopefully it should be an easy fix. It is a bit odd that it only appears now and not in previous builds.
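The two checks mentioned above can be sketched in pure Python: (1) the computed U is unitary, i.e. U^T U ≈ I, and (2) the triple product U S V^T reconstructs A. The helpers and the tiny example matrices are illustrative, not the actual KokkosKernels test code.

```python
def transpose(M):
    return [list(row) for row in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

A  = [[0.0, 2.0], [3.0, 0.0]]
U  = [[0.0, 1.0], [1.0, 0.0]]  # left singular vectors
S  = [[3.0, 0.0], [0.0, 2.0]]  # singular values on the diagonal
Vt = [[1.0, 0.0], [0.0, 1.0]]  # right singular vectors, transposed
I2 = [[1.0, 0.0], [0.0, 1.0]]

tol = 1.0e-12
assert max_abs_diff(matmul(transpose(U), U), I2) <= tol   # unitarity of U
assert max_abs_diff(matmul(matmul(U, S), Vt), A) <= tol   # U S Vt == A
```

If either of these products is formed with a gemm call that gets a wrong leading dimension for the transposed shapes, MKL would reject the call exactly as in the log above.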
@lucbv correct, this issue is resolved, so I'll open a new issue for the different types of failures. There had been some preexisting MKL failures, and I hadn't noticed the new ones come through; thanks for catching that!