Nightly test failures, Cuda.svd_* and MKL DGEMM
Nightly test failures, follow up to #2096
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
There also seems to be a similar-looking issue with oneMKL:
Intel MKL ERROR: Parameter 8 was incorrect on entry to DGEMM .
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
Intel MKL ERROR: Parameter 10 was incorrect on entry to DGEMM .
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
Intel MKL ERROR: Parameter 8 was incorrect on entry to DGEMM .
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
Intel MKL ERROR: Parameter 10 was incorrect on entry to DGEMM .
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 3 vs 6.66134e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 4 vs 8.88178e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 5 vs 1.11022e-13
Intel MKL ERROR: Parameter 10 was incorrect on entry to DGEMM .
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
Originally posted by @lucbv in https://github.com/kokkos/kokkos-kernels/issues/2096#issuecomment-1941819656
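For anyone reading the MKL messages above: in the reference BLAS argument order for DGEMM (TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC), positions 8 and 10 are the leading dimensions LDA and LDB. The sketch below mirrors the argument checks of the reference (Fortran-style, column-major) DGEMM. The function name `first_invalid_dgemm_arg` is hypothetical, and it is an assumption that MKL counts parameters the same way in these logs; this is not code from Kokkos Kernels or MKL.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper (not part of Kokkos Kernels or MKL).
// Returns the 1-based position of the first argument that the reference BLAS
// DGEMM would reject, or 0 if every checked argument is acceptable.
// Argument positions: TRANSA(1) TRANSB(2) M(3) N(4) K(5) ALPHA(6) A(7)
//                     LDA(8) B(9) LDB(10) BETA(11) C(12) LDC(13)
inline int first_invalid_dgemm_arg(char transa, char transb, int m, int n,
                                   int k, int lda, int ldb, int ldc) {
  auto is_no_trans = [](char t) { return t == 'N' || t == 'n'; };
  auto is_valid_trans = [&](char t) {
    return is_no_trans(t) || t == 'T' || t == 't' || t == 'C' || t == 'c';
  };

  // Number of rows of A and B as stored (column-major).
  const int nrowa = is_no_trans(transa) ? m : k;
  const int nrowb = is_no_trans(transb) ? k : n;

  if (!is_valid_trans(transa)) return 1;
  if (!is_valid_trans(transb)) return 2;
  if (m < 0) return 3;
  if (n < 0) return 4;
  if (k < 0) return 5;
  // A leading dimension must be at least 1, even for an empty matrix.
  if (lda < std::max(1, nrowa)) return 8;
  if (ldb < std::max(1, nrowb)) return 10;
  if (ldc < std::max(1, m)) return 13;
  return 0;
}
```

For example, with no transposes and `m = 3, n = 4, k = 5`, passing `lda = 2` returns 8 and passing `ldb = 4` returns 10. A zero leading dimension is also rejected when the matrix has zero rows, since the minimum accepted value is 1.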
Reproducer (weaver rhel8 queue):
source /projects/ppc64le-pwr9-rhel8/legacy-env.sh
module load cuda/11.2.2/gcc/8.3.1 cmake/3.23.1
${KOKKOSKERNELS_PATH}/cm_generate_makefile.bash --with-cuda --with-serial --compiler=${KOKKOS_PATH}/bin/nvcc_wrapper --arch=Volta70,Power9 --with-cuda-options=enable_lambda --kokkos-path=${KOKKOS_PATH} --kokkoskernels-path=${KOKKOSKERNELS_PATH} --cxxflags=${CXXFLAGS} --with-scalars='float,complex_float' --with-ordinals=int --with-offsets=int,size_t --with-layouts=LayoutLeft --with-tpls=cusparse,cublas,cusolver --cxxstandard=17
@ndellingwood the PR above should fix the CUDA side of the problem. I will test it on the Blake oneAPI build and see if that cleans things up there as well. Interestingly, I did not do anything for PVC, so I would expect nothing to run there, but if MKL is enabled on the host side there could be something going on... I may do a second PR so that we can merge the first one quickly and clean up some of our nightly builds!
Okay, so far I am not seeing the CUDA error this morning. Let us wait until the afternoon in case some tests finish late, but this looks like a promising start. I'll have a look at the Intel/MKL issue; hopefully I can sort it out and close this issue soon! :)
Okay, PR #2110 just merged, so let's keep an eye on this. I plan on making a bigger follow-up PR that will address all of the BLAS kernels so that they run properly whichever integer width MKL is using... This should clean up a significant number of the segfaults we see in the nightly oneAPI builds!
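To illustrate the integer-width point: MKL's BLAS integer type (`MKL_INT`) is 32-bit with the LP64 interface and 64-bit with ILP64, so sizes and strides held in a wider index type need a checked conversion before they reach the BLAS call. Below is a minimal sketch of such a conversion. The helper name `to_blas_int` is hypothetical and `BlasInt` stands in for `MKL_INT`; this is not the code from PR #2110 or the follow-up PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <type_traits>

// Hypothetical helper: convert an index to the BLAS integer type, throwing
// instead of silently truncating when the value does not fit.
// BlasInt is a stand-in for MKL_INT (32-bit under LP64, 64-bit under ILP64).
template <typename BlasInt, typename Index>
BlasInt to_blas_int(Index value) {
  static_assert(std::is_integral<BlasInt>::value &&
                    std::is_integral<Index>::value,
                "to_blas_int expects integer types");

  const BlasInt narrowed = static_cast<BlasInt>(value);

  // The value fits only if it survives the round trip and keeps its sign.
  const bool same_value = static_cast<Index>(narrowed) == value;
  const bool same_sign = (narrowed < BlasInt(0)) == (value < Index(0));
  if (!same_value || !same_sign) {
    throw std::overflow_error("index does not fit in the BLAS integer type");
  }
  return narrowed;
}
```

With a 32-bit `BlasInt`, `to_blas_int<std::int32_t>(std::int64_t{42})` returns 42, while a value such as `std::int64_t{1} << 40` throws rather than being passed to the library as a truncated dimension.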
@ndellingwood this should be resolved now; I did not see the error come up in last night's build. One more thing though: I am adding PR #2112 to generalize the fix and hopefully clean up some of our oneMKL issues.
Okay, PR #2112 has merged now; let us see whether our nightly build on Blake improves. I think quite a few more unit tests should be passing now!