
Nightly test failures, Cuda.svd_* and MKL DGEMM

Open ndellingwood opened this issue 2 years ago • 5 comments

Nightly test failures, follow up to #2096

Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209

and also an issue with oneMKL that looks similar?

Intel MKL ERROR: Parameter 8 was incorrect on entry to DGEMM .
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14

Intel MKL ERROR: Parameter 10 was incorrect on entry to DGEMM .
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14

Intel MKL ERROR: Parameter 8 was incorrect on entry to DGEMM .
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14

Intel MKL ERROR: Parameter 10 was incorrect on entry to DGEMM .
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 3 vs 6.66134e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 4 vs 8.88178e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 5 vs 1.11022e-13

Intel MKL ERROR: Parameter 10 was incorrect on entry to DGEMM .
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14

Originally posted by @lucbv in https://github.com/kokkos/kokkos-kernels/issues/2096#issuecomment-1941819656
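
For readers decoding the MKL messages above: in the reference Fortran interface, `DGEMM(TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC)`, argument 8 is `LDA` and argument 10 is `LDB`. The log does not say why those values were rejected, so the following is only a minimal sketch of the leading-dimension checks that reference BLAS applies to column-major inputs. The helper name `first_bad_gemm_ld` is hypothetical and is not part of Kokkos Kernels or MKL.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper, for illustration only.
// Mirrors the leading-dimension checks of reference BLAS DGEMM
// (column-major storage). Returns 0 when LDA, LDB and LDC are acceptable,
// otherwise the 1-based position of the first offending argument in
//   DGEMM(TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC)
// The checks on TRANSA, TRANSB, M, N and K are intentionally omitted.
inline int first_bad_gemm_ld(char transa, char transb,
                             std::int64_t m, std::int64_t n, std::int64_t k,
                             std::int64_t lda, std::int64_t ldb,
                             std::int64_t ldc) {
  const bool nota = (transa == 'N' || transa == 'n');
  const bool notb = (transb == 'N' || transb == 'n');

  // Number of rows of A and B as they are stored in memory.
  const std::int64_t nrowa = nota ? m : k;
  const std::int64_t nrowb = notb ? k : n;

  const std::int64_t one = 1;
  if (lda < std::max(one, nrowa)) return 8;   // LDA
  if (ldb < std::max(one, nrowb)) return 10;  // LDB
  if (ldc < std::max(one, m)) return 13;      // LDC
  return 0;
}
```

For example, `first_bad_gemm_ld('N', 'N', 4, 3, 2, 0, 2, 4)` returns 8, because a leading dimension of 0 is smaller than `max(1, M)`. Whether the nightly failures come from a zero leading dimension, an integer-width mismatch, or something else is not established by the log itself.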

Reproducer (weaver rhel8 queue):

source /projects/ppc64le-pwr9-rhel8/legacy-env.sh
module load cuda/11.2.2/gcc/8.3.1 cmake/3.23.1

${KOKKOSKERNELS_PATH}/cm_generate_makefile.bash --with-cuda --with-serial --compiler=${KOKKOS_PATH}/bin/nvcc_wrapper --arch=Volta70,Power9 --with-cuda-options=enable_lambda --kokkos-path=${KOKKOS_PATH} --kokkoskernels-path=${KOKKOSKERNELS_PATH}  --cxxflags=${CXXFLAGS} --with-scalars='float,complex_float' --with-ordinals=int --with-offsets=int,size_t --with-layouts=LayoutLeft --with-tpls=cusparse,cublas,cusolver --cxxstandard=17

ndellingwood avatar Feb 13 '24 17:02 ndellingwood

@ndellingwood the PR above should fix the CUDA side of the problem. I will test it on the Blake oneAPI build and see if that cleans things up there as well. Interestingly, I did not do anything for PVC, so I would expect nothing to run there, but if MKL is enabled on the host side there could be something going on... I may do a second PR so that we can merge the first one quickly and clean up some of our nightly builds!

lucbv avatar Feb 14 '24 17:02 lucbv

Okay, so far I am not seeing the CUDA error this morning. Let us wait until the afternoon in case some tests finish late, but this looks like a promising start. I'll have a look at the Intel/MKL issue; hopefully I can sort it out and close this issue soon! : )

lucbv avatar Feb 15 '24 13:02 lucbv

Okay, PR #2110 just merged, so let's keep an eye on this. I plan on making a bigger follow-up PR that will address all of the BLAS kernels so that they run properly whichever integer width MKL is built with... This should significantly clean up some of the segfaults we see in the nightly oneAPI builds!

lucbv avatar Feb 15 '24 23:02 lucbv
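
To illustrate the integer-width point in the comment above: MKL's `MKL_INT` is a 32-bit `int` with the default LP64 interface and a 64-bit integer when `MKL_ILP64` is defined, so sizes held in a different integer type need an explicit conversion before they reach a BLAS call. The sketch below is not the code from PR #2110 or #2112. It uses a stand-in alias `blas_int` and a hypothetical helper `checked_int_cast`, and the macro `SKETCH_USE_ILP64` is made up for this example so that it compiles without MKL installed.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <type_traits>

// Stand-in for MKL_INT (assumption: with real MKL you would include <mkl.h>
// and use MKL_INT directly). SKETCH_USE_ILP64 is a made-up macro that plays
// the role of MKL_ILP64 in this sketch.
#ifdef SKETCH_USE_ILP64
using blas_int = std::int64_t;
#else
using blas_int = std::int32_t;
#endif

// Hypothetical helper: convert between signed integer types and throw
// instead of silently truncating when the value does not fit.
template <class To, class From>
To checked_int_cast(From value) {
  static_assert(std::is_integral<To>::value && std::is_signed<To>::value,
                "To must be a signed integer type");
  static_assert(std::is_integral<From>::value && std::is_signed<From>::value,
                "From must be a signed integer type");
  if (value < std::numeric_limits<To>::min() ||
      value > std::numeric_limits<To>::max()) {
    throw std::overflow_error("value does not fit in the target integer type");
  }
  return static_cast<To>(value);
}
```

A caller would then write something like `const blas_int lda = checked_int_cast<blas_int>(a_stride);` before invoking the BLAS routine, where `a_stride` is whatever 64-bit stride the calling code holds. The same call site then works for both interface widths, and an out-of-range size produces an exception rather than a corrupted argument.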

@ndellingwood this should be resolved now; I did not see the error come up in last night's build. One more thing though: I am adding PR #2112 to generalize the fix and hopefully clean up some of our oneMKL issues.

lucbv avatar Feb 16 '24 18:02 lucbv

Okay, PR #2112 has merged now, so let us see whether our nightly build on Blake improves. I think quite a few unit tests should be passing now!

lucbv avatar Feb 19 '24 16:02 lucbv