Nightly test failures in Cuda with rdc+uvm builds: cuda.sparse_block_spgemm tests
@lucbv I'm also seeing runtime test failures after the merge of PR #1099 (no changes were merged to kokkos the day this test began failing), for example in the cuda/10.0 build with rdc and uvm enabled:
08:38:11 4: [ RUN ] cuda.sparse_block_spgemm_kokkos_complex_double_int_int_TestExecSpace
08:38:11 4: /home/jenkins/weaver-new/workspace/KokkosKernels_Weaver_Cuda_cuda_100_gcc_740_rdc-uvm/kokkos-kernels/unit_test/sparse/Test_Sparse_bspgemm.hpp:274: Failure
08:38:11 4: Value of: is_expected_to_fail
08:38:11 4: Actual: false
08:38:11 4: Expected: true
08:38:11 4: SPGEMM_KK: Kokkos::Impl::ParallelFor< Cuda > requested too large team size.
08:38:11 4: /home/jenkins/weaver-new/workspace/KokkosKernels_Weaver_Cuda_cuda_100_gcc_740_rdc-uvm/kokkos-kernels/unit_test/sparse/Test_Sparse_bspgemm.hpp:277: Failure
08:38:11 4: Value of: failed
08:38:11 4: Actual: true
08:38:11 4: Expected: is_expected_to_fail
08:38:11 4: Which is: false
08:38:11 4: entries are different.
08:38:11 4: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... ... ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
08:38:11 4: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ... ... ... 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499
08:38:11 4: /home/jenkins/weaver-new/workspace/KokkosKernels_Weaver_Cuda_cuda_100_gcc_740_rdc-uvm/kokkos-kernels/unit_test/sparse/Test_Sparse_bspgemm.hpp:285: Failure
08:38:11 4: Value of: is_identical
08:38:11 4: Actual: false
08:38:11 4: Expected: true
08:38:11 4: SPGEMM_KK
08:38:11 4: [ FAILED ] cuda.sparse_block_spgemm_kokkos_complex_double_int_int_TestExecSpace (8487 ms)
08:38:11 4: [ RUN ] cuda.sparse_block_spgemm_kokkos_complex_double_int_size_t_TestExecSpace
08:38:11 4: /home/jenkins/weaver-new/workspace/KokkosKernels_Weaver_Cuda_cuda_100_gcc_740_rdc-uvm/kokkos-kernels/unit_test/sparse/Test_Sparse_bspgemm.hpp:274: Failure
08:38:11 4: Value of: is_expected_to_fail
08:38:11 4: Actual: false
08:38:11 4: Expected: true
08:38:11 4: SPGEMM_KK: Kokkos::Impl::ParallelFor< Cuda > requested too large team size.
08:38:11 4: /home/jenkins/weaver-new/workspace/KokkosKernels_Weaver_Cuda_cuda_100_gcc_740_rdc-uvm/kokkos-kernels/unit_test/sparse/Test_Sparse_bspgemm.hpp:277: Failure
08:38:11 4: Value of: failed
08:38:11 4: Actual: true
08:38:11 4: Expected: is_expected_to_fail
08:38:11 4: Which is: false
08:38:11 4: entries are different.
08:38:11 4: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... ... ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
08:38:11 4: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ... ... ... 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499
08:38:11 4: /home/jenkins/weaver-new/workspace/KokkosKernels_Weaver_Cuda_cuda_100_gcc_740_rdc-uvm/kokkos-kernels/unit_test/sparse/Test_Sparse_bspgemm.hpp:285: Failure
08:38:11 4: Value of: is_identical
08:38:11 4: Actual: false
08:38:11 4: Expected: true
08:38:11 4: SPGEMM_KK
08:38:11 4: [ FAILED ] cuda.sparse_block_spgemm_kokkos_complex_double_int_size_t_TestExecSpace (8494 ms)
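For anyone reading the log above: the three failures per test (lines 274, 277, and 285 of Test_Sparse_bspgemm.hpp) look like one root cause cascading, not three separate bugs. Below is a minimal, self-contained sketch of that control flow as I read it from the log. It does not use Kokkos or GoogleTest, and `run_spgemm_sketch`, `CheckReport`, `run_checks`, and the team-size numbers are hypothetical names and values chosen for illustration, not the actual test code.

```cpp
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the SPGEMM_KK launch: throws when the requested
// team size exceeds the device limit, mirroring the message in the log.
std::vector<int> run_spgemm_sketch(int requested_team_size, int max_team_size,
                                   int n) {
  if (requested_team_size > max_team_size) {
    throw std::runtime_error("requested too large team size.");
  }
  std::vector<int> entries(n);
  for (int i = 0; i < n; ++i) entries[i] = i;  // the "0 1 2 ... 499" row
  return entries;
}

// One flag per check that fails in the log (assumed mapping to line numbers).
struct CheckReport {
  bool check_274_ok;  // is_expected_to_fail must be true if anything is thrown
  bool check_277_ok;  // failed must equal is_expected_to_fail
  bool check_285_ok;  // computed entries must match the reference
};

CheckReport run_checks(int requested_team_size, int max_team_size,
                       bool is_expected_to_fail) {
  const int n = 500;
  std::vector<int> expected(n);
  for (int i = 0; i < n; ++i) expected[i] = i;

  std::vector<int> actual(n, 0);  // stays all zero if the kernel never runs
  bool failed = false;
  CheckReport report{true, true, true};

  try {
    actual = run_spgemm_sketch(requested_team_size, max_team_size, n);
  } catch (const std::exception&) {
    report.check_274_ok = is_expected_to_fail;  // first failure in the log
    failed = true;
  }
  report.check_277_ok = (is_expected_to_fail == failed);  // second failure
  if (!is_expected_to_fail) {
    report.check_285_ok = (actual == expected);  // third failure: zeros vs 0..499
  }
  return report;
}
```

With these made-up numbers, `run_checks(1024, 256, false)` trips all three checks, while `run_checks(128, 256, false)` passes all of them. If that reading is right, anything that keeps the launch from requesting an oversized team should clear all three failures at once.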
Reproducer (weaver):
module load cmake/3.19.3 cuda/10.0.130 ibm/xl/16.1.1 gcc/7.4.0
$KOKKOSKERNELS_PATH/cm_generate_makefile.bash --with-devices=Cuda,Serial --arch=Power9,Volta70 --compiler=$KOKKOS_PATH/bin/nvcc_wrapper --cxxflags="-O3 -Wall -Wunused-parameter -Wshadow -pedantic -Werror -Wsign-compare -Wtype-limits -Wuninitialized " --cxxstandard="14" --kokkos-path=$KOKKOS_PATH --kokkoskernels-path=$KOKKOSKERNELS_PATH --with-scalars='double,complex_double' --with-ordinals=int --with-offsets=int,size_t --with-layouts=LayoutLeft --with-cuda-options=enable_lambda,uvm,rdc --no-examples
Edit: This also occurs with cuda/9.2.88 on the same system.
Originally posted by @ndellingwood in https://github.com/kokkos/kokkos-kernels/issues/1395#issuecomment-1115151520
I split this issue out from #1395 (filed as build errors), which had collected various nightly failures following #1099.
This should be resolved by @brian-kelley's PR #1470. @ndellingwood, the PR was merged this morning, so let's keep an eye on the nightlies tomorrow; hopefully we will see the uvm+rdc build passing.
The rdc+uvm nightlies that included #1470 are passing again. Thanks for the fix!