kokkos-kernels
Test failures in clang >= 10 + cuda builds
In clang+cuda builds (clang/10+cuda/10.1 and clang/13+cuda/11.7 tested), the following unit tests fail on the develop and release-candidate-3.7.00 branches:
sparse_cuda:
[ RUN ] cuda.sparse_spgemm_double_int_size_t_TestExecSpace
entries are different.
0 2 3 5 8 11 12 13 15 16 19 20 24 32 34 36 37 38 41 42 ... ... ... 9963 9966 9968 9969 9971 9973 9974 9975 9980 9981 9982 9983 9986 9987 9988 9991 9993 9994 9995 9999
0 2 3 5 8 11 12 13 15 16 19 20 24 32 34 36 37 38 41 42 ... ... ... 9963 9966 9968 9969 9971 9973 9974 9975 9980 9981 9982 9983 9986 9987 9988 9991 9993 9994 9995 9999
/ascldap/users/ndellin/kokkos-kernels/unit_test/sparse/Test_Sparse_spgemm.hpp:360: Failure
Value of: is_identical
Actual: false
Expected: true
SPGEMM_KK
...
[ RUN ] cuda.sparse_block_spgemm_double_int_size_t_TestExecSpace
entries are different.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ... ... ... 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ... ... ... 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
/ascldap/users/ndellin/kokkos-kernels/unit_test/sparse/Test_Sparse_bspgemm.hpp:286: Failure
Value of: is_identical
Actual: false
Expected: true
SPGEMM_KK
entries are different.
1 1 2 3 4 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ... ... ... 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ... ... ... 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499
/ascldap/users/ndellin/kokkos-kernels/unit_test/sparse/Test_Sparse_bspgemm.hpp:286: Failure
Value of: is_identical
Actual: false
Expected: true
SPGEMM_KK
...
[ FAILED ] cuda.sparse_spgemm_double_int_size_t_TestExecSpace
[ FAILED ] cuda.sparse_block_spgemm_double_int_size_t_TestExecSpace
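In the first two reports the printed row prefixes are identical, so the mismatch must lie in the elided part of the rows or in data that is not printed. Only the last report differs visibly. As a rough reading aid (not part of the test suite), the sketch below compares the leading 20 entries of that last report position by position. The variable names are made up for illustration, and the log does not say which row is the computed result and which is the reference.

```shell
# Leading 20 entries of the two rows from the last "entries are different." report.
# The log elides the middle of each row, so only the printed prefix is compared.
row_a="1 1 2 3 4 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20"
row_b="0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19"

# Print every zero-based position where the two rows disagree.
mismatches=$(awk -v a="$row_a" -v b="$row_b" 'BEGIN {
    n = split(a, x, " "); split(b, y, " ")
    for (i = 1; i <= n; i++)
        if (x[i] != y[i]) printf "%d:%s!=%s\n", i - 1, x[i], y[i]
}')
echo "$mismatches"

first_mismatch=$(echo "$mismatches" | head -n 1)
mismatch_count=$(echo "$mismatches" | wc -l | tr -d ' ')
echo "first mismatch: $first_mismatch, mismatching positions: $mismatch_count of 20"
```

In this prefix, position 0 differs (1 vs 0), positions 1 to 4 agree, and every position from 5 onward is off by one. The printed tails of both rows (480 to 499) are identical.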
batched_dla_cuda: timeout
[ RUN ] cuda.batched_scalar_batched_gemm_nt_nt_bhalf_bhalf_left
[ OK ] cuda.batched_scalar_batched_gemm_nt_nt_bhalf_bhalf_left (104915 ms)
[ RUN ] cuda.batched_scalar_batched_gemm_t_nt_bhalf_bhalf_left
[ OK ] cuda.batched_scalar_batched_gemm_t_nt_bhalf_bhalf_left (105098 ms)
[ RUN ] cuda.batched_scalar_batched_gemm_nt_t_bhalf_bhalf_left
[ OK ] cuda.batched_scalar_batched_gemm_nt_t_bhalf_bhalf_left (104866 ms)
[ RUN ] cuda.batched_scalar_batched_gemm_t_t_bhalf_bhalf_left
[ OK ] cuda.batched_scalar_batched_gemm_t_t_bhalf_bhalf_left (105015 ms)
[ RUN ] cuda.batched_scalar_batched_gemm_nt_nt_bhalf_bhalf_right
[ OK ] cuda.batched_scalar_batched_gemm_nt_nt_bhalf_bhalf_right (115381 ms)
[ RUN ] cuda.batched_scalar_batched_gemm_t_nt_bhalf_bhalf_right
[ OK ] cuda.batched_scalar_batched_gemm_t_nt_bhalf_bhalf_right (115601 ms)
[ RUN ] cuda.batched_scalar_batched_gemm_nt_t_bhalf_bhalf_right
[ OK ] cuda.batched_scalar_batched_gemm_nt_t_bhalf_bhalf_right (115549 ms)
[ RUN ] cuda.batched_scalar_batched_gemm_t_t_bhalf_bhalf_right
[ OK ] cuda.batched_scalar_batched_gemm_t_t_bhalf_bhalf_right (115463 ms)
[ RUN ] cuda.batched_scalar_batched_gemm_nt_nt_half_half_left
[ OK ] cuda.batched_scalar_batched_gemm_nt_nt_half_half_left (165243 ms)
[ RUN ] cuda.batched_scalar_batched_gemm_t_nt_half_half_left
[ OK ] cuda.batched_scalar_batched_gemm_t_nt_half_half_left (165299 ms)
[ RUN ] cuda.batched_scalar_batched_gemm_nt_t_half_half_left
# Timeout here
Reproducer (kokkos-dev-2):
source /projects/sems/modulefiles/utils/sems-archive-modules-init.sh
module load sems-archive-env
module load sems-archive-gcc/7.3.0 sems-archive-clang/10.0.0 sems-archive-cuda/10.1 sems-archive-cmake/3.19.1
$KOKKOSKERNELS_PATH/cm_generate_makefile.bash --with-cuda --compiler=clang++ --arch=Volta70
The batched gemm tests do take more cycles than our other unit-tests; I suggest increasing the timeout.
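To put a number on the batched_dla_cuda log in the original post, the sketch below sums the durations of the ten batched gemm tests that completed before the timeout. The values are copied from the log; the variable names are illustrative only.

```shell
# Durations (ms) of the ten batched gemm tests that reported OK in the log.
timings_ms="104915 105098 104866 105015 115381 115601 115549 115463 165243 165299"

total_ms=0
for t in $timings_ms; do
    total_ms=$((total_ms + t))
done
total_s=$((total_ms / 1000))

# prints "completed tests: 1212430 ms (~1212 s, ~20 min)"
echo "completed tests: ${total_ms} ms (~${total_s} s, ~$((total_s / 60)) min)"
```

That covers only the tests shown in the log; any tests that ran earlier in the same executable add to the total. If the tests are driven through ctest, its `--timeout <seconds>` option sets the default per-test limit. The right value is a judgment call that this thread does not settle.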
@ndellingwood: Are the batched gemm timeouts a recent regression or are you running these tests for the first time with clang >= 10 + cuda?
@e10harvey these tests passed previously, though I can't recall a date or SHA to give better info
I'm going to rerun the tests toggling Kokkos_ENABLE_COMPLEX_ALIGN. That option had an impact on https://github.com/kokkos/kokkos/issues/5312, so this may be an underlying Kokkos issue. Will post back once I finish testing.
I tested builds with -DKokkos_ENABLE_COMPLEX_ALIGN=ON and -DKokkos_ENABLE_COMPLEX_ALIGN=OFF; the cuda.sparse_spgemm_double_int_size_t_TestExecSpace and cuda.sparse_block_spgemm_double_int_size_t_TestExecSpace tests fail in either case.
I rebuilt with -DKokkos_ENABLE_DEBUG=ON and -DKokkos_ENABLE_DEBUG_BOUNDS_CHECK=ON, but there was no additional useful diagnostic info beyond the output in the OP.
Also seeing these warnings with this build:
ptxas warning : Unresolved extern variable '_ZN6Kokkos12_GLOBAL__N_13ALLE' in whole program compilation, ignoring extern qualifier
Demangling _ZN6Kokkos12_GLOBAL__N_13ALLE:
[ndellin@kokkos-dev-2 Clang10Cuda101Sems-aligntest]$ c++filt -t _ZN6Kokkos12_GLOBAL__N_13ALLE
Kokkos::(anonymous namespace)::ALL
Yeah, that happened a bunch with OpenMP Target too. I will ask about it on the Kokkos channel, though I'm not sure it's related. I also have a build going on kokkos-dev-2, so I should be able to assess this problem soon.
@lucbv reproduced the failure, and it has been present since at least the 3.6.00 release; removing the blocker on 3.7.00 promotion.