{serial,openmp}.sparse_spmv_mv failures with gcc/10 and gcc/10+armpl/21

Open e10harvey opened this issue 4 years ago • 5 comments

@lucbv, when standing up the A64FX CI testing I encountered this test failure. Can you investigate?

Snippet of ctest output:

4: [       OK ] serial.sparse_spmv_mv_struct_double_int_size_t_LayoutLeft_TestExecSpace (1 ms)
4: [ RUN      ] serial.sparse_spmv_mv_kokkos_complex_double_int_int_LayoutLeft_TestExecSpace
4: KokkosSparse::Test::spmv_mv: 200 errors of 200 for mv 12 (alpha=(2.5,0), beta=(0,0), mode = N)
4: /path/to/workspace/KokkosKernels_PullRequest_Tpls_ARMPL2110_Tpls_ARMPL2030_GCC1020/kokkos-kernels/unit_test/sparse/Test_Sparse_spmv.hpp:214: Failure
4: Value of: num_errors == 0
4:   Actual: false
4: Expected: true
4: KokkosSparse::Test::spmv_mv: 200 errors of 200 for mv 12 (alpha=(2.5,0), beta=(1,0), mode = N)
4: /path/to/workspace/KokkosKernels_PullRequest_Tpls_ARMPL2110_Tpls_ARMPL2030_GCC1020/kokkos-kernels/unit_test/sparse/Test_Sparse_spmv.hpp:214: Failure
4: Value of: num_errors == 0
4:   Actual: false
4: Expected: true
4: KokkosSparse::Test::spmv_mv: 200 errors of 200 for mv 12 (alpha=(2.5,0), beta=(-1,0), mode = N)
4: /path/to/workspace/KokkosKernels_PullRequest_Tpls_ARMPL2110_Tpls_ARMPL2030_GCC1020/kokkos-kernels/unit_test/sparse/Test_Sparse_spmv.hpp:214: Failure
4: Value of: num_errors == 0
4:   Actual: false
4: Expected: true
4: KokkosSparse::Test::spmv_mv: 200 errors of 200 for mv 12 (alpha=(2.5,0), beta=(2.5,0), mode = N)
4: /path/to/workspace/KokkosKernels_PullRequest_Tpls_ARMPL2110_Tpls_ARMPL2030_GCC1020/kokkos-kernels/unit_test/sparse/Test_Sparse_spmv.hpp:214: Failure
4: Value of: num_errors == 0
4:   Actual: false
4: Expected: true
4: KokkosSparse::Test::spmv_mv: 200 errors of 200 for mv 12 (alpha=(2.5,0), beta=(0,0), mode = C)
4: /path/to/workspace/KokkosKernels_PullRequest_Tpls_ARMPL2110_Tpls_ARMPL2030_GCC1020/kokkos-kernels/unit_test/sparse/Test_Sparse_spmv.hpp:214: Failure
4: Value of: num_errors == 0
4:   Actual: false
4: Expected: true
4: KokkosSparse::Test::spmv_mv: 200 errors of 200 for mv 12 (alpha=(2.5,0), beta=(1,0), mode = C)
4: /path/to/workspace/KokkosKernels_PullRequest_Tpls_ARMPL2110_Tpls_ARMPL2030_GCC1020/kokkos-kernels/unit_test/sparse/Test_Sparse_spmv.hpp:214: Failure
4: Value of: num_errors == 0
4:   Actual: false
4: Expected: true
4: KokkosSparse::Test::spmv_mv: 200 errors of 200 for mv 12 (alpha=(2.5,0), beta=(-1,0), mode = C)
4: /path/to/workspace/KokkosKernels_PullRequest_Tpls_ARMPL2110_Tpls_ARMPL2030_GCC1020/kokkos-kernels/unit_test/sparse/Test_Sparse_spmv.hpp:214: Failure
4: Value of: num_errors == 0
4:   Actual: false
4: Expected: true
4: KokkosSparse::Test::spmv_mv: 200 errors of 200 for mv 12 (alpha=(2.5,0), beta=(2.5,0), mode = C)
4: /path/to/workspace/KokkosKernels_PullRequest_Tpls_ARMPL2110_Tpls_ARMPL2030_GCC1020/kokkos-kernels/unit_test/sparse/Test_Sparse_spmv.hpp:214: Failure
4: Value of: num_errors == 0
4:   Actual: false
4: Expected: true
 4/17 Test  #4: sparse_serial ....................***Exception: SegFault 85.38 sec
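
For readers unfamiliar with the message format: each "N errors of M" line appears to report how many entries of one multivector column disagree with a reference result for a given alpha, beta, and mode, so "200 errors of 200" would mean every entry in that column is wrong. The sketch below illustrates that kind of check in plain Python. It is not the code in Test_Sparse_spmv.hpp; the function names, the CSR layout, the tolerance, and the tiny 2x2 example are all assumptions made for illustration, and only modes N (no transpose) and C (conjugate, no transpose) are covered.

```python
# Illustrative sketch only; NOT the code in Test_Sparse_spmv.hpp.
# All names, the tolerance, and the example data are hypothetical.

def spmv_mv_reference(row_ptr, col_idx, values, x, y, alpha, beta, mode="N"):
    """Return beta*Y + alpha*op(A)*X for a CSR matrix A and multivectors X, Y.

    x and y are lists of rows; each row holds one entry per vector (column).
    Mode "N" uses A as stored, mode "C" uses the element-wise conjugate of A.
    """
    if mode not in ("N", "C"):
        raise ValueError("this sketch only handles modes 'N' and 'C'")
    num_rows = len(row_ptr) - 1
    num_vecs = len(x[0])
    result = []
    for i in range(num_rows):
        acc = [0j] * num_vecs
        for k in range(row_ptr[i], row_ptr[i + 1]):
            a = values[k].conjugate() if mode == "C" else values[k]
            for j in range(num_vecs):
                acc[j] += a * x[col_idx[k]][j]
        result.append([beta * y[i][j] + alpha * acc[j] for j in range(num_vecs)])
    return result


def count_errors(expected, actual, column, tol=1e-12):
    """Count entries of one multivector column that differ beyond a relative tol."""
    errors = 0
    for exp_row, act_row in zip(expected, actual):
        e, a = exp_row[column], act_row[column]
        if abs(e - a) > tol * max(1.0, abs(e)):
            errors += 1
    return errors


# Tiny example: A = [[1+1j, 0], [2, 3-1j]] in CSR form, X = identity, Y = 0.
row_ptr = [0, 1, 3]
col_idx = [0, 0, 1]
values = [1 + 1j, 2 + 0j, 3 - 1j]
x = [[1 + 0j, 0j], [0j, 1 + 0j]]
y = [[0j, 0j], [0j, 0j]]

expected_n = spmv_mv_reference(row_ptr, col_idx, values, x, y, 2.5, 0, "N")
expected_c = spmv_mv_reference(row_ptr, col_idx, values, x, y, 2.5, 0, "C")

# A result that was never written (all zeros) fails in every entry of column 0.
all_zeros = [[0j, 0j], [0j, 0j]]
errors_col0 = count_errors(expected_n, all_zeros, column=0)
```

In this toy case `errors_col0` equals the number of rows, which is the same pattern as "200 errors of 200" in the log: a whole column is wrong rather than a few entries being slightly off.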

Reproducer instructions

cd kokkos/
git checkout -f 0d19eebfa26d076f551d5b7a43230f627887df21
cd ../kokkos-kernels/
git checkout -f f5d7490dee7751a5a3cff8242e7de9f6ad6fe5b2
cd ../
mkdir testing
cd testing/
../kokkos-kernels/scripts/cm_test_all_sandia --spot-check-tpls armpl/21.1.0 --with-tpls=armpl --kokkos-path=../kokkos --kokkoskernels-path=../kokkos-kernels

Note: This is reproducible with both OMP_NUM_THREADS=48 and 47.

Note that this only occurs in the spmv_mv_heavy test. See https://github.com/kokkos/kokkos-kernels/pull/1555/files#diff-451dcf2546f551c9894dd3e3820ba37ea8765ba3ae8de9fc31e04f248910fde2R568.

e10harvey, Feb 17 '22 16:02

CC: @jgfouca

e10harvey, Mar 08 '22 20:03

@lucbv: Are there any updates on this? Once this is resolved, we can enable Armpl CI checks and improve our code coverage.

e10harvey, Mar 31 '22 17:03

@e10harvey Sorry this took so long. PR #1412 might have fixed these; at least I hope it did, even though I did not build and test on Inouye. Let me know if you see any improvement.

lucbv, May 19 '22 01:05

Great! It's running now : )

e10harvey, May 19 '22 17:05

@lucbv: We're still seeing these errors with OMP_NUM_THREADS=48. See this console output for details. It may be worth testing with OMP_NUM_THREADS=47. I will disable spmv_mv on armpl for now so we can start protecting against regressions.

e10harvey, May 19 '22 17:05