tests: enable random shuffling on a subset of the CI

Open romintomasetti opened this issue 1 year ago • 20 comments

Tackling #7435.

romintomasetti avatar Oct 15 '24 02:10 romintomasetti

13: [ RUN      ] defaultdevicetype.reduce_instantiation_c2
13/57 Test #13: Kokkos_CoreUnitTest_Default ................................***Exception: SegFault  0.97 sec

in CUDA RDC build

dalg24 avatar Oct 15 '24 11:10 dalg24

13: [ RUN      ] defaultdevicetype.reduce_instantiation_c2
13/57 Test #13: Kokkos_CoreUnitTest_Default ................................***Exception: SegFault  0.97 sec

in CUDA RDC build

I could reproduce this using the seed 10003. It fails every time (I repeated the test 5 times on my laptop).

Filtering the test cases with --gtest_filter=*reduce_instantiation_c2* (ensuring I'm running that failing test only) also triggers the seg fault.
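For context on why a seed makes the failure reproducible: GoogleTest offers `--gtest_shuffle` and `--gtest_random_seed=<seed>`, and a fixed seed yields the same test order on every run. The exact command line used here is not shown in the thread, so that part is an assumption. The sketch below is not GoogleTest's shuffling code; it only illustrates the idea with `std::shuffle`, and `shuffled_order` is a made-up helper name.

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <string>
#include <vector>

// Return a shuffled copy of `tests`, using a generator seeded with `seed`.
// Illustration only: this is NOT how GoogleTest implements --gtest_shuffle.
std::vector<std::string> shuffled_order(std::vector<std::string> tests,
                                        unsigned seed) {
  std::mt19937 gen(seed);  // same seed, same pseudo-random sequence
  std::shuffle(tests.begin(), tests.end(), gen);
  return tests;
}
```

Calling `shuffled_order(tests, 10003)` twice on the same machine and standard library gives the same permutation, which is what makes an order-dependent failure repeatable once the seed is known.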

romintomasetti avatar Oct 16 '24 13:10 romintomasetti

This might help to at least build again: -Xnvlink --suppress-stack-size-warning

See: https://github.com/kokkos/kokkos/issues/2039

crtrott avatar Oct 16 '24 19:10 crtrott

I am all in favor of this once we resolve that failure.

dalg24 avatar Oct 17 '24 22:10 dalg24

Just managed to get a trace with `cuda-gdb`:
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from defaultdevicetype
[ RUN      ] defaultdevicetype.reduce_instantiation_c2

Thread 1 "Kokkos_CoreUnit" received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(cuda-gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x0000555555643d96 in __nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(int, void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda>), &(void Test::TestReduceCombinatoricalInstantiation<Kokkos::Cuda>::AddLambdaRange<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda> >(int, void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda>)), 1u>, void (int const&, double&)>::__nv_hdl_wrapper_t(__nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(int, void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda>), &(void Test::TestReduceCombinatoricalInstantiation<Kokkos::Cuda>::AddLambdaRange<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda> >(int, void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda>)), 1u>, void (int const&, double&)> const&) (in=..., this=<optimized out>) at nvcc_internal_extended_lambda_implementation:236
#2  Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::RangePolicy<Kokkos::Cuda>, __nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(int, void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda>), &(void Test::TestReduceCombinatoricalInstantiation<Kokkos::Cuda>::AddLambdaRange<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda> >(int, void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda>)), 1u>, void (int const&, double&)>, double>::Reducer::Reducer(__nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(int, void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda>), &(void Test::TestReduceCombinatoricalInstantiation<Kokkos::Cuda>::AddLambdaRange<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda> >(int, void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda>)), 1u>, void (int const&, double&)> const&) (arg_functor=..., this=<optimized out>)
    at /workspaces/kokkos/build-with-cuda-11.0-nvcc-rdc-install/install/include/impl/Kokkos_FunctorAnalysis.hpp:995
#3  Kokkos::Impl::ParallelReduceAdaptor<Kokkos::RangePolicy<Kokkos::Cuda>, __nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(int, void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda>), &(void Test::TestReduceCombinatoricalInstantiation<Kokkos::Cuda>::AddLambdaRange<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda> >(int, void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda>)), 1u>, void (int const&, double&)>, double>::execute_impl(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, Kokkos::RangePolicy<Kokkos::Cuda> const&, __nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(int, void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda>), &(void Test::TestReduceCombinatoricalInstantiation<Kokkos::Cuda>::AddLambdaRange<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda> >(int, void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda>)), 1u>, void (int const&, double&)> const&, double&) (return_value=@0x7fffffffd388: 99, functor=..., policy=..., label=...)
    at /workspaces/kokkos/build-with-cuda-11.0-nvcc-rdc-install/install/include/Kokkos_Parallel_Reduce.hpp:1525
#4  Kokkos::Impl::ParallelReduceAdaptor<Kokkos::RangePolicy<Kokkos::Cuda>, __nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(int, void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda>), &(void Test::TestReduceCombinatoricalInstantiation<Kokkos::Cuda>::AddLambdaRange<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda> >(int, void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda>)), 1u>, void (int const&, double&)>, double>::execute<double>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, Kokkos::RangePolicy<Kokkos::Cuda> const&, __nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(int, void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda>), &(void Test::TestReduceCombinatoricalInstantiation<Kokkos::Cuda>::AddLambdaRange<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda> >(int, void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda>)), 1u>, void (int const&, double&)> const&, double&) (return_value=@0x7fffffffd388: 99, functor=..., policy=..., label=...)
    at /workspaces/kokkos/build-with-cuda-11.0-nvcc-rdc-install/install/include/Kokkos_Parallel_Reduce.hpp:1555
#5  Kokkos::parallel_reduce<Kokkos::RangePolicy<Kokkos::Cuda>, __nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(int, void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda>), &(void Test::TestReduceCombinatoricalInstantiation<Kokkos::Cuda>::AddLambdaRange<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda> >(int, void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda>)), 1u>, void (int const&, double&)>, double>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, Kokkos::RangePolicy<Kokkos::Cuda> const&, __nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(int, void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda>), &(void Test::TestReduceCombinatoricalInstantiation<Kokkos::Cuda>::AddLambdaRange<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda> >(int, void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda>)), 1u>, void (int const&, double&)> const&, double&) (return_value=@0x7fffffffd388: 99, functor=..., policy=..., label=...)
    at /workspaces/kokkos/build-with-cuda-11.0-nvcc-rdc-install/install/include/Kokkos_Parallel_Reduce.hpp:1687
#6  Test::TestReduceCombinatoricalInstantiation<Kokkos::Cuda>::AddReturnArgument<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda>, __nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(int, void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda>), &(void Test::TestReduceCombinatoricalInstantiation<Kokkos::Cuda>::AddLambdaRange<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda> >(int, void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda>)), 1u>, void (int const&, double&)> >(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda>, __nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(int, void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda>), &(void Test::TestReduceCombinatoricalInstantiation<Kokkos::Cuda>::AddLambdaRange<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda> >(int, void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda>)), 1u>, void (int const&, double&)>) (N=1000)
    at /workspaces/kokkos/core/unit_test/TestReduceCombinatorical.hpp:353
#7  0x00005555556d8732 in Test::TestReduceCombinatoricalInstantiation<Kokkos::Cuda>::AddLambdaRange<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda> > (N=1000) at /usr/include/c++/9/bits/basic_string.h:940
#8  Test::TestReduceCombinatoricalInstantiation<Kokkos::Cuda>::AddFunctorLambdaRange<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::RangePolicy<Kokkos::Cuda> > (N=1000) at /workspaces/kokkos/core/unit_test/TestReduceCombinatorical.hpp:495
#9  Test::TestReduceCombinatoricalInstantiation<Kokkos::Cuda>::AddPolicy_2<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > (
    N=N@entry=1000) at /workspaces/kokkos/core/unit_test/TestReduceCombinatorical.hpp:523
#10 0x00005555556b4afc in Test::TestReduceCombinatoricalInstantiation<Kokkos::Cuda>::execute_c2 () at /usr/include/c++/9/bits/char_traits.h:300
#11 Test::defaultdevicetype_reduce_instantiation_c2_Test::TestBody (this=<optimized out>)
    at /workspaces/kokkos/core/unit_test/default/TestDefaultDeviceType_c2.cpp:29
#12 0x0000555555880051 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (location=0x5555558f2905 "the test body", 
    method=<optimized out>, object=0x555559f42f40) at /workspaces/kokkos/tpls/gtest/gtest/gtest-all.cc:4082
#13 testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=object@entry=0x555559f42f40, method=<optimized out>, 
    location=location@entry=0x5555558f2905 "the test body") at /workspaces/kokkos/tpls/gtest/gtest/gtest-all.cc:4137
#14 0x0000555555871440 in testing::Test::Run (this=this@entry=0x555559f42f40) at /workspaces/kokkos/tpls/gtest/gtest/gtest-all.cc:4176
#15 0x00005555558718d5 in testing::Test::Run (this=0x555559f42f40) at /workspaces/kokkos/tpls/gtest/gtest/gtest-all.cc:4326
#16 testing::TestInfo::Run (this=0x5555577b5180) at /workspaces/kokkos/tpls/gtest/gtest/gtest-all.cc:4326
#17 0x0000555555872031 in testing::TestInfo::Run (this=<optimized out>) at /workspaces/kokkos/tpls/gtest/gtest/gtest-all.cc:4299
#18 testing::TestSuite::Run (this=0x5555577b1910) at /workspaces/kokkos/tpls/gtest/gtest/gtest-all.cc:4480
#19 0x00005555558737b9 in testing::TestSuite::Run (this=<optimized out>) at /workspaces/kokkos/tpls/gtest/gtest/gtest-all.cc:4459
#20 testing::internal::UnitTestImpl::RunAllTests (this=0x5555577b1540) at /workspaces/kokkos/tpls/gtest/gtest/gtest-all.cc:7320
#21 0x0000555555873ce8 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (
    location=0x5555558f42f0 "auxiliary test code (environments or event listeners)", method=<optimized out>, object=0x5555577b1540)
    at /workspaces/kokkos/tpls/gtest/gtest/gtest-all.cc:4082
#22 testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (
    location=0x5555558f42f0 "auxiliary test code (environments or event listeners)", 
    method=(bool (testing::internal::UnitTestImpl::*)(testing::internal::UnitTestImpl * const)) 0x5555558726d0 <testing::internal::UnitTestImpl::RunAllTests()>, 
    object=0x5555577b1540) at /workspaces/kokkos/tpls/gtest/gtest/gtest-all.cc:4137
#23 testing::UnitTest::Run (this=<optimized out>) at /workspaces/kokkos/tpls/gtest/gtest/gtest-all.cc:6903
#24 0x000055555557b217 in RUN_ALL_TESTS () at /workspaces/kokkos/tpls/gtest/gtest/gtest.h:12371
#25 main (argc=<optimized out>, argv=0x7fffffffda98) at /workspaces/kokkos/core/unit_test/UnitTestMainInit.cpp:26
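
A note on reading the trace: frame #0 at address 0x0, reached from the wrapper's copy constructor in frame #1, is what a call through a null function pointer typically looks like. Whether that is the actual cause here is not established. The sketch below only illustrates that general failure mode with a hypothetical type-erased helper; `ErasedHelper`, `register_callable` and `guarded_copy` are invented names and do not describe nvcc's real implementation.

```cpp
#include <cassert>

// Hypothetical type-erasure helper (NOT nvcc's code). The copy routine is
// stored in a static function pointer that stays null until registered.
struct ErasedHelper {
  using copier_t = void* (*)(const void*);
  static copier_t copier;
};
ErasedHelper::copier_t ErasedHelper::copier = nullptr;

// Concrete copy routine for a callable of type F.
template <class F>
void* copy_impl(const void* p) {
  return new F(*static_cast<const F*>(p));
}

// Registration step: after this call, copies of F go through copy_impl<F>.
template <class F>
void register_callable() {
  ErasedHelper::copier = &copy_impl<F>;
}

// Guarded copy: returns nullptr instead of jumping to address 0 when
// nothing has been registered yet. An unguarded call would crash.
inline void* guarded_copy(const void* object) {
  if (ErasedHelper::copier == nullptr) return nullptr;
  return ErasedHelper::copier(object);
}
```

In this toy model, copying before `register_callable<F>()` has run is the analogue of the jump to 0x0; the guard makes the condition observable instead of fatal.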

romintomasetti avatar Oct 19 '24 03:10 romintomasetti

	  3 - Kokkos_CoreUnitTest_HIP (Timeout)

looks suspicious.

masterleinad avatar Oct 19 '24 12:10 masterleinad

Retest this please.

masterleinad avatar Oct 19 '24 12:10 masterleinad

Looks like it timed out again ...

crtrott avatar Oct 21 '24 15:10 crtrott

Reproduced the CUDA failure. It only occurs in release mode, not debug. I'm testing with different CUDA versions to see if it's specific to 11.0.

Also, seems the last commit fixed the issue for cuda-11.0-RDC build, but not the cuda-11.0.3-clang-tidy one.

tcclevenger avatar Oct 21 '24 22:10 tcclevenger

Also, seems the last commit fixed the issue for cuda-11.0-RDC build, but not the cuda-11.0.3-clang-tidy one.

That's just the shared memory test that fails occasionally depending on other stuff running simultaneously.

masterleinad avatar Oct 22 '24 13:10 masterleinad

I can't reproduce the HIP test failure locally. Let's try rerunning the CI one more time.

masterleinad avatar Oct 22 '24 13:10 masterleinad

Retest this please.

masterleinad avatar Oct 22 '24 13:10 masterleinad

That's just the shared memory test that fails occasionally depending on other stuff running simultaneously.

I think the Default test was also failing for that configuration? We will see with this current round of CI.

tcclevenger avatar Oct 22 '24 14:10 tcclevenger

After a little more testing:

  • It is an ordering issue among the --gtest_filter=*reduce_instantiation* tests.
    • You can find orders of the reduce_instantiation tests that pass or fail depending on when _c2 is called.
    • I still need to find exactly which tests need to run first.
  • I triggered it using CUDA 12.4, so it is not a CUDA 11.0-specific issue.
  • It occurs in release builds only.
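
One way to narrow down which tests need to run first is a greedy reduction over the list of predecessors: drop one test at a time and keep the removal whenever the target still fails. The sketch below assumes a caller-supplied oracle and uses a simulated one (`simulated_fails`) with placeholder test names; none of these names come from Kokkos. In practice the oracle would have to run the real test binary with the chosen tests.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

using TestList = std::vector<std::string>;

// Greedy reduction: try removing each predecessor in turn and keep the
// removal if the target test still fails. `fails` receives the tests that
// run before the target and reports whether the target then fails.
TestList minimize_predecessors(
    TestList predecessors,
    const std::function<bool(const TestList&)>& fails) {
  std::size_t i = 0;
  while (i < predecessors.size()) {
    TestList candidate = predecessors;
    candidate.erase(candidate.begin() + static_cast<std::ptrdiff_t>(i));
    if (fails(candidate)) {
      predecessors = candidate;  // entry i was not needed for the failure
    } else {
      ++i;  // entry i is required; keep it and move on
    }
  }
  return predecessors;
}

// Simulated oracle (purely hypothetical): the target fails only if both
// "test_b" and "test_d" ran before it.
inline bool simulated_fails(const TestList& before) {
  const auto has = [&](const std::string& name) {
    return std::find(before.begin(), before.end(), name) != before.end();
  };
  return has("test_b") && has("test_d");
}
```

The greedy pass needs at most one oracle call per predecessor, but it assumes the failure depends only on which tests ran, not on their relative order; an order-sensitive trigger would need a more careful search.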

tcclevenger avatar Oct 22 '24 18:10 tcclevenger

Retest this please.

janciesko avatar Nov 08 '24 19:11 janciesko

Do we still need to fix this?

crtrott avatar Dec 09 '24 17:12 crtrott

Retest this please

dalg24 avatar Jan 10 '25 14:01 dalg24

Retest this please

dalg24 avatar Jan 10 '25 17:01 dalg24

One of the ORNL CUDA builds failed in OpenMP subview. There was no error message; it just failed to build that object file. I think this should be unrelated. Thoughts?

crtrott avatar Jan 13 '25 15:01 crtrott

I think it's fine.

masterleinad avatar Jan 13 '25 15:01 masterleinad

Retest this please

dalg24 avatar Sep 23 '25 18:09 dalg24