Kokkos::Threads backend is broken on M2 MacBook (ARM) in develop
It still works in 4.3.01, but in develop many of the tests segfault when launched with more than one thread (e.g. export KOKKOS_NUM_THREADS=4).
For example:
s1087574:build crtrott$ containers/unit_tests/Kokkos_ContainersUnitTest_Threads
[==========] Running 65 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 65 tests from threads
[ RUN ] threads.bitset
Segmentation fault: 11
Configured:
cmake -DKokkos_ENABLE_THREADS=ON -DKokkos_ENABLE_TESTS=ON
Were you able to reproduce on another platform or is it Mac specific?
My gut feeling is that it's ARM specific, maybe also Power (i.e. relaxed memory semantics). I haven't explicitly tried, though.
I was able to reproduce on M1 Pro. It looks nondeterministic. Note that I did get a segfault on 4.3.01 after re-running a couple of times. (Just a data point, I am not investigating further for the time being.)
For the mentioned configuration and architecture, a variety of tests fail when run a few times, including unit tests in core. Are we not catching this in CI?
What are you seeing? For reference, we are using hwloc in the CI with the Threads backend (https://github.com/kokkos/kokkos/blob/develop/.github/workflows/continuous-integration-workflow.yml) so we are initializing with multiple threads.
17: [ RUN ] THREADS.incr_14_MDrangeReduce
17: /Users/6da/Software/kokkos/core/unit_test/incremental/Test14_MDRangeReduce.hpp:115: Failure
17: Expected equality of these values:
17: h_result
17: Which is: 1012.5
17: d_result
17: Which is: 720
This is what I'm seeing locally on my Mac M1 with #7402.
Has anyone actually checked that this does not occur on, say, x86?
That's what we have in our CI if I'm not mistaken.
Which build are you referring to? The GitHub workflow builds have access to few cores and AFAICT we don't specify how many threads to use at runtime which means we run on a single thread.
But we use hwloc which means we are using multiple threads, see https://github.com/kokkos/kokkos/pull/5109.
Never mind then.
Can you reproduce the error on your machine with only 2 threads?
With 2 threads everything seems to pass. It starts failing with >=4.
Quick update: this does not manifest on ARMv9, but it does manifest on Apple M1 and M2. Our atomics seem to be OK: changing them to enforce sequential consistency has no effect. I ran lldb on my M1 and it errors out at line 115 of Kokkos_Threads_Instance.cpp, regardless of the unit test. I added an assert just before line 115 to show the issue:
[ RUN ] threads.dispatch
Assertion failed: (s_current_function != NULL), function driver, file Kokkos_Threads_Instance.cpp, line 114.
Process 17766 stopped
* thread #6, stop reason = hit program assert
frame #4: 0x0000000101895dbc Kokkos_CoreUnitTest_Threads`Kokkos::Impl::ThreadsInternal::driver() at Kokkos_Threads_Instance.cpp:114:5
111 ThreadsInternal this_thread;
112
113 while (this_thread.m_pool_state == ThreadState::Active) {
-> 114 assert(s_current_function != NULL);
115 (*s_current_function)(this_thread, s_current_function_arg);
116
117 // Deactivate thread and wait for reactivation
Target 0: (Kokkos_CoreUnitTest_Threads) stopped.
(lldb) where
error: 'where' is not a valid command.
(lldb) bt
* thread #6, stop reason = hit program assert
frame #0: 0x000000018fa76a60 libsystem_kernel.dylib`__pthread_kill + 8
frame #1: 0x000000018faaec20 libsystem_pthread.dylib`pthread_kill + 288
frame #2: 0x000000018f9bba20 libsystem_c.dylib`abort + 180
frame #3: 0x000000018f9bad10 libsystem_c.dylib`__assert_rtn + 284
* frame #4: 0x0000000101895dbc Kokkos_CoreUnitTest_Threads`Kokkos::Impl::ThreadsInternal::driver() at Kokkos_Threads_Instance.cpp:114:5
frame #5: 0x00000001018978ec Kokkos_CoreUnitTest_Threads`Kokkos::Impl::(anonymous namespace)::internal_cppthread_driver() at Kokkos_Threads_Instance.cpp:49:5
frame #6: 0x00000001018989b4 Kokkos_CoreUnitTest_Threads`decltype(std::declval<void (*)()>()()) std::__1::__invoke[abi:ue170006]<void (*)()>(__f=0x0000600000fe18a8) at invoke.h:340:25
frame #7: 0x000000010189894c Kokkos_CoreUnitTest_Threads`void std::__1::__thread_execute[abi:ue170006]<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (*)()>(__t=size=2, (null)=__tuple_indices<> @ 0x00000001700b6f7f) at thread.h:227:5
frame #8: 0x00000001018982b0 Kokkos_CoreUnitTest_Threads`void* std::__1::__thread_proxy[abi:ue170006]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (*)()>>(__vp=0x0000600000fe18a0) at thread.h:238:5
frame #9: 0x000000018faaef94 libsystem_pthread.dylib`_pthread_start + 136
diff --git a/core/src/Threads/Kokkos_Threads_Instance.cpp b/core/src/Threads/Kokkos_Threads_Instance.cpp
index df6612bf9..5c898aa29 100644
--- a/core/src/Threads/Kokkos_Threads_Instance.cpp
+++ b/core/src/Threads/Kokkos_Threads_Instance.cpp
@@ -111,7 +111,8 @@ void ThreadsInternal::driver() {
ThreadsInternal this_thread;
while (this_thread.m_pool_state == ThreadState::Active) {
- (*s_current_function)(this_thread, s_current_function_arg);
+ auto* my_function_ptr = Kokkos::atomic_load(&s_current_function);
+ (*my_function_ptr)(this_thread, s_current_function_arg);
// Deactivate thread and wait for reactivation
this_thread.m_pool_state = ThreadState::Inactive;
This fixes it for me. I think it's safe to assume that declaring the function pointer as volatile doesn't have the desired effect.
Seems like it. Might be related to a specific clang version. My Mac runs 15. I'll install something newer and report back.
Tested with clang version 19. The issue persists even when using the atomic load on s_current_function as mentioned above.
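For context on the volatile-versus-atomic point raised above, here is a minimal standalone sketch of publishing a function pointer to a spinning worker with release/acquire ordering, using only the C++ standard library. It is not the Kokkos implementation and, given the clang 19 result, it is not claimed to be the fix for this issue. All names (`WorkFn`, `g_fn`, `g_arg`, `publish`, `worker`, `run_once`) are hypothetical and chosen for illustration only.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>

// Hypothetical names throughout; this is NOT Kokkos code.
using WorkFn = void (*)(int& out, const void* arg);

// Example work function: copies the int behind arg into out.
static void store_value(int& out, const void* arg) {
  out = *static_cast<const int*>(arg);
}

// Plain (non-atomic) payload; made visible by the release store in publish().
static const void* g_arg = nullptr;
// The function pointer doubles as the "work is ready" flag.
static std::atomic<WorkFn> g_fn{nullptr};

// Worker: spin until a function pointer has been published, then call it.
static void worker(int& out) {
  WorkFn fn = nullptr;
  // The acquire load pairs with the release store in publish(): once a
  // non-null pointer is observed, the earlier write to g_arg is visible too.
  while ((fn = g_fn.load(std::memory_order_acquire)) == nullptr) {
    std::this_thread::yield();
  }
  assert(fn != nullptr);
  fn(out, g_arg);
}

// Publisher: write the payload first, then publish the pointer last.
static void publish(WorkFn fn, const void* arg) {
  g_arg = arg;
  g_fn.store(fn, std::memory_order_release);
}

// Runs one publish/consume round trip and returns what the worker wrote.
int run_once(int value) {
  // Reset shared state before the worker thread exists.
  g_arg = nullptr;
  g_fn.store(nullptr, std::memory_order_relaxed);

  int result = -1;
  std::thread t(worker, std::ref(result));
  publish(&store_value, &value);
  t.join();  // join makes the worker's write to result visible here
  return result;
}
```

Calling `run_once(42)` from a `main` function starts one worker, publishes `store_value` together with its argument, and returns the value the worker wrote. The design point is that the pointer is stored last with release ordering and read with acquire ordering, which is an ordering guarantee that `volatile` alone does not give in standard C++.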