NuRaft icon indicating copy to clipboard operation
NuRaft copied to clipboard

Data race in peer

Open as51340 opened this issue 11 months ago • 3 comments

I found data race using TSAN in my code. The issue seems to be in peer.

WARNING: ThreadSanitizer: data race (pid=182953)
  Write of size 8 at 0x7b5800000e50 by thread T5 (mutexes: write M2013, write M1987):
    #0 operator delete(void*) /root/memgraph/environment/toolchain/build/llvm-13.0.0/projects/compiler-rt/lib/tsan/rtl/tsan_new_delete.cpp:126 (coordinator_instance+0x2c2d215)
    #1 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/bits/shared_ptr_base.h:184 (coordinator_instance+0x486baa4)
    #2 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/bits/shared_ptr_base.h:705 (coordinator_instance+0x486baa4)
    #3 std::__shared_ptr<nuraft::peer, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr() /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/bits/shared_ptr_base.h:1154 (coordinator_instance+0x486baa4)
    #4 std::__shared_ptr<nuraft::peer, (__gnu_cxx::_Lock_policy)2>::reset() /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/bits/shared_ptr_base.h:1272 (coordinator_instance+0x486baa4)
    #5 nuraft::raft_server::reconfigure(std::shared_ptr<nuraft::cluster_config> const&) /home/andi/Memgraph/code/memgraph/libs/nuraft/src/handle_commit.cxx:717 (coordinator_instance+0x486baa4)

  Previous atomic read of size 1 at 0x7b5800000e50 by thread T4:
    #0 pthread_mutex_lock /root/memgraph/environment/toolchain/build/llvm-13.0.0/projects/compiler-rt/lib/tsan/../sanitizer_common/sanitizer_common_interceptors.inc:4249 (coordinator_instance+0x2ba6914)
    #1 __gthread_mutex_lock(pthread_mutex_t*) /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/x86_64-linux-gnu/bits/gthr-default.h:749 (coordinator_instance+0x4894ad7)
    #2 std::mutex::lock() /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/bits/std_mutex.h:100 (coordinator_instance+0x4894ad7)
    #3 std::lock_guard<std::mutex>::lock_guard(std::mutex&) /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/bits/std_mutex.h:229 (coordinator_instance+0x4894ad7)
    #4 nuraft::timer_helper::set_duration_us(unsigned long) /home/andi/Memgraph/code/memgraph/libs/nuraft/include/libnuraft/internal_timer.hxx:57 (coordinator_instance+0x4894ad7)
    #5 nuraft::timer_helper::set_duration_ms(unsigned long) /home/andi/Memgraph/code/memgraph/libs/nuraft/include/libnuraft/internal_timer.hxx:62 (coordinator_instance+0x4894ad7)
    #6 nuraft::peer::handle_rpc_result(std::shared_ptr<nuraft::peer>, std::shared_ptr<nuraft::rpc_client>, std::shared_ptr<nuraft::req_msg>&, std::shared_ptr<nuraft::cmd_result<std::shared_ptr<nuraft::resp_msg>, std::shared_ptr<nuraft::rpc_exception> > >&, std::shared_ptr<nuraft::resp_msg>&, std::shared_ptr<nuraft::rpc_exception>&) /home/andi/Memgraph/code/memgraph/libs/nuraft/src/peer.cxx:146 (coordinator_instance+0x4894ad7)

  Mutex M2013 (0x7b6c00001298) created at:
    #0 pthread_mutex_trylock /root/memgraph/environment/toolchain/build/llvm-13.0.0/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:1296 (coordinator_instance+0x2b9a324)
    #1 __gthread_mutex_trylock(pthread_mutex_t*) /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/x86_64-linux-gnu/bits/gthr-default.h:758 (coordinator_instance+0x4867b9f)
    #2 std::mutex::try_lock() /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/bits/std_mutex.h:111 (coordinator_instance+0x4867b9f)
    #3 std::unique_lock<std::mutex>::unique_lock(std::mutex&, std::try_to_lock_t) /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/bits/unique_lock.h:78 (coordinator_instance+0x4867b9f)
    #4 nuraft::raft_server::commit_in_bg_exec(unsigned long) /home/andi/Memgraph/code/memgraph/libs/nuraft/src/handle_commit.cxx:161 (coordinator_instance+0x4867b9f)

  Mutex M1987 (0x7b6c000011f0) created at:
    #0 pthread_mutex_lock /root/memgraph/environment/toolchain/build/llvm-13.0.0/projects/compiler-rt/lib/tsan/../sanitizer_common/sanitizer_common_interceptors.inc:4249 (coordinator_instance+0x2ba6914)
    #1 __gthread_mutex_lock(pthread_mutex_t*) /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/x86_64-linux-gnu/bits/gthr-default.h:749 (coordinator_instance+0x4880979)
    #2 __gthread_recursive_mutex_lock(pthread_mutex_t*) /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/x86_64-linux-gnu/bits/gthr-default.h:811 (coordinator_instance+0x4880979)
    #3 std::recursive_mutex::lock() /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/mutex:108 (coordinator_instance+0x4880979)
    #4 std::unique_lock<std::recursive_mutex>::lock() /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/bits/unique_lock.h:139 (coordinator_instance+0x4880979)
    #5 std::unique_lock<std::recursive_mutex>::unique_lock(std::recursive_mutex&) /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/bits/unique_lock.h:69 (coordinator_instance+0x4880979)
    #6 nuraft::raft_server::restart_election_timer() /home/andi/Memgraph/code/memgraph/libs/nuraft/src/handle_timeout.cxx:164 (coordinator_instance+0x4880979)
    #7 memgraph::coordination::RaftState::MakeRaftState(memgraph::coordination::CoordinatorInstanceInitConfig const&, std::function<void ()>&&, std::function<void ()>&&) /home/andi/Memgraph/code/memgraph/src/coordination/raft_state.cpp:104 (coordinator_instance+0x2dc1ace)
    #8 memgraph::coordination::CoordinatorInstance::CoordinatorInstance(memgraph::coordination::CoordinatorInstanceInitConfig const&) /home/andi/Memgraph/code/memgraph/src/coordination/coordinator_instance.cpp:36 (coordinator_instance+0x2cb89eb)
    #9 CoordinatorInstanceTest_ConnectCoordinators_Test::TestBody() /home/andi/Memgraph/code/memgraph/tests/unit/coordinator_instance.cpp:67 (coordinator_instance+0x2c30777)
    #10 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ??:? (coordinator_instance+0x2fa05b5)
    #11 main /home/andi/Memgraph/code/memgraph/tests/unit/main.cpp:19 (coordinator_instance+0x2cb83be)

  Thread T5 'nuraft_commit' (tid=182959, running) created by main thread at:
    #0 pthread_create /root/memgraph/environment/toolchain/build/llvm-13.0.0/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:977 (coordinator_instance+0x2bbb63a)
    #1 std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) /root/memgraph/environment/toolchain/build/gcc-11.2.0/build/x86_64-linux-gnu/libstdc++-v3/include/x86_64-linux-gnu/bits/gthr-default.h:663 (coordinator_instance+0x60720b9)
    #2 std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) /root/memgraph/environment/toolchain/build/gcc-11.2.0/build/x86_64-linux-gnu/libstdc++-v3/src/c++11/../../../../../libstdc++-v3/src/c++11/thread.cc:147 (coordinator_instance+0x60720b9)
    #3 memgraph::coordination::RaftState::MakeRaftState(memgraph::coordination::CoordinatorInstanceInitConfig const&, std::function<void ()>&&, std::function<void ()>&&) /home/andi/Memgraph/code/memgraph/src/coordination/raft_state.cpp:104 (coordinator_instance+0x2dc1ace)
    #4 memgraph::coordination::CoordinatorInstance::CoordinatorInstance(memgraph::coordination::CoordinatorInstanceInitConfig const&) /home/andi/Memgraph/code/memgraph/src/coordination/coordinator_instance.cpp:36 (coordinator_instance+0x2cb89eb)
    #5 CoordinatorInstanceTest_ConnectCoordinators_Test::TestBody() /home/andi/Memgraph/code/memgraph/tests/unit/coordinator_instance.cpp:67 (coordinator_instance+0x2c30777)
    #6 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ??:? (coordinator_instance+0x2fa05b5)
    #7 main /home/andi/Memgraph/code/memgraph/tests/unit/main.cpp:19 (coordinator_instance+0x2cb83be)

  Thread T4 'nuraft_w_0' (tid=182958, running) created by main thread at:
    #0 pthread_create /root/memgraph/environment/toolchain/build/llvm-13.0.0/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:977 (coordinator_instance+0x2bbb63a)
    #1 __gthread_create /root/memgraph/environment/toolchain/build/gcc-11.2.0/build/x86_64-linux-gnu/libstdc++-v3/include/x86_64-linux-gnu/bits/gthr-default.h:663 (coordinator_instance+0x60720b9)
    #2 std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) /root/memgraph/environment/toolchain/build/gcc-11.2.0/build/x86_64-linux-gnu/libstdc++-v3/src/c++11/../../../../../libstdc++-v3/src/c++11/thread.cc:147 (coordinator_instance+0x60720b9)
    #3 memgraph::coordination::RaftState::MakeRaftState(memgraph::coordination::CoordinatorInstanceInitConfig const&, std::function<void ()>&&, std::function<void ()>&&) /home/andi/Memgraph/code/memgraph/src/coordination/raft_state.cpp:104 (coordinator_instance+0x2dc1ace)
    #4 memgraph::coordination::CoordinatorInstance::CoordinatorInstance(memgraph::coordination::CoordinatorInstanceInitConfig const&) /home/andi/Memgraph/code/memgraph/src/coordination/coordinator_instance.cpp:36 (coordinator_instance+0x2cb89eb)
    #5 CoordinatorInstanceTest_ConnectCoordinators_Test::TestBody() /home/andi/Memgraph/code/memgraph/tests/unit/coordinator_instance.cpp:67 (coordinator_instance+0x2c30777)
    #6 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ??:? (coordinator_instance+0x2fa05b5)
    #7 main /home/andi/Memgraph/code/memgraph/tests/unit/main.cpp:19 (coordinator_instance+0x2cb83be)

SUMMARY: ThreadSanitizer: data race /root/memgraph/environment/toolchain/build/llvm-13.0.0/projects/compiler-rt/lib/tsan/rtl/tsan_new_delete.cpp:126 in operator delete(void*)
=

Relevant code that is using NuRaft here is related to adding servers, we do:

srv_config const srv_config_to_add(static_cast<int>(config.coordinator_id), endpoint);

  auto cmd_result = raft_server_->add_srv(srv_config_to_add);

  if (cmd_result->get_result_code() == nuraft::cmd_result_code::OK) {
    spdlog::info("Request to add server {} to the cluster accepted", endpoint);
  } else {
    throw RaftAddServerException("Failed to accept request to add server {} to the cluster with error code {}",
                                 endpoint, int(cmd_result->get_result_code()));
  }

  // Waiting for server to join
  constexpr int max_tries{10};
  auto maybe_stop = utils::ResettableCounter<max_tries>();
  constexpr int waiting_period{200};
  bool added{false};
  while (!maybe_stop()) {
    std::this_thread::sleep_for(std::chrono::milliseconds(waiting_period));
    const auto server_config = raft_server_->get_srv_config(static_cast<nuraft::int32>(config.coordinator_id));
    if (server_config) {
      spdlog::trace("Server with id {} added to cluster", config.coordinator_id);
      added = true;
      break;
    }
  }

  if (!added) {
    throw RaftAddServerException("Failed to add server {} to the cluster in {}ms", endpoint,
                                 max_tries * waiting_period);
  }

as51340 avatar Mar 21 '24 06:03 as51340

From what you provided, it seems there is a race between srv_to_join_.reset() and peer::lock_. However, that is not possible as peer::handle_rpc_result has myself not to make the reference counter of the shared pointer become 0. I haven't seen TSAN alert during the normal server add/removal process.

  • Can you please provide the exact commit hash of NuRaft that you used?
  • Also, do you get this TSAN alert every time you run the above code? If so, it will be great to share the logs generated by NuRaft to see what happened.

greensky00 avatar Mar 25 '24 07:03 greensky00

Thanks for the reply!

We are using NuRaft v2.1.0 Yes, TSAN alerts on every run.

Can you instruct me on how to get logs?

as51340 avatar Mar 28 '24 10:03 as51340

I tried the same code with the same NuRaft version, still didn't get TSAN alert.

Could you share your build environment? Such as OS version, compiler version, etc. Also please share Raft parameters (heartbeat, election lower/upper timeout, ...).

greensky00 avatar Apr 10 '24 06:04 greensky00