rmw_fastrtps icon indicating copy to clipboard operation
rmw_fastrtps copied to clipboard

Segfault in MultiThreadedExecutor using `rmw_fastrtps_cpp`

Open sea-bass opened this issue 2 years ago • 11 comments
trafficstars

Bug report

Required Info:

  • Operating System:
    • Ubuntu 22.04 (inside a Docker container)
  • Installation type:
    • Binaries
  • Version or commit hash:
    • The latest available binaries on Humble
  • DDS implementation:
    • rmw_fastrtps_cpp
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

We unfortunately don't have a full reproduction case, but we are creating a node with a multi-threaded executor as follows:

rclcpp::executors::MultiThreadedExecutor exec(rclcpp::ExecutorOptions(), 0, false, std::chrono::milliseconds(250));

The segfault we see is:

[objective_server_node_main-8] [ERROR] [1684327668.068245959] [objective_server_node]: Execution of behavior tree for objective `3 Waypoints Pick and Place` did not succeed.
[objective_server_node_main-8] Stack trace (most recent call last) in thread 833:
[objective_server_node_main-8] #8    Object "/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1", at 0xffffffffffffffff, in 
[objective_server_node_main-8] #7    Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0xffff9e5a5d1b, in 
[objective_server_node_main-8] #6    Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0xffff9e53d5c7, in 
[objective_server_node_main-8] #5    Object "/usr/lib/aarch64-linux-gnu/libstdc++.so.6.0.30", at 0xffff9e7731fb, in 
[objective_server_node_main-8] [1684327680.555]: Sequence                  IDLE -> RUNNING
[objective_server_node_main-8] [1684327680.555]: ClearSnapshot             IDLE -> RUNNING
[objective_server_node_main-8] #4    Source "./src/rclcpp/executors/multi_threaded_executor.cpp", line 85, in run [0xffff9edbffc7]
[objective_server_node_main-8] #3    Source "./src/rclcpp/executor.cpp", line 912, in get_next_executable [0xffff9edb8b5b]
[objective_server_node_main-8] #2    Source "./src/rclcpp/executor.cpp", line 756, in wait_for_work [0xffff9edb5bdf]
[objective_server_node_main-8] #1    Object "/opt/ros/humble/lib/librcl.so", at 0xffff9e185b97, in rcl_wait
[objective_server_node_main-8] #0    Object "/opt/ros/humble/lib/librmw_fastrtps_shared_cpp.so", at 0xffff9d5f5718, in rmw_fastrtps_shared_cpp::__rmw_wait(char const*, rmw_subscriptions_s*, rmw_guard_conditions_s*, rmw_services_s*, rmw_clients_s*, rmw_events_s*, rmw_wait_set_s*, rmw_time_s const*)
[objective_server_node_main-8] Segmentation fault (Address not mapped to object [0xaaa1b04a5f08])
[ERROR] [objective_server_node_main-8]: process has died [pid 666, exit code -11, cmd '/opt/overlay_ws/install/moveit_studio_agent/lib/moveit_studio_agent/objective_server_node_main --ros-args --log-level error --ros-args -r __node:=objective_server_node --params-file /tmp/launch_params_j4cwbvx_ --params-file /tmp/launch_params_7oyj5lbr --params-file /tmp/launch_params_2aqbg8_q --params-file /tmp/launch_params_12q1oav_ --params-file /tmp/launch_params_h4pxzy2y --params-file /tmp/launch_params_8vn2ch2o --params-file /tmp/launch_params_1c4b41ac --params-file /tmp/launch_params_blie9quv'].

One of our engineers looked into this with a debugger and saw:

Thread 25 "objective_serve" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xffffa535e660 (LWP 74792)]
0x0000fffff6417588 in rmw_fastrtps_shared_cpp::__rmw_wait (identifier=, subscriptions=, guard_conditions=0xaaaaab4e0b30, services=, clients=, events=, wait_set=, wait_timeout=) at /opt/underlay_ws/src/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp:220
220 if (!condition->get_trigger_value()) {
(gdb) bt
#0 0x0000fffff6417588 in rmw_fastrtps_shared_cpp::__rmw_wait (
identifier=, subscriptions=,
guard_conditions=0xaaaaab4e0b30, services=,
clients=, events=, wait_set=,
wait_timeout=)
at /opt/underlay_ws/src/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp:220
#1 0x0000fffff6fc5b98 in rcl_wait (wait_set=0xffffffffa110, timeout=250000000)
at ./src/rcl/wait.c:595
#2 0x0000fffff7bf5be0 in rclcpp::Executor::wait_for_work (
this=0xffffffffa060, timeout=...) at /usr/include/c++/11/chrono:521
#3 0x0000fffff7bf8b5c in rclcpp::Executor::get_next_executable (
this=0xffffffffa060, any_executable=..., timeout=...)
at ./src/rclcpp/executor.cpp:912
#4 0x0000fffff7bfffc8 in rclcpp::executors::MultiThreadedExecutor::run (
this=0xffffffffa060, this_thread_number=)
at ./src/rclcpp/executors/multi_threaded_executor.cpp:85
#5 0x0000fffff75b31fc in ?? () from /lib/aarch64-linux-gnu/libstdc++.so.6
#6 0x0000fffff737d5c8 in ?? () from /lib/aarch64-linux-gnu/libc.so.6
#7 0x0000fffff73e5d1c in ?? () from /lib/aarch64-linux-gnu/libc.so.6`

Additional information

This segfault does not occur using Cyclone DDS.

Implementation considerations

sea-bass avatar Oct 25 '23 14:10 sea-bass

Segmentation fault (Address not mapped to object [0xaaa1b04a5f08]) 0x0000fffff6417588 in rmw_fastrtps_shared_cpp::__rmw_wait (identifier=, subscriptions=, guard_conditions=0xaaaaab4e0b30, services=, clients=, events=, wait_set=, wait_timeout=) at /opt/underlay_ws/src/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp:220 220 if (!condition->get_trigger_value()) {

https://github.com/ros2/rmw_fastrtps/blob/247e491e20ed53f1d3b33dfff78a8130c091161c/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp#L220

static_cast<eprosima::fastdds::dds::GuardCondition *>(data) is NULL, that led to the segmentation fault.

rclcpp::executors::MultiThreadedExecutor exec(rclcpp::ExecutorOptions(), 0, false, std::chrono::milliseconds(250));

could you share more conditions when this problem happens so that we can try to reproduce the issue? e.g) do you manage your own rclcpp::GuardCondition?

fujitatomoya avatar Oct 25 '23 16:10 fujitatomoya

CC: @Barry-Xu-2018 @iuhilnehc-ynos

fujitatomoya avatar Oct 25 '23 16:10 fujitatomoya

could you share more conditions when this problem happens so that we can try to reproduce the issue? e.g) do you manage your own rclcpp::GuardCondition?

Unfortunately the code is part of a private code base and we haven't been able to come up with a minimal repro case. But our main is fairly simple:

int main(int argc, char* argv[])
{
  rclcpp::init(argc, argv);
  rclcpp::NodeOptions opt;
  opt.automatically_declare_parameters_from_overrides(true);
  auto node = std::make_shared<OurNodeType>(opt);

  // This's a workaround a bug in MultiThreadedExecutor deadlock when spinning without a timeout
  rclcpp::executors::MultiThreadedExecutor exec(rclcpp::ExecutorOptions(), 0, false, std::chrono::milliseconds(250));
  exec.add_node(node->get_node_base_interface());
  exec.spin();
  exec.remove_node(node->get_node_base_interface());

  rclcpp::shutdown();
}

The node itself has an action server in it, and we notice the segfault occurring -- not consistently, but frequently -- when one action is cancelled and we try start a new one.

Another thing to note is that these actions that execute can (and often do) create their own subscribers, publishers, service/action clients, etc. So maybe that also plays into the problem?

sea-bass avatar Oct 25 '23 16:10 sea-bass

@fujitatomoya

According to current information, the problem is related to https://github.com/ros2/rclcpp/blob/8709146df87ab860c88f03a83f6df7d0347d0085/rclcpp/include/rclcpp/strategies/allocator_memory_strategy.hpp#L494 Here, shared pointer wasn't used. It is used in Executor::wait_for_work() without lock.

It can be removed under the below function

  1. Executor::wait_for_work()
  2. Executor::remove_callback_group_from_map() called by Executor::remove_callback_group() or Executor::remove_node()
  3. Executor::~Executor()

So, while GuardCondition is used in Executor::wait_for_work(), it is possible that it is removed by Executor::remove_callback_group_from_map().
I don't know why current implementation don't use shared pointer to save GuardCondition.
Besides, remove action should be in Executor::wait_for_work() and Executor::~Executor().

Barry-Xu-2018 avatar Oct 28 '23 02:10 Barry-Xu-2018

I'm just coming back to this issue and I think adding locks to the functions in AllocatorMemoryStrategy that touch the guard_conditions_ member variable seems promising. I might have time this week to take a look and land some fixes.

If this works, seems like this file hasn't been modified in a while so I'm hopeful we can backport the fix to humble.

sea-bass avatar Jan 14 '24 23:01 sea-bass

Update: I tried putting locks in both allocator_memory_strategy.cpp and executor.cpp and the chain of FastDDS issues keeps elevating up the chain with more segfaults.

I think the problem is more pervasive than just stuff at the rclcpp layer, sadly.

[planning_context_manager_node-2] terminate called after throwing an instance of 'std::out_of_range'
[planning_context_manager_node-2]   what():  vector::_M_range_check: __n (which is 2674159792) >= this->size() (which is 142)
[planning_context_manager_node-2] Stack trace (most recent call last) in thread 10468:
[move_group-1] double free or corruption (out)
[move_group-1] Stack trace (most recent call last) in thread 10409:
[waypoint_manager_node-4] corrupted size vs. prev_size
[planning_context_manager_node-2] #22   Object "", at 0xffffffffffffffff, in 
[planning_context_manager_node-2] #21   Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc537bdd84f, in 
[planning_context_manager_node-2] #20   Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc537b4bac2, in 
[planning_context_manager_node-2] #19   Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fc537ddc252, in 
[component_container_mt-8] double free or corruption (out)
[planning_context_manager_node-2] #18   Source "/opt/underlay_ws/src/rclcpp/rclcpp/src/rclcpp/signal_handler.cpp", line 271, in deferred_signal_handler [0x7fc53868087a]
[planning_context_manager_node-2]         268:             "deferred_signal_handler(): "
[planning_context_manager_node-2]         269:             "shutting down rclcpp::Context @ %p, because it had shutdown_on_signal == true",
[planning_context_manager_node-2]         270:             static_cast<void *>(context_ptr.get()));
[planning_context_manager_node-2]       > 271:           context_ptr->shutdown("signal handler");
[planning_context_manager_node-2]         272:         }
[planning_context_manager_node-2]         273:       }
[planning_context_manager_node-2]         274:     }
[planning_context_manager_node-2] #17   Source "/opt/underlay_ws/src/rclcpp/rclcpp/src/rclcpp/context.cpp", line 343, in shutdown [0x7fc5385b3ae9]
[planning_context_manager_node-2]         340:     std::lock_guard<std::recursive_mutex> guard(*logging_mutex_);
[planning_context_manager_node-2]         341:     size_t & count = get_logging_reference_count();
[planning_context_manager_node-2]         342:     if (0u == --count) {
[planning_context_manager_node-2]       > 343:       rcl_ret_t rcl_ret = rcl_logging_fini();
[planning_context_manager_node-2]         344:       if (RCL_RET_OK != rcl_ret) {
[planning_context_manager_node-2]         345:         RCUTILS_SAFE_FWRITE_TO_STDERR(
[planning_context_manager_node-2]         346:           RCUTILS_STRINGIFY(__file__) ":"
[planning_context_manager_node-2] #16   Object "/opt/ros/humble/lib/librcl.so", at 0x7fc537f4a1b6, in rcl_logging_fini
[planning_context_manager_node-2] #15   Object "/opt/ros/humble/lib/librcl.so", at 0x7fc537f49dcc, in 
[planning_context_manager_node-2] #14   Object "/opt/ros/humble/lib/librcl.so", at 0x7fc537f49521, in 
[planning_context_manager_node-2] #13   Object "/opt/ros/humble/lib/librmw_fastrtps_shared_cpp.so", at 0x7fc52220f13b, in rmw_fastrtps_shared_cpp::__rmw_destroy_publisher(char const*, rmw_node_s const*, rmw_publisher_s*)
[planning_context_manager_node-2] #12   Object "/opt/ros/humble/lib/librmw_fastrtps_shared_cpp.so", at 0x7fc52220eccb, in rmw_fastrtps_shared_cpp::destroy_publisher(char const*, CustomParticipantInfo*, rmw_publisher_s*)
[planning_context_manager_node-2] #11   Object "/opt/ros/humble/lib/libfastrtps.so.2.6.6", at 0x7fc521ce29fe, in eprosima::fastdds::dds::PublisherImpl::delete_datawriter(eprosima::fastdds::dds::DataWriter const*)
[planning_context_manager_node-2] #10   Object "/opt/ros/humble/lib/libfastrtps.so.2.6.6", at 0x7fc521cc9298, in eprosima::fastdds::dds::DataWriterImpl::~DataWriterImpl()
[planning_context_manager_node-2] #9    Object "/opt/ros/humble/lib/libfastrtps.so.2.6.6", at 0x7fc521b5de42, in 
[planning_context_manager_node-2] #8    Object "/usr/lib/x86_64-linux-gnu/libgcc_s.so.1", at 0x7fc537cf72dc, in _Unwind_Resume
[planning_context_manager_node-2] #7    Object "/usr/lib/x86_64-linux-gnu/libgcc_s.so.1", at 0x7fc537cf6883, in 
[planning_context_manager_node-2] #6    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fc537dad958, in __gxx_personality_v0
[planning_context_manager_node-2] #5    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fc537dad1e8, in 
[planning_context_manager_node-2] #4    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fc537dae20b, in 
[planning_context_manager_node-2] #3    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fc537da2b9d, in 
[planning_context_manager_node-2] #2    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc537adf7f2, in abort
[planning_context_manager_node-2] #1    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc537af9475, in raise
[planning_context_manager_node-2] #0    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc537b4d9fc, in pthread_kill
[planning_context_manager_node-2] Aborted (Signal sent by tkill() 10370 1000)

sea-bass avatar Jan 15 '24 04:01 sea-bass

Hi, I'm hitting I think the same bug.

Here's the backtrace:

#5  0x00007ffff7c998fd in operator new(unsigned long) () from /usr/lib/x86_64-linux-gnu/libjemalloc.so.2
#6  0x00007ffff6c7c753 in __gnu_cxx::new_allocator<eprosima::fastdds::dds::Condition*>::allocate (this=0x7fffec8fa6e0, __n=<optimized out>) at /usr/include/c++/11/ext/new_allocator.h:103
#7  std::allocator_traits<std::allocator<eprosima::fastdds::dds::Condition*> >::allocate (__n=<optimized out>, __a=...) at /usr/include/c++/11/bits/alloc_traits.h:464
#8  std::_Vector_base<eprosima::fastdds::dds::Condition*, std::allocator<eprosima::fastdds::dds::Condition*> >::_M_allocate (__n=<optimized out>, this=<optimized out>)
    at /usr/include/c++/11/bits/stl_vector.h:346
#9  std::vector<eprosima::fastdds::dds::Condition*, std::allocator<eprosima::fastdds::dds::Condition*> >::_M_realloc_insert<eprosima::fastdds::dds::Condition*> (this=this@entry=0x7fffec8fa6e0, 
    __position=0x0) at /usr/include/c++/11/bits/vector.tcc:440
#10 0x00007ffff6c7c256 in std::vector<eprosima::fastdds::dds::Condition*, std::allocator<eprosima::fastdds::dds::Condition*> >::emplace_back<eprosima::fastdds::dds::Condition*> (this=0x7fffec8fa6e0)
    at /usr/include/c++/11/bits/vector.tcc:121
#11 std::vector<eprosima::fastdds::dds::Condition*, std::allocator<eprosima::fastdds::dds::Condition*> >::push_back (__x=@0x7fffec8fa700: 0x7fffed1286c0, this=0x7fffec8fa6e0)
    at /usr/include/c++/11/bits/stl_vector.h:1204
#12 rmw_fastrtps_shared_cpp::__rmw_wait (identifier=<optimized out>, subscriptions=0x7ffff6945e08, guard_conditions=<optimized out>, services=0x7ffff6945e50, clients=0x7ffff6945e38, 
    events=0x7ffff6945e68, wait_set=0x7ffff69ef6a0, wait_timeout=0x7fffec8fa840) at /home/user/ros2_iron/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp:161
#13 0x00007ffff6ce1ffa in rmw_wait (subscriptions=<optimized out>, guard_conditions=<optimized out>, services=<optimized out>, clients=<optimized out>, events=<optimized out>, wait_set=<optimized out>, 
    wait_timeout=0x7fffec8fa840) at /home/user/ros2_iron/src/ros2/rmw_fastrtps/rmw_fastrtps_cpp/src/rmw_wait.cpp:33
#14 0x00007ffff79701a8 in rcl_wait (wait_set=0x7ffff04d0cc0, timeout=-1) at /home/user/ros2_iron/src/ros2/rcl/rcl/src/rcl/wait.c:595
#15 0x00007ffff7789a0b in rclcpp::Executor::wait_for_work (this=0x7ffff04d0c90, timeout=...) at /usr/include/c++/11/chrono:521
#16 0x00007ffff778a073 in rclcpp::Executor::get_next_executable (this=0x7ffff04d0c90, any_executable=..., timeout=std::chrono::duration = { -1ns })
    at /home/user/ros2_iron/src/ros2/rclcpp/rclcpp/src/rclcpp/executor.cpp:965
#17 0x00007ffff779a372 in rclcpp::executors::MultiThreadedExecutor::run (this=0x7ffff04d0c90, this_thread_number=<optimized out>)
    at /home/user/ros2_iron/src/ros2/rclcpp/rclcpp/src/rclcpp/executors/multi_threaded_executor.cpp:92
#18 0x00007ffff72e6793 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#19 0x00007ffff6e94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#20 0x00007ffff6f26660 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Any updates since January? I can trigger this bug just by spinning at something like 200 Hz an rclcpp::Timer inside a node loaded in a MultiThreadedExecutor.

Update: as per last @Barry-Xu-2018 post in this thread, I believe it's more a problem at rclcpp level because I just tried with CycloneDDS and I hit the same issue.

roncapat avatar Mar 20 '24 16:03 roncapat

@roncapat

I can trigger this bug just by spinning at something like 200 Hz an rclcpp::Timer inside a node loaded in a MultiThreadedExecutor.

can you provide the reproducible example? i cannot reproduce the issue with https://github.com/fujitatomoya/ros2_test_prover/commit/db4a5abe375cdb088e46931f843230ba558868aa

i think we need add and remove nodes to&from the executor to make it happen?

I believe it's more a problem at rclcpp level because I just tried with CycloneDDS and I hit the same issue.

thanks for checking, once it is confirmed, i guess we can move this issue to rclcpp.

I don't know why current implementation don't use shared pointer to save GuardCondition.

me neither, this is introduced on https://github.com/ros2/rclcpp/pull/1612. (in particular https://github.com/ros2/rclcpp/compare/b9bec69377dd3850c6db519a7af80a2eb1a9be31..2fc5501a09a17d286f9a68fa1e7bff4f8703cf1e) before addressing this, i want to confirm the problem 1st with reproducible environment if possible.

fujitatomoya avatar Mar 27 '24 23:03 fujitatomoya

@fujitatomoya see linked rclcpp issue: https://github.com/ros2/rclcpp/issues/2455

It all started from there for me. There is attached a minimum example.

roncapat avatar Mar 28 '24 04:03 roncapat

@sea-bass i believe that would be worth to give it a shot with current rolling branch, that includes https://github.com/ros2/rclcpp/pull/2142.

CC: @mjcarroll

fujitatomoya avatar Apr 30 '24 02:04 fujitatomoya

@sea-bass i believe that would be worth to give it a shot with current rolling branch, that includes ros2/rclcpp#2142.

CC: @mjcarroll

Thank you for the update, and for the actual fix! This was a big PR.

@ezrabrooks @mikewrock Putting this on your radar if you guys are at some point updating MoveIt Pro to 24.04 at some point and would like to try FastDDS.

sea-bass avatar Apr 30 '24 10:04 sea-bass