rmw_fastrtps Segfault in MultiThreadedExecutor using `rmw_fastrtps

trafficstars

Bug report

Required Info:

Operating System:
- Ubuntu 22.04 (inside a Docker container)
Installation type:
- Binaries
Version or commit hash:
- The latest available binaries on Humble
DDS implementation:
- rmw_fastrtps_cpp
Client library (if applicable):
- rclcpp

Steps to reproduce issue

We unfortunately don't have a full reproduction case, but we are creating a node with a multi-threaded executor as follows:

rclcpp::executors::MultiThreadedExecutor exec(rclcpp::ExecutorOptions(), 0, false, std::chrono::milliseconds(250));

The segfault we see is:

[objective_server_node_main-8] [ERROR] [1684327668.068245959] [objective_server_node]: Execution of behavior tree for objective `3 Waypoints Pick and Place` did not succeed.
[objective_server_node_main-8] Stack trace (most recent call last) in thread 833:
[objective_server_node_main-8] #8    Object "/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1", at 0xffffffffffffffff, in 
[objective_server_node_main-8] #7    Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0xffff9e5a5d1b, in 
[objective_server_node_main-8] #6    Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0xffff9e53d5c7, in 
[objective_server_node_main-8] #5    Object "/usr/lib/aarch64-linux-gnu/libstdc++.so.6.0.30", at 0xffff9e7731fb, in 
[objective_server_node_main-8] [1684327680.555]: Sequence                  IDLE -> RUNNING
[objective_server_node_main-8] [1684327680.555]: ClearSnapshot             IDLE -> RUNNING
[objective_server_node_main-8] #4    Source "./src/rclcpp/executors/multi_threaded_executor.cpp", line 85, in run [0xffff9edbffc7]
[objective_server_node_main-8] #3    Source "./src/rclcpp/executor.cpp", line 912, in get_next_executable [0xffff9edb8b5b]
[objective_server_node_main-8] #2    Source "./src/rclcpp/executor.cpp", line 756, in wait_for_work [0xffff9edb5bdf]
[objective_server_node_main-8] #1    Object "/opt/ros/humble/lib/librcl.so", at 0xffff9e185b97, in rcl_wait
[objective_server_node_main-8] #0    Object "/opt/ros/humble/lib/librmw_fastrtps_shared_cpp.so", at 0xffff9d5f5718, in rmw_fastrtps_shared_cpp::__rmw_wait(char const*, rmw_subscriptions_s*, rmw_guard_conditions_s*, rmw_services_s*, rmw_clients_s*, rmw_events_s*, rmw_wait_set_s*, rmw_time_s const*)
[objective_server_node_main-8] Segmentation fault (Address not mapped to object [0xaaa1b04a5f08])
[ERROR] [objective_server_node_main-8]: process has died [pid 666, exit code -11, cmd '/opt/overlay_ws/install/moveit_studio_agent/lib/moveit_studio_agent/objective_server_node_main --ros-args --log-level error --ros-args -r __node:=objective_server_node --params-file /tmp/launch_params_j4cwbvx_ --params-file /tmp/launch_params_7oyj5lbr --params-file /tmp/launch_params_2aqbg8_q --params-file /tmp/launch_params_12q1oav_ --params-file /tmp/launch_params_h4pxzy2y --params-file /tmp/launch_params_8vn2ch2o --params-file /tmp/launch_params_1c4b41ac --params-file /tmp/launch_params_blie9quv'].

One of our engineers looked into this with a debugger and saw:

Thread 25 "objective_serve" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xffffa535e660 (LWP 74792)]
0x0000fffff6417588 in rmw_fastrtps_shared_cpp::__rmw_wait (identifier=, subscriptions=, guard_conditions=0xaaaaab4e0b30, services=, clients=, events=, wait_set=, wait_timeout=) at /opt/underlay_ws/src/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp:220
220 if (!condition->get_trigger_value()) {
(gdb) bt
#0 0x0000fffff6417588 in rmw_fastrtps_shared_cpp::__rmw_wait (
identifier=, subscriptions=,
guard_conditions=0xaaaaab4e0b30, services=,
clients=, events=, wait_set=,
wait_timeout=)
at /opt/underlay_ws/src/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp:220
#1 0x0000fffff6fc5b98 in rcl_wait (wait_set=0xffffffffa110, timeout=250000000)
at ./src/rcl/wait.c:595
#2 0x0000fffff7bf5be0 in rclcpp::Executor::wait_for_work (
this=0xffffffffa060, timeout=...) at /usr/include/c++/11/chrono:521
#3 0x0000fffff7bf8b5c in rclcpp::Executor::get_next_executable (
this=0xffffffffa060, any_executable=..., timeout=...)
at ./src/rclcpp/executor.cpp:912
#4 0x0000fffff7bfffc8 in rclcpp::executors::MultiThreadedExecutor::run (
this=0xffffffffa060, this_thread_number=)
at ./src/rclcpp/executors/multi_threaded_executor.cpp:85
#5 0x0000fffff75b31fc in ?? () from /lib/aarch64-linux-gnu/libstdc++.so.6
#6 0x0000fffff737d5c8 in ?? () from /lib/aarch64-linux-gnu/libc.so.6
#7 0x0000fffff73e5d1c in ?? () from /lib/aarch64-linux-gnu/libc.so.6`

Additional information

This segfault does not occur using Cyclone DDS.

Implementation considerations

Oct 25 '23 14:10 sea-bass

Segmentation fault (Address not mapped to object [0xaaa1b04a5f08]) 0x0000fffff6417588 in rmw_fastrtps_shared_cpp::__rmw_wait (identifier=, subscriptions=, guard_conditions=0xaaaaab4e0b30, services=, clients=, events=, wait_set=, wait_timeout=) at /opt/underlay_ws/src/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp:220 220 if (!condition->get_trigger_value()) {

https://github.com/ros2/rmw_fastrtps/blob/247e491e20ed53f1d3b33dfff78a8130c091161c/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp#L220

static_cast<eprosima::fastdds::dds::GuardCondition *>(data) is NULL, that led to the segmentation fault.

rclcpp::executors::MultiThreadedExecutor exec(rclcpp::ExecutorOptions(), 0, false, std::chrono::milliseconds(250));

could you share more conditions when this problem happens so that we can try to reproduce the issue? e.g) do you manage your own rclcpp::GuardCondition?

Oct 25 '23 16:10 fujitatomoya

CC: @Barry-Xu-2018 @iuhilnehc-ynos

Oct 25 '23 16:10 fujitatomoya

could you share more conditions when this problem happens so that we can try to reproduce the issue? e.g) do you manage your own rclcpp::GuardCondition?

Unfortunately the code is part of a private code base and we haven't been able to come up with a minimal repro case. But our main is fairly simple:

int main(int argc, char* argv[])
{
  rclcpp::init(argc, argv);
  rclcpp::NodeOptions opt;
  opt.automatically_declare_parameters_from_overrides(true);
  auto node = std::make_shared<OurNodeType>(opt);

  // This's a workaround a bug in MultiThreadedExecutor deadlock when spinning without a timeout
  rclcpp::executors::MultiThreadedExecutor exec(rclcpp::ExecutorOptions(), 0, false, std::chrono::milliseconds(250));
  exec.add_node(node->get_node_base_interface());
  exec.spin();
  exec.remove_node(node->get_node_base_interface());

  rclcpp::shutdown();
}

The node itself has an action server in it, and we notice the segfault occurring -- not consistently, but frequently -- when one action is cancelled and we try start a new one.

Another thing to note is that these actions that execute can (and often do) create their own subscribers, publishers, service/action clients, etc. So maybe that also plays into the problem?

Oct 25 '23 16:10 sea-bass

@fujitatomoya

According to current information, the problem is related to https://github.com/ros2/rclcpp/blob/8709146df87ab860c88f03a83f6df7d0347d0085/rclcpp/include/rclcpp/strategies/allocator_memory_strategy.hpp#L494 Here, shared pointer wasn't used. It is used in Executor::wait_for_work() without lock.

It can be removed under the below function

Executor::wait_for_work()
Executor::remove_callback_group_from_map() called by Executor::remove_callback_group() or Executor::remove_node()
Executor::~Executor()

So, while GuardCondition is used in Executor::wait_for_work(), it is possible that it is removed by Executor::remove_callback_group_from_map().
I don't know why current implementation don't use shared pointer to save GuardCondition.
Besides, remove action should be in Executor::wait_for_work() and Executor::~Executor().

Oct 28 '23 02:10 Barry-Xu-2018

I'm just coming back to this issue and I think adding locks to the functions in AllocatorMemoryStrategy that touch the guard_conditions_ member variable seems promising. I might have time this week to take a look and land some fixes.

If this works, seems like this file hasn't been modified in a while so I'm hopeful we can backport the fix to humble.

Jan 14 '24 23:01 sea-bass

Update: I tried putting locks in both allocator_memory_strategy.cpp and executor.cpp and the chain of FastDDS issues keeps elevating up the chain with more segfaults.

I think the problem is more pervasive than just stuff at the rclcpp layer, sadly.

[planning_context_manager_node-2] terminate called after throwing an instance of 'std::out_of_range'
[planning_context_manager_node-2]   what():  vector::_M_range_check: __n (which is 2674159792) >= this->size() (which is 142)
[planning_context_manager_node-2] Stack trace (most recent call last) in thread 10468:
[move_group-1] double free or corruption (out)
[move_group-1] Stack trace (most recent call last) in thread 10409:
[waypoint_manager_node-4] corrupted size vs. prev_size
[planning_context_manager_node-2] #22   Object "", at 0xffffffffffffffff, in 
[planning_context_manager_node-2] #21   Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc537bdd84f, in 
[planning_context_manager_node-2] #20   Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc537b4bac2, in 
[planning_context_manager_node-2] #19   Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fc537ddc252, in 
[component_container_mt-8] double free or corruption (out)
[planning_context_manager_node-2] #18   Source "/opt/underlay_ws/src/rclcpp/rclcpp/src/rclcpp/signal_handler.cpp", line 271, in deferred_signal_handler [0x7fc53868087a]
[planning_context_manager_node-2]         268:             "deferred_signal_handler(): "
[planning_context_manager_node-2]         269:             "shutting down rclcpp::Context @ %p, because it had shutdown_on_signal == true",
[planning_context_manager_node-2]         270:             static_cast<void *>(context_ptr.get()));
[planning_context_manager_node-2]       > 271:           context_ptr->shutdown("signal handler");
[planning_context_manager_node-2]         272:         }
[planning_context_manager_node-2]         273:       }
[planning_context_manager_node-2]         274:     }
[planning_context_manager_node-2] #17   Source "/opt/underlay_ws/src/rclcpp/rclcpp/src/rclcpp/context.cpp", line 343, in shutdown [0x7fc5385b3ae9]
[planning_context_manager_node-2]         340:     std::lock_guard<std::recursive_mutex> guard(*logging_mutex_);
[planning_context_manager_node-2]         341:     size_t & count = get_logging_reference_count();
[planning_context_manager_node-2]         342:     if (0u == --count) {
[planning_context_manager_node-2]       > 343:       rcl_ret_t rcl_ret = rcl_logging_fini();
[planning_context_manager_node-2]         344:       if (RCL_RET_OK != rcl_ret) {
[planning_context_manager_node-2]         345:         RCUTILS_SAFE_FWRITE_TO_STDERR(
[planning_context_manager_node-2]         346:           RCUTILS_STRINGIFY(__file__) ":"
[planning_context_manager_node-2] #16   Object "/opt/ros/humble/lib/librcl.so", at 0x7fc537f4a1b6, in rcl_logging_fini
[planning_context_manager_node-2] #15   Object "/opt/ros/humble/lib/librcl.so", at 0x7fc537f49dcc, in 
[planning_context_manager_node-2] #14   Object "/opt/ros/humble/lib/librcl.so", at 0x7fc537f49521, in 
[planning_context_manager_node-2] #13   Object "/opt/ros/humble/lib/librmw_fastrtps_shared_cpp.so", at 0x7fc52220f13b, in rmw_fastrtps_shared_cpp::__rmw_destroy_publisher(char const*, rmw_node_s const*, rmw_publisher_s*)
[planning_context_manager_node-2] #12   Object "/opt/ros/humble/lib/librmw_fastrtps_shared_cpp.so", at 0x7fc52220eccb, in rmw_fastrtps_shared_cpp::destroy_publisher(char const*, CustomParticipantInfo*, rmw_publisher_s*)
[planning_context_manager_node-2] #11   Object "/opt/ros/humble/lib/libfastrtps.so.2.6.6", at 0x7fc521ce29fe, in eprosima::fastdds::dds::PublisherImpl::delete_datawriter(eprosima::fastdds::dds::DataWriter const*)
[planning_context_manager_node-2] #10   Object "/opt/ros/humble/lib/libfastrtps.so.2.6.6", at 0x7fc521cc9298, in eprosima::fastdds::dds::DataWriterImpl::~DataWriterImpl()
[planning_context_manager_node-2] #9    Object "/opt/ros/humble/lib/libfastrtps.so.2.6.6", at 0x7fc521b5de42, in 
[planning_context_manager_node-2] #8    Object "/usr/lib/x86_64-linux-gnu/libgcc_s.so.1", at 0x7fc537cf72dc, in _Unwind_Resume
[planning_context_manager_node-2] #7    Object "/usr/lib/x86_64-linux-gnu/libgcc_s.so.1", at 0x7fc537cf6883, in 
[planning_context_manager_node-2] #6    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fc537dad958, in __gxx_personality_v0
[planning_context_manager_node-2] #5    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fc537dad1e8, in 
[planning_context_manager_node-2] #4    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fc537dae20b, in 
[planning_context_manager_node-2] #3    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fc537da2b9d, in 
[planning_context_manager_node-2] #2    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc537adf7f2, in abort
[planning_context_manager_node-2] #1    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc537af9475, in raise
[planning_context_manager_node-2] #0    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc537b4d9fc, in pthread_kill
[planning_context_manager_node-2] Aborted (Signal sent by tkill() 10370 1000)

Jan 15 '24 04:01 sea-bass

Hi, I'm hitting I think the same bug.

Here's the backtrace:

#5  0x00007ffff7c998fd in operator new(unsigned long) () from /usr/lib/x86_64-linux-gnu/libjemalloc.so.2
#6  0x00007ffff6c7c753 in __gnu_cxx::new_allocator<eprosima::fastdds::dds::Condition*>::allocate (this=0x7fffec8fa6e0, __n=<optimized out>) at /usr/include/c++/11/ext/new_allocator.h:103
#7  std::allocator_traits<std::allocator<eprosima::fastdds::dds::Condition*> >::allocate (__n=<optimized out>, __a=...) at /usr/include/c++/11/bits/alloc_traits.h:464
#8  std::_Vector_base<eprosima::fastdds::dds::Condition*, std::allocator<eprosima::fastdds::dds::Condition*> >::_M_allocate (__n=<optimized out>, this=<optimized out>)
    at /usr/include/c++/11/bits/stl_vector.h:346
#9  std::vector<eprosima::fastdds::dds::Condition*, std::allocator<eprosima::fastdds::dds::Condition*> >::_M_realloc_insert<eprosima::fastdds::dds::Condition*> (this=this@entry=0x7fffec8fa6e0, 
    __position=0x0) at /usr/include/c++/11/bits/vector.tcc:440
#10 0x00007ffff6c7c256 in std::vector<eprosima::fastdds::dds::Condition*, std::allocator<eprosima::fastdds::dds::Condition*> >::emplace_back<eprosima::fastdds::dds::Condition*> (this=0x7fffec8fa6e0)
    at /usr/include/c++/11/bits/vector.tcc:121
#11 std::vector<eprosima::fastdds::dds::Condition*, std::allocator<eprosima::fastdds::dds::Condition*> >::push_back (__x=@0x7fffec8fa700: 0x7fffed1286c0, this=0x7fffec8fa6e0)
    at /usr/include/c++/11/bits/stl_vector.h:1204
#12 rmw_fastrtps_shared_cpp::__rmw_wait (identifier=<optimized out>, subscriptions=0x7ffff6945e08, guard_conditions=<optimized out>, services=0x7ffff6945e50, clients=0x7ffff6945e38, 
    events=0x7ffff6945e68, wait_set=0x7ffff69ef6a0, wait_timeout=0x7fffec8fa840) at /home/user/ros2_iron/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp:161
#13 0x00007ffff6ce1ffa in rmw_wait (subscriptions=<optimized out>, guard_conditions=<optimized out>, services=<optimized out>, clients=<optimized out>, events=<optimized out>, wait_set=<optimized out>, 
    wait_timeout=0x7fffec8fa840) at /home/user/ros2_iron/src/ros2/rmw_fastrtps/rmw_fastrtps_cpp/src/rmw_wait.cpp:33
#14 0x00007ffff79701a8 in rcl_wait (wait_set=0x7ffff04d0cc0, timeout=-1) at /home/user/ros2_iron/src/ros2/rcl/rcl/src/rcl/wait.c:595
#15 0x00007ffff7789a0b in rclcpp::Executor::wait_for_work (this=0x7ffff04d0c90, timeout=...) at /usr/include/c++/11/chrono:521
#16 0x00007ffff778a073 in rclcpp::Executor::get_next_executable (this=0x7ffff04d0c90, any_executable=..., timeout=std::chrono::duration = { -1ns })
    at /home/user/ros2_iron/src/ros2/rclcpp/rclcpp/src/rclcpp/executor.cpp:965
#17 0x00007ffff779a372 in rclcpp::executors::MultiThreadedExecutor::run (this=0x7ffff04d0c90, this_thread_number=<optimized out>)
    at /home/user/ros2_iron/src/ros2/rclcpp/rclcpp/src/rclcpp/executors/multi_threaded_executor.cpp:92
#18 0x00007ffff72e6793 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#19 0x00007ffff6e94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#20 0x00007ffff6f26660 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Any updates since January? I can trigger this bug just by spinning at something like 200 Hz an rclcpp::Timer inside a node loaded in a MultiThreadedExecutor.

Update: as per last @Barry-Xu-2018 post in this thread, I believe it's more a problem at rclcpp level because I just tried with CycloneDDS and I hit the same issue.

Mar 20 '24 16:03 roncapat

@roncapat

I can trigger this bug just by spinning at something like 200 Hz an rclcpp::Timer inside a node loaded in a MultiThreadedExecutor.

can you provide the reproducible example? i cannot reproduce the issue with https://github.com/fujitatomoya/ros2_test_prover/commit/db4a5abe375cdb088e46931f843230ba558868aa

i think we need add and remove nodes to&from the executor to make it happen?

I believe it's more a problem at rclcpp level because I just tried with CycloneDDS and I hit the same issue.

thanks for checking, once it is confirmed, i guess we can move this issue to rclcpp.

I don't know why current implementation don't use shared pointer to save GuardCondition.

me neither, this is introduced on https://github.com/ros2/rclcpp/pull/1612. (in particular https://github.com/ros2/rclcpp/compare/b9bec69377dd3850c6db519a7af80a2eb1a9be31..2fc5501a09a17d286f9a68fa1e7bff4f8703cf1e) before addressing this, i want to confirm the problem 1st with reproducible environment if possible.

Mar 27 '24 23:03 fujitatomoya

@fujitatomoya see linked rclcpp issue: https://github.com/ros2/rclcpp/issues/2455

It all started from there for me. There is attached a minimum example.

Mar 28 '24 04:03 roncapat

@sea-bass i believe that would be worth to give it a shot with current rolling branch, that includes https://github.com/ros2/rclcpp/pull/2142.

CC: @mjcarroll

Apr 30 '24 02:04 fujitatomoya

@sea-bass i believe that would be worth to give it a shot with current rolling branch, that includes ros2/rclcpp#2142.

CC: @mjcarroll

Thank you for the update, and for the actual fix! This was a big PR.

@ezrabrooks @mikewrock Putting this on your radar if you guys are at some point updating MoveIt Pro to 24.04 at some point and would like to try FastDDS.

Apr 30 '24 10:04 sea-bass

rmw_fastrtps
rmw_fastrtps copied to clipboard

Segfault in MultiThreadedExecutor using `rmw_fastrtps_cpp`

Bug report

Steps to reproduce issue

Additional information

Implementation considerations

rmw_fastrtps rmw_fastrtps copied to clipboard

Segfault in MultiThreadedExecutor using `rmw_fastrtps_cpp`

Bug report

Steps to reproduce issue

Additional information

Implementation considerations

rmw_fastrtps
rmw_fastrtps copied to clipboard