Segfault in MultiThreadedExecutor using `rmw_fastrtps_cpp`
Bug report
Required Info:
- Operating System:
- Ubuntu 22.04 (inside a Docker container)
- Installation type:
- Binaries
- Version or commit hash:
- The latest available binaries on Humble
- DDS implementation:
- rmw_fastrtps_cpp
- Client library (if applicable):
- rclcpp
Steps to reproduce issue
We unfortunately don't have a full reproduction case, but we are creating a node with a multi-threaded executor as follows:
rclcpp::executors::MultiThreadedExecutor exec(rclcpp::ExecutorOptions(), 0, false, std::chrono::milliseconds(250));
The segfault we see is:
[objective_server_node_main-8] [ERROR] [1684327668.068245959] [objective_server_node]: Execution of behavior tree for objective `3 Waypoints Pick and Place` did not succeed.
[objective_server_node_main-8] Stack trace (most recent call last) in thread 833:
[objective_server_node_main-8] #8 Object "/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1", at 0xffffffffffffffff, in
[objective_server_node_main-8] #7 Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0xffff9e5a5d1b, in
[objective_server_node_main-8] #6 Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0xffff9e53d5c7, in
[objective_server_node_main-8] #5 Object "/usr/lib/aarch64-linux-gnu/libstdc++.so.6.0.30", at 0xffff9e7731fb, in
[objective_server_node_main-8] [1684327680.555]: Sequence IDLE -> RUNNING
[objective_server_node_main-8] [1684327680.555]: ClearSnapshot IDLE -> RUNNING
[objective_server_node_main-8] #4 Source "./src/rclcpp/executors/multi_threaded_executor.cpp", line 85, in run [0xffff9edbffc7]
[objective_server_node_main-8] #3 Source "./src/rclcpp/executor.cpp", line 912, in get_next_executable [0xffff9edb8b5b]
[objective_server_node_main-8] #2 Source "./src/rclcpp/executor.cpp", line 756, in wait_for_work [0xffff9edb5bdf]
[objective_server_node_main-8] #1 Object "/opt/ros/humble/lib/librcl.so", at 0xffff9e185b97, in rcl_wait
[objective_server_node_main-8] #0 Object "/opt/ros/humble/lib/librmw_fastrtps_shared_cpp.so", at 0xffff9d5f5718, in rmw_fastrtps_shared_cpp::__rmw_wait(char const*, rmw_subscriptions_s*, rmw_guard_conditions_s*, rmw_services_s*, rmw_clients_s*, rmw_events_s*, rmw_wait_set_s*, rmw_time_s const*)
[objective_server_node_main-8] Segmentation fault (Address not mapped to object [0xaaa1b04a5f08])
[ERROR] [objective_server_node_main-8]: process has died [pid 666, exit code -11, cmd '/opt/overlay_ws/install/moveit_studio_agent/lib/moveit_studio_agent/objective_server_node_main --ros-args --log-level error --ros-args -r __node:=objective_server_node --params-file /tmp/launch_params_j4cwbvx_ --params-file /tmp/launch_params_7oyj5lbr --params-file /tmp/launch_params_2aqbg8_q --params-file /tmp/launch_params_12q1oav_ --params-file /tmp/launch_params_h4pxzy2y --params-file /tmp/launch_params_8vn2ch2o --params-file /tmp/launch_params_1c4b41ac --params-file /tmp/launch_params_blie9quv'].
One of our engineers looked into this with a debugger and saw:
Thread 25 "objective_serve" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xffffa535e660 (LWP 74792)]
0x0000fffff6417588 in rmw_fastrtps_shared_cpp::__rmw_wait (identifier=, subscriptions=, guard_conditions=0xaaaaab4e0b30, services=, clients=, events=, wait_set=, wait_timeout=) at /opt/underlay_ws/src/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp:220
220 if (!condition->get_trigger_value()) {
(gdb) bt
#0 0x0000fffff6417588 in rmw_fastrtps_shared_cpp::__rmw_wait (
identifier=, subscriptions=,
guard_conditions=0xaaaaab4e0b30, services=,
clients=, events=, wait_set=,
wait_timeout=)
at /opt/underlay_ws/src/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp:220
#1 0x0000fffff6fc5b98 in rcl_wait (wait_set=0xffffffffa110, timeout=250000000)
at ./src/rcl/wait.c:595
#2 0x0000fffff7bf5be0 in rclcpp::Executor::wait_for_work (
this=0xffffffffa060, timeout=...) at /usr/include/c++/11/chrono:521
#3 0x0000fffff7bf8b5c in rclcpp::Executor::get_next_executable (
this=0xffffffffa060, any_executable=..., timeout=...)
at ./src/rclcpp/executor.cpp:912
#4 0x0000fffff7bfffc8 in rclcpp::executors::MultiThreadedExecutor::run (
this=0xffffffffa060, this_thread_number=)
at ./src/rclcpp/executors/multi_threaded_executor.cpp:85
#5 0x0000fffff75b31fc in ?? () from /lib/aarch64-linux-gnu/libstdc++.so.6
#6 0x0000fffff737d5c8 in ?? () from /lib/aarch64-linux-gnu/libc.so.6
#7 0x0000fffff73e5d1c in ?? () from /lib/aarch64-linux-gnu/libc.so.6
Additional information
This segfault does not occur using Cyclone DDS.
Implementation considerations
Segmentation fault (Address not mapped to object [0xaaa1b04a5f08])
0x0000fffff6417588 in rmw_fastrtps_shared_cpp::__rmw_wait (identifier=, subscriptions=, guard_conditions=0xaaaaab4e0b30, services=, clients=, events=, wait_set=, wait_timeout=) at /opt/underlay_ws/src/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp:220
220 if (!condition->get_trigger_value()) {
https://github.com/ros2/rmw_fastrtps/blob/247e491e20ed53f1d3b33dfff78a8130c091161c/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp#L220
static_cast<eprosima::fastdds::dds::GuardCondition *>(data) is NULL, which led to the segmentation fault.
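For reference, the surrounding loop looks roughly like the sketch below (a paraphrase for discussion, not the exact upstream source): each type-erased pointer in the wait set is cast back to a Fast DDS GuardCondition and dereferenced without a null check.
// Rough paraphrase of the code around rmw_wait.cpp:220 on Humble.
for (size_t i = 0; i < guard_conditions->guard_condition_count; ++i) {
  void * data = guard_conditions->guard_conditions[i];
  auto condition = static_cast<eprosima::fastdds::dds::GuardCondition *>(data);
  if (!condition->get_trigger_value()) {  // crashes here if `condition` is NULL or dangling
    guard_conditions->guard_conditions[i] = nullptr;
  }
}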
rclcpp::executors::MultiThreadedExecutor exec(rclcpp::ExecutorOptions(), 0, false, std::chrono::milliseconds(250));
Could you share more details about the conditions under which this problem happens so that we can try to reproduce the issue? e.g., do you manage your own rclcpp::GuardCondition?
CC: @Barry-Xu-2018 @iuhilnehc-ynos
Could you share more details about the conditions under which this problem happens so that we can try to reproduce the issue? e.g., do you manage your own rclcpp::GuardCondition?
Unfortunately the code is part of a private code base and we haven't been able to come up with a minimal repro case. But our main() is fairly simple:
int main(int argc, char* argv[])
{
  rclcpp::init(argc, argv);
  rclcpp::NodeOptions opt;
  opt.automatically_declare_parameters_from_overrides(true);
  auto node = std::make_shared<OurNodeType>(opt);
  // This is a workaround for a bug where MultiThreadedExecutor deadlocks when spinning without a timeout
  rclcpp::executors::MultiThreadedExecutor exec(rclcpp::ExecutorOptions(), 0, false, std::chrono::milliseconds(250));
  exec.add_node(node->get_node_base_interface());
  exec.spin();
  exec.remove_node(node->get_node_base_interface());
  rclcpp::shutdown();
}
The node itself has an action server in it, and we notice the segfault occurring -- not consistently, but frequently -- when one action is cancelled and we try to start a new one.
Another thing to note is that these actions that execute can (and often do) create their own subscribers, publishers, service/action clients, etc. So maybe that also plays into the problem?
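To make that concrete, the pattern looks roughly like this hypothetical sketch (names and message types invented, not our actual code): entities get created at runtime while the MultiThreadedExecutor is already spinning the node on several threads.
#include <chrono>
#include <memory>
#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/string.hpp>

using namespace std::chrono_literals;

// Hypothetical node: a timer callback repeatedly creates and destroys a
// publisher, mimicking actions that spin up their own entities mid-execution.
class ChurnNode : public rclcpp::Node
{
public:
  ChurnNode() : rclcpp::Node("churn_node")
  {
    timer_ = create_wall_timer(5ms, [this]() {
        // Entity creation/destruction happens while other executor threads
        // may be inside wait_for_work() on the same node.
        auto pub = create_publisher<std_msgs::msg::String>("chatter", 10);
        pub->publish(std_msgs::msg::String());
      });
  }

private:
  rclcpp::TimerBase::SharedPtr timer_;
};

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  rclcpp::executors::MultiThreadedExecutor exec(
    rclcpp::ExecutorOptions(), 0, false, 250ms);
  auto node = std::make_shared<ChurnNode>();
  exec.add_node(node);
  exec.spin();
  rclcpp::shutdown();
  return 0;
}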
@fujitatomoya
Based on the information so far, the problem seems to be related to https://github.com/ros2/rclcpp/blob/8709146df87ab860c88f03a83f6df7d0347d0085/rclcpp/include/rclcpp/strategies/allocator_memory_strategy.hpp#L494. Here, a shared pointer is not used; the raw GuardCondition pointer is used in Executor::wait_for_work() without a lock.
It can be removed by any of the functions below:
- Executor::wait_for_work()
- Executor::remove_callback_group_from_map() called by Executor::remove_callback_group() or Executor::remove_node()
- Executor::~Executor()
So, while the GuardCondition is being used in Executor::wait_for_work(), it is possible for it to be removed by Executor::remove_callback_group_from_map().
I don't know why the current implementation doesn't use a shared pointer to store the GuardCondition.
Besides, remove action should be in Executor::wait_for_work() and Executor::~Executor().
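A deliberately simplified model of the suspected race is sketched below (plain C++, not the actual rclcpp code): one thread iterates a vector of raw pointers, like wait_for_work() does, while another thread clears the container and destroys the object, like remove_callback_group_from_map() does. It is expected to crash or misbehave, which is the point.
#include <atomic>
#include <thread>
#include <vector>

struct FakeGuardCondition { std::atomic<bool> triggered{false}; };

int main()
{
  std::vector<FakeGuardCondition *> guard_conditions;
  auto * gc = new FakeGuardCondition();
  guard_conditions.push_back(gc);

  std::thread waiter([&]() {
    for (int i = 0; i < 1000000; ++i) {
      for (auto * c : guard_conditions) {   // no lock, raw pointer
        (void)c->triggered.load();          // may dereference freed memory
      }
    }
  });

  std::thread remover([&]() {
    guard_conditions.clear();               // concurrent modification, no lock
    delete gc;                              // freed while waiter may still use it
  });

  waiter.join();
  remover.join();
  // Storing std::shared_ptr in the vector, or taking a mutex around both
  // accesses, would remove the use-after-free.
}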
I'm just coming back to this issue, and adding locks to the functions in AllocatorMemoryStrategy that touch the guard_conditions_ member variable seems promising. I might have time this week to take a look and land some fixes.
If this works, it seems like this file hasn't been modified in a while, so I'm hopeful we can backport the fix to Humble.
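For concreteness, the shape of the change I have in mind is something like the simplified sketch below (not the real AllocatorMemoryStrategy interface, just the idea of serializing every access to the container behind one mutex).
#include <algorithm>
#include <mutex>
#include <vector>

// Simplified stand-in for the guard-condition bookkeeping: every method that
// reads or mutates the container takes the same mutex, so a reader cannot
// observe an entry that a concurrent remove is erasing.
class GuardConditionRegistry
{
public:
  void add(const void * gc)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    guard_conditions_.push_back(gc);
  }

  void remove(const void * gc)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    guard_conditions_.erase(
      std::remove(guard_conditions_.begin(), guard_conditions_.end(), gc),
      guard_conditions_.end());
  }

  template<typename Fn>
  void for_each(Fn && fn) const
  {
    std::lock_guard<std::mutex> lock(mutex_);
    for (const auto * gc : guard_conditions_) {
      fn(gc);
    }
  }

private:
  mutable std::mutex mutex_;
  std::vector<const void *> guard_conditions_;
};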
Update: I tried putting locks in both allocator_memory_strategy.cpp and executor.cpp, but the FastDDS issues keep moving further up the chain with more segfaults.
I think the problem is more pervasive than just stuff at the rclcpp layer, sadly.
[planning_context_manager_node-2] terminate called after throwing an instance of 'std::out_of_range'
[planning_context_manager_node-2] what(): vector::_M_range_check: __n (which is 2674159792) >= this->size() (which is 142)
[planning_context_manager_node-2] Stack trace (most recent call last) in thread 10468:
[move_group-1] double free or corruption (out)
[move_group-1] Stack trace (most recent call last) in thread 10409:
[waypoint_manager_node-4] corrupted size vs. prev_size
[planning_context_manager_node-2] #22 Object "", at 0xffffffffffffffff, in
[planning_context_manager_node-2] #21 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc537bdd84f, in
[planning_context_manager_node-2] #20 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc537b4bac2, in
[planning_context_manager_node-2] #19 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fc537ddc252, in
[component_container_mt-8] double free or corruption (out)
[planning_context_manager_node-2] #18 Source "/opt/underlay_ws/src/rclcpp/rclcpp/src/rclcpp/signal_handler.cpp", line 271, in deferred_signal_handler [0x7fc53868087a]
[planning_context_manager_node-2] 268: "deferred_signal_handler(): "
[planning_context_manager_node-2] 269: "shutting down rclcpp::Context @ %p, because it had shutdown_on_signal == true",
[planning_context_manager_node-2] 270: static_cast<void *>(context_ptr.get()));
[planning_context_manager_node-2] > 271: context_ptr->shutdown("signal handler");
[planning_context_manager_node-2] 272: }
[planning_context_manager_node-2] 273: }
[planning_context_manager_node-2] 274: }
[planning_context_manager_node-2] #17 Source "/opt/underlay_ws/src/rclcpp/rclcpp/src/rclcpp/context.cpp", line 343, in shutdown [0x7fc5385b3ae9]
[planning_context_manager_node-2] 340: std::lock_guard<std::recursive_mutex> guard(*logging_mutex_);
[planning_context_manager_node-2] 341: size_t & count = get_logging_reference_count();
[planning_context_manager_node-2] 342: if (0u == --count) {
[planning_context_manager_node-2] > 343: rcl_ret_t rcl_ret = rcl_logging_fini();
[planning_context_manager_node-2] 344: if (RCL_RET_OK != rcl_ret) {
[planning_context_manager_node-2] 345: RCUTILS_SAFE_FWRITE_TO_STDERR(
[planning_context_manager_node-2] 346: RCUTILS_STRINGIFY(__file__) ":"
[planning_context_manager_node-2] #16 Object "/opt/ros/humble/lib/librcl.so", at 0x7fc537f4a1b6, in rcl_logging_fini
[planning_context_manager_node-2] #15 Object "/opt/ros/humble/lib/librcl.so", at 0x7fc537f49dcc, in
[planning_context_manager_node-2] #14 Object "/opt/ros/humble/lib/librcl.so", at 0x7fc537f49521, in
[planning_context_manager_node-2] #13 Object "/opt/ros/humble/lib/librmw_fastrtps_shared_cpp.so", at 0x7fc52220f13b, in rmw_fastrtps_shared_cpp::__rmw_destroy_publisher(char const*, rmw_node_s const*, rmw_publisher_s*)
[planning_context_manager_node-2] #12 Object "/opt/ros/humble/lib/librmw_fastrtps_shared_cpp.so", at 0x7fc52220eccb, in rmw_fastrtps_shared_cpp::destroy_publisher(char const*, CustomParticipantInfo*, rmw_publisher_s*)
[planning_context_manager_node-2] #11 Object "/opt/ros/humble/lib/libfastrtps.so.2.6.6", at 0x7fc521ce29fe, in eprosima::fastdds::dds::PublisherImpl::delete_datawriter(eprosima::fastdds::dds::DataWriter const*)
[planning_context_manager_node-2] #10 Object "/opt/ros/humble/lib/libfastrtps.so.2.6.6", at 0x7fc521cc9298, in eprosima::fastdds::dds::DataWriterImpl::~DataWriterImpl()
[planning_context_manager_node-2] #9 Object "/opt/ros/humble/lib/libfastrtps.so.2.6.6", at 0x7fc521b5de42, in
[planning_context_manager_node-2] #8 Object "/usr/lib/x86_64-linux-gnu/libgcc_s.so.1", at 0x7fc537cf72dc, in _Unwind_Resume
[planning_context_manager_node-2] #7 Object "/usr/lib/x86_64-linux-gnu/libgcc_s.so.1", at 0x7fc537cf6883, in
[planning_context_manager_node-2] #6 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fc537dad958, in __gxx_personality_v0
[planning_context_manager_node-2] #5 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fc537dad1e8, in
[planning_context_manager_node-2] #4 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fc537dae20b, in
[planning_context_manager_node-2] #3 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fc537da2b9d, in
[planning_context_manager_node-2] #2 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc537adf7f2, in abort
[planning_context_manager_node-2] #1 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc537af9475, in raise
[planning_context_manager_node-2] #0 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc537b4d9fc, in pthread_kill
[planning_context_manager_node-2] Aborted (Signal sent by tkill() 10370 1000)
Hi, I think I'm hitting the same bug.
Here's the backtrace:
#5 0x00007ffff7c998fd in operator new(unsigned long) () from /usr/lib/x86_64-linux-gnu/libjemalloc.so.2
#6 0x00007ffff6c7c753 in __gnu_cxx::new_allocator<eprosima::fastdds::dds::Condition*>::allocate (this=0x7fffec8fa6e0, __n=<optimized out>) at /usr/include/c++/11/ext/new_allocator.h:103
#7 std::allocator_traits<std::allocator<eprosima::fastdds::dds::Condition*> >::allocate (__n=<optimized out>, __a=...) at /usr/include/c++/11/bits/alloc_traits.h:464
#8 std::_Vector_base<eprosima::fastdds::dds::Condition*, std::allocator<eprosima::fastdds::dds::Condition*> >::_M_allocate (__n=<optimized out>, this=<optimized out>)
at /usr/include/c++/11/bits/stl_vector.h:346
#9 std::vector<eprosima::fastdds::dds::Condition*, std::allocator<eprosima::fastdds::dds::Condition*> >::_M_realloc_insert<eprosima::fastdds::dds::Condition*> (this=this@entry=0x7fffec8fa6e0,
__position=0x0) at /usr/include/c++/11/bits/vector.tcc:440
#10 0x00007ffff6c7c256 in std::vector<eprosima::fastdds::dds::Condition*, std::allocator<eprosima::fastdds::dds::Condition*> >::emplace_back<eprosima::fastdds::dds::Condition*> (this=0x7fffec8fa6e0)
at /usr/include/c++/11/bits/vector.tcc:121
#11 std::vector<eprosima::fastdds::dds::Condition*, std::allocator<eprosima::fastdds::dds::Condition*> >::push_back (__x=@0x7fffec8fa700: 0x7fffed1286c0, this=0x7fffec8fa6e0)
at /usr/include/c++/11/bits/stl_vector.h:1204
#12 rmw_fastrtps_shared_cpp::__rmw_wait (identifier=<optimized out>, subscriptions=0x7ffff6945e08, guard_conditions=<optimized out>, services=0x7ffff6945e50, clients=0x7ffff6945e38,
events=0x7ffff6945e68, wait_set=0x7ffff69ef6a0, wait_timeout=0x7fffec8fa840) at /home/user/ros2_iron/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp:161
#13 0x00007ffff6ce1ffa in rmw_wait (subscriptions=<optimized out>, guard_conditions=<optimized out>, services=<optimized out>, clients=<optimized out>, events=<optimized out>, wait_set=<optimized out>,
wait_timeout=0x7fffec8fa840) at /home/user/ros2_iron/src/ros2/rmw_fastrtps/rmw_fastrtps_cpp/src/rmw_wait.cpp:33
#14 0x00007ffff79701a8 in rcl_wait (wait_set=0x7ffff04d0cc0, timeout=-1) at /home/user/ros2_iron/src/ros2/rcl/rcl/src/rcl/wait.c:595
#15 0x00007ffff7789a0b in rclcpp::Executor::wait_for_work (this=0x7ffff04d0c90, timeout=...) at /usr/include/c++/11/chrono:521
#16 0x00007ffff778a073 in rclcpp::Executor::get_next_executable (this=0x7ffff04d0c90, any_executable=..., timeout=std::chrono::duration = { -1ns })
at /home/user/ros2_iron/src/ros2/rclcpp/rclcpp/src/rclcpp/executor.cpp:965
#17 0x00007ffff779a372 in rclcpp::executors::MultiThreadedExecutor::run (this=0x7ffff04d0c90, this_thread_number=<optimized out>)
at /home/user/ros2_iron/src/ros2/rclcpp/rclcpp/src/rclcpp/executors/multi_threaded_executor.cpp:92
#18 0x00007ffff72e6793 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#19 0x00007ffff6e94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#20 0x00007ffff6f26660 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Any updates since January? I can trigger this bug just by spinning an rclcpp::Timer at something like 200 Hz inside a node loaded in a MultiThreadedExecutor.
Update: as per @Barry-Xu-2018's last post in this thread, I believe this is more of a problem at the rclcpp level, because I just tried with CycloneDDS and hit the same issue.
@roncapat
I can trigger this bug just by spinning an rclcpp::Timer at something like 200 Hz inside a node loaded in a MultiThreadedExecutor.
Can you provide a reproducible example? I cannot reproduce the issue with https://github.com/fujitatomoya/ros2_test_prover/commit/db4a5abe375cdb088e46931f843230ba558868aa
I think we need to add and remove nodes to/from the executor to make it happen? Something along the lines of the untested sketch below (invented names) is what I have in mind.
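// Untested sketch of the hypothesis above: keep the MultiThreadedExecutor
// spinning while another thread repeatedly adds and removes a node, so the
// executor's guard-condition bookkeeping is mutated while wait_for_work()
// is running.
#include <chrono>
#include <memory>
#include <string>
#include <thread>
#include <rclcpp/rclcpp.hpp>

using namespace std::chrono_literals;

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  rclcpp::executors::MultiThreadedExecutor exec(
    rclcpp::ExecutorOptions(), 0, false, 250ms);

  std::thread churn([&exec]() {
    for (int i = 0; i < 10000 && rclcpp::ok(); ++i) {
      auto node = std::make_shared<rclcpp::Node>("churn_node_" + std::to_string(i));
      exec.add_node(node);
      std::this_thread::sleep_for(1ms);
      exec.remove_node(node);
    }
  });

  exec.spin();  // Ctrl-C to stop; the churn thread exits after its loop
  churn.join();
  rclcpp::shutdown();
  return 0;
}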
I believe this is more of a problem at the rclcpp level, because I just tried with CycloneDDS and hit the same issue.
Thanks for checking. Once it is confirmed, I guess we can move this issue to rclcpp.
I don't know why the current implementation doesn't use a shared pointer to store the GuardCondition.
Me neither; this was introduced in https://github.com/ros2/rclcpp/pull/1612 (in particular https://github.com/ros2/rclcpp/compare/b9bec69377dd3850c6db519a7af80a2eb1a9be31..2fc5501a09a17d286f9a68fa1e7bff4f8703cf1e). Before addressing this, I want to confirm the problem first with a reproducible environment if possible.
@fujitatomoya see linked rclcpp issue: https://github.com/ros2/rclcpp/issues/2455
It all started from there for me. A minimal example is attached there.
@sea-bass I believe it would be worth giving this a shot with the current rolling branch, which includes https://github.com/ros2/rclcpp/pull/2142.
CC: @mjcarroll
@sea-bass I believe it would be worth giving this a shot with the current rolling branch, which includes ros2/rclcpp#2142.
CC: @mjcarroll
Thank you for the update, and for the actual fix! This was a big PR.
@ezrabrooks @mikewrock Putting this on your radar in case you are at some point updating MoveIt Pro to 24.04 and would like to try FastDDS.