rmw_fastrtps
Segmentation fault when adding a short-lived node to an executor
Bug report
Description
I am trying to set up an executor "spinner thread" that will spin a specified rclcpp::Executor until commanded to stop. My goal is to construct the executor and spinner thread in one scope then add nodes to that executor in another scope. However, the rclcpp::Executor classes cause a segmentation fault under certain conditions. There appears to be a race condition in the RMW layer that manifests with short-lived nodes.
- Operating System: Ubuntu 22.04 (Docker container); Ubuntu 22.04 host
- Installation type: Binaries
- Version or commit hash:
  - ros-humble-rclcpp: 16.0.10
  - ros-humble-rmw-fastrtps-cpp: 6.2.7
  - ros-humble-fastrtps: 2.6.8
  - ros-humble-rmw-cyclonedds-cpp: 0.10.4
  - ros-humble-cyclonedds: 1.3.4
- DDS implementation:
  - Fast-DDS (Fast-RTPS)
  - Cyclone DDS
- Client library (if applicable): rclcpp
- Compiler:
  - Clang 14
  - GCC 11.4.0
Steps to reproduce issue
Minimal reproducible example
#include <atomic>
#include <memory>
#include <thread>
#include <rclcpp/executors/multi_threaded_executor.hpp>
#include <rclcpp/node.hpp>
auto function() -> void
{
  rclcpp::init(0, nullptr);
  rclcpp::executors::MultiThreadedExecutor executor;
  std::atomic_bool stop_spinning{false};
  auto spinner_thread{std::thread([&] {
    while (rclcpp::ok() && !stop_spinning.load()) {
      executor.spin_some();
    }
  })};
  {
    auto node{std::make_shared<rclcpp::Node>("node")};
    executor.add_node(node);
  }
  stop_spinning.store(true);
  spinner_thread.join();
  rclcpp::shutdown();
}

auto main() -> int
{
  while (true) {
    function();
  }
  return 0;
}
Steps
- Compile and run the above example.
- Wait for a segmentation fault error. It usually happens within a minute.
Expected behavior
The program runs indefinitely.
Actual behavior
The program exits due to a segmentation fault.
Program output when compiled with Clang
Segmentation fault (core dumped)
Program output when compiled with GCC
Segmentation fault (core dumped)
In (only) one of the many runs I did, I got the following output (when compiled with GCC):
terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
what(): failed to add guard condition to wait set: guard condition implementation is invalid, at ./src/rcl/guard_condition.c:172, at ./src/rcl/wait.c:460
Aborted (core dumped)
Additional information
Relevant core dump
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f4154232e3f in rmw_fastrtps_shared_cpp::__rmw_wait(char const*, rmw_subscriptions_s*, rmw_guard_conditions_s*, rmw_services_s*, rmw_clients_s*, rmw_events_s*, rmw_wait_set_s*, rmw_time_s const*) ()
from /opt/ros/humble/lib/librmw_fastrtps_shared_cpp.so
[Current thread is 1 (Thread 0x7f41531e8640 (LWP 1202848))]
(gdb) bt
#0 0x00007f4154232e3f in rmw_fastrtps_shared_cpp::__rmw_wait(char const*, rmw_subscriptions_s*, rmw_guard_conditions_s*, rmw_services_s*, rmw_clients_s*, rmw_events_s*, rmw_wait_set_s*, rmw_time_s const*) ()
at /opt/ros/humble/lib/librmw_fastrtps_shared_cpp.so
#1 0x00007f415428a887 in rmw_wait(rmw_subscriptions_t*, rmw_guard_conditions_t*, rmw_services_t*, rmw_clients_t*, rmw_events_t*, rmw_wait_set_t*, rmw_time_t const*)
(subscriptions=<optimized out>, guard_conditions=<optimized out>, services=<optimized out>, clients=<optimized out>, events=<optimized out>, wait_set=<optimized out>, wait_timeout=0x7f41531d5510) at ./src/rmw_wait.cpp:33
#2 0x00007f42759b1848 in rcl_wait () at /opt/ros/humble/lib/librcl.so
#3 0x00007f4274f2470c in rclcpp::Executor::wait_for_work(std::chrono::duration<long, std::ratio<1l, 1000000000l> >) (this=0x7ffdbe823c50, timeout=...) at /usr/include/c++/11/chrono:521
#4 0x00007f4274f28853 in rclcpp::Executor::spin_some_impl(std::chrono::duration<long, std::ratio<1l, 1000000000l> >, bool) (this=0x7ffdbe823c50, max_duration=..., exhaustive=false) at /usr/include/c++/11/chrono:537
#5 0x000063ca4387dbde in function()::$_0::operator()() const (this=0x63ca44fc84e8) at <redacted_executable_path>:17
#6 0x000063ca4387db1d in std::__invoke_impl<void, function()::$_0>(std::__invoke_other, function()::$_0&&) (__f=...) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:61
#7 0x000063ca4387daad in std::__invoke<function()::$_0>(function()::$_0&&) (__fn=...) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:96
#8 0x000063ca4387da85 in std::thread::_Invoker<std::tuple<function()::$_0> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=0x63ca44fc84e8) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:259
#9 0x000063ca4387da55 in std::thread::_Invoker<std::tuple<function()::$_0> >::operator()() (this=0x63ca44fc84e8) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:266
#10 0x000063ca4387d979 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<function()::$_0> > >::_M_run() (this=0x63ca44fc84e0) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:211
#11 0x00007f4273705253 in () at /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#12 0x00007f427338dac3 in () at /usr/lib/x86_64-linux-gnu/libc.so.6
#13 0x00007f427341f850 in () at /usr/lib/x86_64-linux-gnu/libc.so.6
Other notes
- The issue occurs when using either the SingleThreadedExecutor or the MultiThreadedExecutor.
- No segmentation fault occurs* when relocating node to the outer scope:
  auto node{std::make_shared<rclcpp::Node>("node")};
  {
    executor.add_node(node);
  }
- No segmentation fault occurs* when using Cyclone DDS.
- No segmentation fault occurs* when sleeping the main thread for a short duration before adding node to the executor:
  {
    auto node{std::make_shared<rclcpp::Node>("node")};
    std::this_thread::sleep_for(std::chrono::seconds(1));
    executor.add_node(node);
  }
* I concluded a segmentation fault was unlikely after running the program for several minutes.
I am not sure if this is fixed by a specific PR, but this problem cannot be observed with Rolling.
Can you try removing the node from the executor before it goes out of scope?
I am not sure if this is fixed by a specific PR, but this problem cannot be observed with Rolling.
@fujitatomoya I ran the example code in the latest Rolling Docker image (rolling-ros-base, image ID/digest f875aa41a6f0) and got a segfault there too.
Container start command:
docker run --rm -it ros:rolling-ros-base
The only changes I made inside the container were:
- updating apt's package list (not upgrading the packages themselves)
- installing Vim
- creating a workspace at /workspace/src
- creating a C++ package with ros2 pkg create
- adding the example code from above
Is there anything else I can provide to help pinpoint where the issue might be?
Can you try removing the node from the executor before it goes out of scope?
@christophebedard I made the following change to the example.
{
  auto node{std::make_shared<rclcpp::Node>("node")};
  executor.add_node(node);
  executor.remove_node(node);
}
The program ran fine for over 10 minutes on Humble with Fast DDS.
I think it's reasonable to expect the user to remove the node from the executor before it goes out of scope/gets destroyed. See this test which does something similar to your reproducer: https://github.com/ros2/rclcpp/blob/2f71d6e249f626772da3f8a1bb7c8d141d9d0d52/rclcpp/test/rclcpp/test_executor.cpp#L75-L102
I'm not sure if this would cover your full use-case, so please let us know.
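If it helps for the spinner-thread use case, here is a rough sketch of how the "remove before destruction" requirement could be made automatic with a small RAII wrapper. ScopedNode is a hypothetical helper written for illustration; it is not something provided by rclcpp:

#include <memory>
#include <utility>

#include <rclcpp/executor.hpp>
#include <rclcpp/node.hpp>

// Hypothetical helper (not part of rclcpp): adds a node to an executor on
// construction and removes it in the destructor, so the node is always
// detached from the executor before the node itself is destroyed.
class ScopedNode
{
public:
  ScopedNode(rclcpp::Executor & executor, std::shared_ptr<rclcpp::Node> node)
  : executor_(executor), node_(std::move(node))
  {
    executor_.add_node(node_);
  }

  ~ScopedNode()
  {
    executor_.remove_node(node_);
  }

  rclcpp::Node & operator*() {return *node_;}

private:
  rclcpp::Executor & executor_;
  std::shared_ptr<rclcpp::Node> node_;
};

The inner scope of the original reproducer would then look like this, with remove_node() running before the node is destroyed:

{
  ScopedNode node{executor, std::make_shared<rclcpp::Node>("node")};
}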
@adamlm thanks for checking this! Actually, what I tried was a source build of Rolling, so maybe the latest patches in the Rolling source (not yet available as packages) fix this issue.
We're having a similar issue when creating a short-lived subscriber. I think it may be the same since:
- We're creating and attaching the subscriber within the scope of a method (specifically an action callback)
- The gdb backtrace is very similar (crashes within the same place in rclcpp / rmw)
- The method call is very quick (and adding a sleep within it seems to prevent the crash)
- We occasionally get the same rclcpp::exceptions::RCLError
We were originally upgrading from Noetic to Humble. I noticed https://github.com/ros2/rmw_fastrtps/issues/728, which seemed to be resolved by https://github.com/ros2/rclcpp/pull/2142. That PR appears to be in Jazzy; however, upgrading to Jazzy has not fixed the issue for us.
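For illustration, the pattern reduced to a standalone example looks roughly like the sketch below. The topic name, message type (std_msgs/msg/String), and overall structure are simplified assumptions for this sketch rather than our actual code, which creates the subscription inside an action callback:

#include <atomic>
#include <memory>
#include <thread>
#include <rclcpp/executors/multi_threaded_executor.hpp>
#include <rclcpp/node.hpp>
#include <std_msgs/msg/string.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  rclcpp::executors::MultiThreadedExecutor executor;
  auto node = std::make_shared<rclcpp::Node>("node");
  executor.add_node(node);

  std::atomic_bool stop_spinning{false};
  std::thread spinner_thread([&] {
    while (rclcpp::ok() && !stop_spinning.load()) {
      executor.spin_some();
    }
  });

  // The short-lived part: a subscription is created inside a narrow scope,
  // used briefly, and destroyed while the executor may still be waiting on it.
  {
    auto subscription = node->create_subscription<std_msgs::msg::String>(
      "topic", 10, [](std_msgs::msg::String::ConstSharedPtr /*msg*/) {});
    // ... briefly wait for data here, then return ...
  }

  stop_spinning.store(true);
  spinner_thread.join();
  rclcpp::shutdown();
  return 0;
}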