rclcpp icon indicating copy to clipboard operation
rclcpp copied to clipboard

Segment fault occurred when node was added to executor if the callback group variable created was local

Open Chaochao0215 opened this issue 1 year ago • 13 comments
trafficstars

ROS2

Bug report

Required Info:

  • Operating System:
    • Ubuntu 22.04
  • Installation type:
    • from source
  • Version or commit hash:
    • humble release patch 5
  • DDS implementation:
    • Fast-RTPS
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

Create a node and call the create_callback_group interface to create a callback group local variable, then add the node to the executor and spin the executor, segment fault occuered sometimes.

#include <memory>
#include <thread>
#include <string>
#include <vector>

#include "rclcpp/executors.hpp"
#include "rclcpp/node.hpp"
#include "rclcpp/timer.hpp"
#include "std_msgs/msg/string.hpp"

int main(int argc, char **argv) {
  rclcpp::init(argc, argv);

  auto node = std::make_shared<rclcpp::Node>("test_callback_group");
  auto sub = node->create_subscription<std_msgs::msg::String>(
      "/test_callback_group", 10,
      [](const std_msgs::msg::String::ConstSharedPtr) {});
  rclcpp::executors::MultiThreadedExecutor executor(rclcpp::ExecutorOptions(),
                                                    3u);
  std::thread thr;
  {
    auto group = node->create_callback_group(
        rclcpp::CallbackGroupType::MutuallyExclusive);
    executor.add_node(node);
    thr = std::thread([&executor]() { executor.spin(); });
    std::cout << "wait for executor running..." << std::endl;
  } // executor wait on dds and then destory the callback group

  while (rclcpp::ok()) {
    std::this_thread::sleep_for(std::chrono::seconds(1));
  }
  rclcpp::shutdown();

  if (thr.joinable()) {
    thr.join();
  }

  rclcpp::shutdown();
  return 0;
} 

#### Expected behavior
process runs normally
#### Actual behavior
segment fault
#### Additional information
It seems to be related to the lifetime of the condition in the callback group.

Chaochao0215 avatar Mar 05 '24 07:03 Chaochao0215

Hello @Chaochao0215, maybe we could build a debug type of this source code and use gdb to backtrace when it crashed

Zard-C avatar Mar 05 '24 08:03 Zard-C

I tried your code, but I cannot find segment fault.
My test environment is based on ros2 Debian package (ros-humble-rclcpp 16.0.8-1jammy.20240217.065743).

Barry-Xu-2018 avatar Mar 05 '24 09:03 Barry-Xu-2018

Hello @Chaochao0215, maybe we could build a debug type of this source code and use gdb to backtrace when it crashed Glad to receive your reply, this issue reoccuered and i pasted the backtrace information HowHgpAxRD

Chaochao0215 avatar Mar 05 '24 09:03 Chaochao0215

I tried your code and I can't reproduce the segment fault. However, in gdb backtrace it seems like a multi-thread problem, rebuild your program with

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsanitize=thread -g")

and check if there's data race in ThreadSanitizer report.

Zard-C avatar Mar 05 '24 16:03 Zard-C

I tried your code and I can't reproduce the segment fault. However, in gdb backtrace it seems like a multi-thread problem, rebuild your program with

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsanitize=thread -g")

and check if there's data race in ThreadSanitizer report.

hi, I found a quick way to reproduce the problem

  1. set breakpoint 1 at line 21 of the test file
  2. set breakpoint 2 at line 25 of the test file
  3. set breakpoint 3 at " rcl_ret_t status = rcl_wait(&wait_set_, std::chrono::duration_caststd::chrono::nanoseconds(timeout).count());" in executor.cpp
  4. set gdb to not stop when hit breakpoints , then run the test
  5. after breakpoints 1 && 3 both hit , continue the main thread and wait to hit breakpoint 2
  6. after breakpoint 2 hit, send the message subscribed in the new shell terminal, such as 'ros2 topic pub /test_callback_group std_msgs/msg/String "data: hello"' and continue the thread interrupted at breakpoint 3 , segment fault should occurred

Chaochao0215 avatar Mar 06 '24 02:03 Chaochao0215

6. send the message subscribed in the new shell terminal, such as 'ros2 topic pub /test_callback_group std_msgs/msg/String "data: hello"

I missed this step.

Zard-C avatar Mar 06 '24 09:03 Zard-C

  1. send the message subscribed in the new shell terminal, such as 'ros2 topic pub /test_callback_group std_msgs/msg/String "data: hello"

I missed this step.

I'm sorry i didn't specify the need to publish message at the beginning, hope you reproduce it successfully.

Chaochao0215 avatar Mar 06 '24 12:03 Chaochao0215

I tried your code, but I cannot find segment fault. My test environment is based on ros2 Debian package (ros-humble-rclcpp 16.0.8-1jammy.20240217.065743). Thank you for your attention, it occurs with low probability and I built ROS2 from source so that it can be controlled through gdb

Chaochao0215 avatar Mar 07 '24 02:03 Chaochao0215

@Chaochao0215 it would be helpful if you could provide self-contained reproducible test.

fujitatomoya avatar Mar 14 '24 21:03 fujitatomoya

@Chaochao0215 Did you find the cause of this problem in the end? How did you solve it? I also encountered the same problem when using MultiThreadedExecutor, The program stack is as follows

Program terminated with signal SIGSEGV, Segmentation fault. #0 0x0000007f874cdac4 in ?? () from /opt/ros/humble/lib/librmw_cyclonedds_cpp.so [Current thread is 1 (Thread 0x7f5f7ee8e0 (LWP 402))] (gdb) bt #0 0x0000007f874cdac4 in ?? () from /opt/ros/humble/lib/librmw_cyclonedds_cpp.so #1 0x0000007f874cfd50 in rmw_wait () from /opt/ros/humble/lib/librmw_cyclonedds_cpp.so #2 0x0000007f87f16c38 in rcl_wait () from /opt/ros/humble/lib/librcl.so #3 0x0000007f88266850 in rclcpp::Executor::wait_for_work(std::chrono::duration<long, std::ratio<1l, 1000000000l> >) () from /opt/ros/humble/lib/librclcpp.so #4 0x0000007f8826989c in rclcpp::Executor::get_next_executable(rclcpp::AnyExecutable&, std::chrono::duration<long, std::ratio<1l, 1000000000l> >) () from /opt/ros/humble/lib/librclcpp.so #5 0x0000007f88270d38 in rclcpp::executors::MultiThreadedExecutor::run(unsigned long) () from /opt/ros/humble/lib/librclcpp.so #6 0x0000007f87d531fc in ?? () from /lib/aarch64-linux-gnu/libstdc++.so.6 #7 0x0000007f87b1d5c8 in start_thread (arg=0x0) at ./nptl/pthread_create.c:442 #8 0x0000007f87b85d9c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:79

zxf101826 avatar Jun 24 '24 06:06 zxf101826

@Chaochao0215 Did you find the cause of this problem in the end? How did you solve it? I also encountered the same problem when using MultiThreadedExecutor, The program stack is as follows

Program terminated with signal SIGSEGV, Segmentation fault.

could you give a minimum demo code to regenerate this error ? I tried codes above, but it didn't crash.

Zard-C avatar Jun 24 '24 10:06 Zard-C

@Chaochao0215 Did you find the cause of this problem in the end? How did you solve it? I also encountered the same problem when using MultiThreadedExecutor, The program stack is as follows Program terminated with signal SIGSEGV, Segmentation fault.

could you give a minimum demo code to regenerate this error ? I tried codes above, but it didn't crash.

My code is mixed with a lot of logic, I don't know which call caused it, this crash is quite sporadic, The program only appears once every few days.

zxf101826 avatar Jun 25 '24 02:06 zxf101826

This issue looks similar to https://github.com/ros2/rmw_fastrtps/issues/777.

adamlm avatar Sep 08 '24 18:09 adamlm