rmw_fastrtps
Segmentation fault in the rmw_fastrtps layer when constructing an rclcpp::Node
Bug report
Required Info:
- Operating System: Ubuntu 20.04
- Installation type:
- package manager way ie.:
sudo apt install ros-foxy-desktop
- Version or commit hash:
- foxy
- DDS implementation:
- eProsima's Fast-RTPS DDS implementation
- Client library (if applicable):
- rclcpp
Problem Description
Hello everyone! We're getting a segfault inside the rclcpp::Node constructor, and further investigation of the callstack led us here. I'll post the callstack we could extract in the Actual behavior section. Please note that, given the nature of the issue, I doubt it will reproduce just by copy-pasting the source code.
Additional Information.
- We aren't using ROS2 via workspaces. Since rclcpp is basically a C++ project and ament_cmake is CMake with tricks, see the repro steps for how we pulled the ROS libs into the project.
- You'll see from the example code section that this issue is very baffling. The crash varies depending on code we shouldn't even reach, and on how much we "uncomment" from the original codebase (our investigation steps showed this more or less by accident).
- We do have another project which uses different libs and adds other business logic before the common codebase, and rclcpp::Node constructs perfectly there: no error, no crash.
- In this failing project we do use asio; see the debug_deadlocked.txt attachment below for why this might be relevant(?)
Steps to reproduce issue
CMake commands to include rclcpp
project("dummy")
#... Random libs we would use normally
find_package(rclcpp CONFIG REQUIRED)
ament_target_dependencies(${PROJECT_NAME} PUBLIC rclcpp)
Source c++ code reduced to smallest example
#include <rclcpp/rclcpp.hpp>

int main(int argc, char** argv)
{
    rclcpp::init(argc, argv);
    rclcpp::Node some_node("TEST_SEGFAULT"); ///< [1]*
    rclcpp::shutdown();
    return 0; ///< note we aren't even executing any code below this anyway
    /*
     * Commented out code ///< This is relevant [2]*
     */
}
[1] Causes many things including:
- the segfault, if I leave the codebase as is (nothing commented out)
- If I comment out all of the code at [2], I can observe this error thrown from within the Node(string) constructor: exception_unwind.txt
- If I leave some of the original logic uncommented (basic things like logging in a separate function, calling that function, asio(!) threading, etc.), [1] seemingly deadlocks. Here's the callstack for that when I inspect it via debugger: debug_deadlocked.txt
[2] is project-related code. The segfault only happens in this project (as mentioned under Additional Information, we do have another project, structured differently, where this doesn't fail at all).
Expected behavior
Out-of-the-box installed ros2 foxy works well... out-of-the-box...
Actual behavior
Callstack of the original segfault: callstack.txt
We aren't using ROS2 via workspaces
Can you try a ROS 2 colcon workspace to see if this problem is reproducible there? At least, I have not run into anything like this.
We can certainly try it with workspaces but 1) it wouldn't solve our problem in the long run, 2) Looking at the weird behavior with the commented out "dead" code after the return 0; I wouldn't expect this to be reproducible for anyone just by copy-pasting the ros code snippet.
One more weird detail: In the beginning of January this worked without an issue. Now I even went back to that exact commit in our repository and that fails with the above too. The ros2 installation is newer, and maybe some 3rdparties too due to auto-update of Linux.
So if you don't mind, while also trying it out with colcon, internally we'll keep debugging along this 3rdparty-interference line and tell you what solution we arrive at, so others, if any, will have something to go on if they encounter this.
Hello there again! So we've solved the issue on our end by going off on this 3rdparty-clash tangent. TL;DR: the solution was to switch the RMW implementation (we chose Cyclone DDS in this instance, but really anything else would have been fine):
sudo apt install ros-foxy-rmw-cyclonedds-cpp
export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
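To confirm that the exported variable actually reaches the process that constructs the node, a small check can be run early in startup. This is a minimal sketch using only the C++ standard library; the helper names `selected_rmw_implementation` and `cyclonedds_requested` are hypothetical and are not part of rclcpp or rmw. It reports what the environment requests, not which RMW library was actually loaded.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: returns the value of RMW_IMPLEMENTATION as seen by
// this process, or an empty string when the variable is not set.
std::string selected_rmw_implementation()
{
    const char* value = std::getenv("RMW_IMPLEMENTATION");
    return value != nullptr ? std::string(value) : std::string();
}

// Hypothetical helper: true when the environment asks for the Cyclone DDS
// RMW layer, i.e. the value exported in the workaround above.
bool cyclonedds_requested()
{
    return selected_rmw_implementation() == "rmw_cyclonedds_cpp";
}
```

Calling `cyclonedds_requested()` before `rclcpp::init` and logging the result is one way to catch the case where the export was done in a different shell or was lost by a launcher script.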
The issue - we think
- The main difference between our 2 projects (the crashing one vs. the working one) was in the first few layers of how the inner classes are linked together into a coherent system. In the crashing project we used asio, which we built ourselves.
- Now if you look at debug_deadlocked.txt, the FastDDS impl actually deadlocked in some asio mutex, which was weird, but it definitely showed us that asio is linked by that RMW impl.
- Since we've had issues with asio multi-linkage in the past, we went ahead and tried changing the RMW impl and voilà, it works like a charm. Not a single error randomly thrown in a lower-layer call, let alone a segfault.
*The reason I wrote 'we think' is that I didn't trace the ROS calls through the whole stack trace and hidden symbols, and we had no idea how and which libs ROS should find anyway. The deadlock showed us enough to reasonably try this solution, but the answer might not be as helpful in this form.
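One way to narrow down a suspected multi-linkage like this is to list the shared objects that are actually loaded into the failing process. The following is a minimal sketch, assuming Linux with glibc (it relies on `dl_iterate_phdr` from `<link.h>`); the helper names are hypothetical. Since asio is typically header-only, it will not show up as a library of its own, so this only identifies which libraries are present. Checking whether two of them carry their own copy of the asio symbols is a separate step, for example with `nm -DC` on each file.

```cpp
#include <link.h>  // dl_iterate_phdr, dl_phdr_info (glibc, Linux only)

#include <cstddef>
#include <string>
#include <vector>

namespace {
// Callback invoked by dl_iterate_phdr once per loaded object.
// Collects every non-empty object name into the vector passed via `data`.
int collect_name(dl_phdr_info* info, std::size_t /*size*/, void* data)
{
    auto* names = static_cast<std::vector<std::string>*>(data);
    if (info->dlpi_name != nullptr && info->dlpi_name[0] != '\0') {
        names->emplace_back(info->dlpi_name);
    }
    return 0;  // returning 0 continues the iteration
}
}  // namespace

// Hypothetical helper: names of all shared objects loaded in this process.
std::vector<std::string> loaded_shared_objects()
{
    std::vector<std::string> names;
    dl_iterate_phdr(collect_name, &names);
    return names;
}

// Hypothetical helper: loaded objects whose name contains `needle`,
// e.g. "fastrtps" or "cyclonedds".
std::vector<std::string> loaded_objects_containing(const std::string& needle)
{
    std::vector<std::string> matches;
    for (const auto& name : loaded_shared_objects()) {
        if (name.find(needle) != std::string::npos) {
            matches.push_back(name);
        }
    }
    return matches;
}
```

Calling `loaded_objects_containing("fastrtps")` after the node is constructed would show whether, and from which path, the Fast DDS libraries were pulled in.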
@AegisField
First, good to know that you were able to work around the problem.
Now if you look at debug_deadlocked.txt FastDDS impl actually deadlocked in asio's random mutex
I would not call this a deadlock: what we are looking at is a single thread's call stack, and I do not think the same mutex is being acquired twice within it. This kind of hang can occur in a multithreaded program, though. If you want to get to the bottom of it, I believe we need to see the call stack of the other thread that is taking the same locks in reverse order.
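To illustrate what "the same locks in reverse order" means: a deadlock of this kind needs two threads, each holding one mutex and waiting for the other, which is why a single call stack cannot prove it. The sketch below is illustrative only, with hypothetical names, and is not code from Fast DDS or asio. Two functions name the same pair of mutexes in opposite order; taking them with plain sequential `lock()` calls could hang, while `std::lock` acquires both with a deadlock-avoidance algorithm.

```cpp
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical shared state guarded by two separate mutexes.
struct Counters {
    std::mutex a_mutex;
    std::mutex b_mutex;
    int a = 0;
    int b = 0;
};

// Names the mutexes in the order a, b.
void move_a_to_b(Counters& c)
{
    std::lock(c.a_mutex, c.b_mutex);  // acquires both without deadlocking
    std::lock_guard<std::mutex> guard_a(c.a_mutex, std::adopt_lock);
    std::lock_guard<std::mutex> guard_b(c.b_mutex, std::adopt_lock);
    --c.a;
    ++c.b;
}

// Names the same mutexes in the reverse order b, a. With two plain
// sequential lock() calls here and in move_a_to_b, each thread could end up
// holding one mutex while waiting forever for the other.
void move_b_to_a(Counters& c)
{
    std::lock(c.b_mutex, c.a_mutex);
    std::lock_guard<std::mutex> guard_b(c.b_mutex, std::adopt_lock);
    std::lock_guard<std::mutex> guard_a(c.a_mutex, std::adopt_lock);
    --c.b;
    ++c.a;
}

// Runs both functions concurrently and returns the final (a, b) values.
std::pair<int, int> run_transfers(int iterations)
{
    Counters c;
    std::thread first([&c, iterations] {
        for (int i = 0; i < iterations; ++i) move_a_to_b(c);
    });
    std::thread second([&c, iterations] {
        for (int i = 0; i < iterations; ++i) move_b_to_a(c);
    });
    first.join();
    second.join();
    return {c.a, c.b};
}
```

In a debugger, the unsafe variant would show up as two blocked threads, one per function, each waiting in a mutex lock call. That pair of stacks is the evidence being asked for here.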
I would like to add that I was getting the same issue with ROS 2 Humble on Ubuntu 22.04. I have a similar setup working outside of a colcon workspace (I have to use scons), though I was using rclc.
Switching to Cyclone DDS seems to have addressed my issues as well.