rmw_fastrtps: Segmentation fault in the rmw_fastrtps layer when constructing an rclcpp::Node

Bug report

Required Info:

Operating System: Ubuntu 20.04
Installation type:
- package manager, i.e.:
- sudo apt install ros-foxy-desktop
Version or commit hash:
- foxy
DDS implementation:
- eProsima's Fast-RTPS DDS implementation
Client library (if applicable):
- rclcpp

Problem Description

Hello everyone! We're hitting a segfault inside the rclcpp::Node constructor, and further investigation of the call stack led us here. I'll post the call stack we could extract in the Actual behavior section. Please note that, due to the nature of the issue, I doubt it will reproduce just by copy-pasting the source code.

Additional Information.

We aren't using ROS 2 via workspaces. Since rclcpp is basically a C++ project and ament_cmake is CMake with tricks, see the repro steps for how we pulled the ROS libraries into the project.
You'll see from the example code section that this issue is very baffling. The crash varies depending on code we shouldn't even reach and on how much of the original codebase we "uncomment" (we stumbled on this during investigation).
We do have another project that uses different libraries and adds other business logic before the common codebase, and rclcpp::Node constructs perfectly there: no error, no crash.
In this failing project we do use asio; see the debug_deadlocked.txt attachment below for why this might be relevant(?)

Steps to reproduce issue

CMake commands to include rclcpp

project("dummy")
#...  Random libs we would use normally 
find_package(rclcpp CONFIG REQUIRED)
ament_target_dependencies(${PROJECT_NAME} PUBLIC rclcpp)

C++ source code reduced to the smallest example

#include <rclcpp/rclcpp.hpp>

int main(int argc, char** argv)
{
    rclcpp::init(argc, argv);
    rclcpp::Node some_node("TEST_SEGFAULT"); ///< [1]*
    rclcpp::shutdown();
    return 0; ///< note we aren't even executing any code below this anyway

    /*
     * Commented out code ///< This is relevant [2]*
     */
}

[1] Causes many things including:

The segfault, if I leave the codebase as is (nothing commented out).
If I comment out all of the code at [2], I can observe this error thrown from within the Node(string) constructor: exception_unwind.txt
If I leave some of the original logic uncommented (basic things like logging in a separate function, calling that function, asio(!) threading, etc.), [1] seemingly deadlocks. Here's the call stack when I inspect it in a debugger: debug_deadlocked.txt

[2] is project-related code. The segfault only happens in this project (as mentioned under Additional Information, we have another project, structured differently, where this doesn't fail at all).

Expected behavior

Out-of-the-box installed ros2 foxy works well... out-of-the-box...

Actual behavior

Callstack of the original segfault: callstack.txt

Feb 11 '22 17:02 AegisField

We aren't using ROS2 via workspaces

Can you use a ROS 2 colcon workspace to see if this problem is reproducible? At least, I have not run into anything like this.

Feb 11 '22 20:02 fujitatomoya

We can certainly try it with workspaces but 1) it wouldn't solve our problem in the long run, 2) Looking at the weird behavior with the commented out "dead" code after the return 0; I wouldn't expect this to be reproducible for anyone just by copy-pasting the ros code snippet.

One more weird detail: at the beginning of January this worked without an issue. Now I even went back to that exact commit in our repository, and it fails with the above too. The ROS 2 installation is newer, and maybe some third-party libraries are too, due to Linux auto-updates.

So if you don't mind, while also trying it out with colcon, we'll pursue this third-party interference angle internally and tell you what solution we arrive at, so that others, if any, have something to go on if they encounter this.

Feb 14 '22 09:02 AegisField

Hello there again! So, we've solved the issue on our end by going off on this third-party clash tangent. TL;DR: the solution was to switch the RMW implementation (we chose Cyclone DDS in this instance, but really anything else would've been fine):

sudo apt install ros-foxy-rmw-cyclonedds-cpp
export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
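
To check from inside a program which implementation was requested, a small helper along these lines could be used. This is only a sketch: `selected_rmw` is a hypothetical name, it reads nothing but the `RMW_IMPLEMENTATION` environment variable, and the `rmw_fastrtps_cpp` fallback assumes the usual default of a Foxy binary install.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper: returns the RMW implementation requested through the
// RMW_IMPLEMENTATION environment variable, or `fallback` when the variable is
// unset or empty. It only inspects the environment; it does not ask ROS which
// library was actually loaded.
std::string selected_rmw(const std::string& fallback = "rmw_fastrtps_cpp") {
    assert(!fallback.empty() && "a non-empty fallback name is expected");
    const char* value = std::getenv("RMW_IMPLEMENTATION");
    if (value == nullptr || *value == '\0') {
        return fallback;  // assumed default, see the note above
    }
    return std::string(value);
}
```

Logging the result of `selected_rmw()` right before `rclcpp::init` shows whether the export reached the process. Inside a ROS 2 process, `rmw_get_implementation_identifier()` from `rmw/rmw.h` reports the implementation that was actually loaded, which is the stronger check.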

The issue - we think

The main difference between our two projects (the crashing one vs. the working one) was the first few layers of how the inner classes are linked together into a coherent system. In the crashing project we did use asio, which we built ourselves.
Now if you look at debug_deadlocked.txt, the FastDDS implementation actually deadlocked in some asio mutex, which was weird, but it definitely showed us that asio is linked by that RMW implementation.
Since we've had issues with asio being linked multiple times in the past, we went ahead and tried changing the RMW implementation and voilà, it works like a charm. Not a single error randomly thrown in a lower-layer call, let alone a segfault.

*The reason I wrote 'we think' is that I didn't trace the ROS calls through the whole stack trace and hidden symbols, and we had no idea how and which libraries ROS should find anyway. The deadlock showed us enough to reasonably try this solution, but the answer might not be as helpful in this form.

Feb 17 '22 13:02 AegisField

@AegisField

First, good to know that you were able to work around the problem.

Now if you look at debug_deadlocked.txt FastDDS impl actually deadlocked in asio's random mutex

I would not call this a deadlock; what we are looking at is a single thread's call stack. I do not think the same mutex is being acquired twice within this thread's stack, so this could only occur in a multithreaded program. If you want to get to the bottom of it, I believe we need to see the call stack of the thread that is taking the same locks in reverse order.
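
As a generic illustration of what "taking the same locks in reverse order" means, here is a minimal C++17 sketch. The `Shared` struct and the function names are hypothetical and unrelated to the actual Fast DDS or asio code; the deadlock-prone ordering appears only as a comment, and the executable part uses `std::scoped_lock` to avoid it.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Illustrative shared state; the names are hypothetical.
struct Shared {
    std::mutex a;
    std::mutex b;
    int counter = 0;
};

// Deadlock-prone variant, shown only as a comment:
//   thread 1: lock(a); lock(b);
//   thread 2: lock(b); lock(a);
// Each thread can end up holding one mutex while waiting for the other.

// Safe variant: std::scoped_lock acquires both mutexes with a
// deadlock-avoidance algorithm, so the argument order does not matter.
void path_one(Shared& s) {
    std::scoped_lock lock(s.a, s.b);
    ++s.counter;
}

void path_two(Shared& s) {
    std::scoped_lock lock(s.b, s.a);
    ++s.counter;
}

// Runs both paths concurrently and returns the final counter value.
int run_both(int iterations) {
    assert(iterations >= 0);
    Shared s;
    std::thread t1([&s, iterations] { for (int i = 0; i < iterations; ++i) path_one(s); });
    std::thread t2([&s, iterations] { for (int i = 0; i < iterations; ++i) path_two(s); });
    t1.join();
    t2.join();
    return s.counter;
}
```

In a real deadlock of this kind, the debugger would show two threads, each blocked in a mutex lock while holding the mutex the other one wants. That is why a single thread's stack is not enough to confirm it.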

Feb 18 '22 16:02 fujitatomoya

I would like to add that I was getting the same issue with ROS Humble on Ubuntu 22.04. I have a similar setup working outside of a colcon workspace (I have to use scons), though I was using rclc.

Switching to cyclone-dds seems to have addressed my issues as well.

Jul 18 '23 04:07 OffsetMOSFET
