ros2_control
Possible race condition in controller manager on Humble
Describe the bug
There is a race condition on calls to rclcpp_lifecycle::State::get_state() in the controller manager. The get_state() method is not thread-safe, as explained in this issue. This was fixed in Rolling, but not in Humble, so any call to get_state() can potentially throw a std::runtime_error with the following message:
Error in state! Internal state_handle is NULL.
To Reproduce
It is hard to reproduce because it happens randomly. One of the controllers we implemented called the same get_state() method, which triggered the error and crashed the controller_manager. But basically, any call to get_state() can lead to this issue.
There are three "paths" of calls that can lead to this method being called:
- any service call (since they are mutually exclusive thanks to the service mutex, we can consider them as one path)
- the diagnostic updater
- the update method
We have crafted a node that is a copy of the ros2_control_node, where we purposely call list_controllers_service_cb in a separate thread many times to trigger the crash.
Here is a git diff of the changes that you can apply to this repo to reproduce the crash: reproductible_bug.diff.txt (note that I had to add the .txt extension for GitHub to accept uploading the file).
Expected behavior
No crash. We might need to protect the call to get_state() with yet another mutex, but the best solution would be for rclcpp_lifecycle to backport their fix to Humble.
Environment (please complete the following information):
- OS: Ubuntu 22.04
- Version: Humble
I can verify I am facing the same issue on Humble (Ubuntu 22.04) with the latest packages of this month.
No custom controller, just using joint_trajectory_controller/JointTrajectoryController!
Below is the stack trace after it crashes. It happens very often (almost 50% of the time), but randomly.
[ros2_control_node-7] terminate called after throwing an instance of 'std::runtime_error'
[ros2_control_node-7] what(): Error in state! Internal state_handle is NULL.
[ros2_control_node-7] Stack trace (most recent call last) in thread 167431:
[ros2_control_node-7] #16 Object "", at 0xffffffffffffffff, in
[ros2_control_node-7] #15 Source "../sysdeps/unix/sysv/linux/x86_64/clone.S", line 100, in __clone [0x7f6b1284dbb3]
[ros2_control_node-7] #14 Source "./nptl/pthread_create.c", line 442, in start_thread [0x7f6b127bcb42]
[ros2_control_node-7] #13 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f6b12a4c2b2, in
[ros2_control_node-7] #12 Object "/opt/ros/humble/lib/librclcpp.so", at 0x7f6b12c89909, in rclcpp::executors::MultiThreadedExecutor::run(unsigned long)
[ros2_control_node-7] #11 Object "/opt/ros/humble/lib/librclcpp.so", at 0x7f6b12c82650, in rclcpp::Executor::execute_any_executable(rclcpp::AnyExecutable&)
[ros2_control_node-7] #10 Object "/opt/ros/humble/lib/libcontroller_manager.so", at 0x7f6b12e56c94, in
[ros2_control_node-7] #9 Object "/opt/ros/humble/lib/libcontroller_manager.so", at 0x7f6b12e0f22c, in diagnostic_updater::Updater::update()
[ros2_control_node-7] #8 Object "/opt/ros/humble/lib/libcontroller_manager.so", at 0x7f6b12e23748, in controller_manager::ControllerManager::controller_activity_diagnostic_callback(diagnostic_updater::DiagnosticStatusWrapper&)
[ros2_control_node-7] #7 Object "/opt/ros/humble/lib/librclcpp_lifecycle.so", at 0x7f6b1257f981, in
[ros2_control_node-7] #6 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f6b12a1e517, in __cxa_throw
[ros2_control_node-7] #5 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f6b12a1e2b6, in std::terminate()
[ros2_control_node-7] #4 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f6b12a1e24b, in
[ros2_control_node-7] #3 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f6b12a12bbd, in
[ros2_control_node-7] #2 Source "./stdlib/abort.c", line 79, in abort [0x7f6b127507f2]
[ros2_control_node-7] #1 Source "../sysdeps/posix/raise.c", line 26, in raise [0x7f6b1276a475]
[ros2_control_node-7] #0 | Source "./nptl/pthread_kill.c", line 89, in __pthread_kill_internal
[ros2_control_node-7] | Source "./nptl/pthread_kill.c", line 78, in __pthread_kill_implementation
[ros2_control_node-7] Source "./nptl/pthread_kill.c", line 44, in __pthread_kill [0x7f6b127bea7c]
[ros2_control_node-7] Aborted (Signal sent by tkill() 167250 1000)
[ERROR] [ros2_control_node-7]: process has died [pid 167250, exit code -6, cmd '/opt/ros/humble/lib/controller_manager/ros2_control_node --ros-ar...
Unfortunately, I asked in https://github.com/ros2/rclcpp/pull/1756 and there won't be a backport; it seems it is an ABI-breaking fix.
@DatSpace how did you reproduce this? @roncapat I didn't notice that you asked for the backport, that's a shame. It means users need to protect access to this method, which is neither ideal nor safe.
I just encountered the same issue when soak testing my robot (running the robot non-stop for days). This is the output of the terminal:
[ros2_control_node-1] terminate called after throwing an instance of 'std::runtime_error'
[ros2_control_node-1] what(): Error in state! Internal state_handle is NULL.
[ros2_control_node-1] Stack trace (most recent call last) in thread 44154:
[ros2_control_node-1] #14 Object "/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1", at 0xffffffffffffffff, in
[ros2_control_node-1] #13 Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0x7f96385d1b, in
[ros2_control_node-1] #12 Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0x7f9631d5c7, in
[ros2_control_node-1] #11 Object "/usr/lib/aarch64-linux-gnu/libstdc++.so.6.0.30", at 0x7f965531fb, in
[ros2_control_node-1] #10 Object "/opt/ros/humble/lib/controller_manager/ros2_control_node", at 0x557944384b, in
[ros2_control_node-1] #9 Object "/opt/ros/humble/lib/libcontroller_manager.so", at 0x7f9695e3af, in controller_manager::ControllerManager::update(rclcpp::Time const&, rclcpp::Duration const&)
[ros2_control_node-1] #8 Object "/opt/ros/humble/lib/libdiff_drive_controller.so", at 0x7f904cb64f, in diff_drive_controller::DiffDriveController::update(rclcpp::Time const&, rclcpp::Duration const&)
[ros2_control_node-1] #7 Object "/opt/ros/humble/lib/librclcpp_lifecycle.so", at 0x7f95ffc5bb, in rclcpp_lifecycle::State::id() const
[ros2_control_node-1] #6 Object "/usr/lib/aarch64-linux-gnu/libstdc++.so.6.0.30", at 0x7f96522d23, in __cxa_throw
[ros2_control_node-1] #5 Object "/usr/lib/aarch64-linux-gnu/libstdc++.so.6.0.30", at 0x7f96522a3f, in std::terminate()
[ros2_control_node-1] #4 Object "/usr/lib/aarch64-linux-gnu/libstdc++.so.6.0.30", at 0x7f965229db, in
[ros2_control_node-1] #3 Object "/usr/lib/aarch64-linux-gnu/libstdc++.so.6.0.30", at 0x7f965251fb, in __gnu_cxx::__verbose_terminate_handler()
[ros2_control_node-1] #2 Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0x7f962c712f, in abort
[ros2_control_node-1] #1 Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0x7f962da67b, in raise
[ros2_control_node-1] #0 Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0x7f9631f200, in
[ros2_control_node-1] Aborted (Signal sent by tkill() 43824 0)
Is there any way to fix this? Using Humble 22.04
Quick follow-up question. In my case, respawning the ros2_control_node when it crashes would fix the issue, but because I am using the controller_manager spawner to load the joint_state_broadcaster and the base_controller of my robot, which "die" after loading the controller, respawning the ros2_control_node does not fully solve it. Is there a way to write the launch file in Python so that if the ros2_control_node dies it relaunches along with the two controller_manager spawners?
The relevant section of my launch file is the following:
control_node = Node(
    package="controller_manager",
    executable="ros2_control_node",
    parameters=[robot_description, robot_controllers],
    remappings=[("/diagnostics", DIAGNOSTICS_CONTROL_TOPIC)],
    output={
        "stdout": "screen",
        "stderr": "screen",
    },
    respawn=True,
    respawn_delay=2,
)

joint_state_broadcaster_spawner = Node(
    package="controller_manager",
    executable="spawner",
    arguments=["joint_state_broadcaster",
               "-c",
               "/controller_manager",
               "--controller-manager-timeout",
               "10"],
)

robot_controller_spawner = Node(
    package="controller_manager",
    executable="spawner",
    arguments=["base_controller",
               "-c",
               "/controller_manager",
               "--controller-manager-timeout",
               "10"],
)
@pepeRossRobotics The only way we found to prevent the crash from happening is forking ros2_control and surrounding the place where the problematic method is called with a while/try/catch, but that is not a good way to do it.
Concerning the launch file, I believe you can use event handlers to start the "spawner" when the controller_manager process starts. I haven't tried it myself, but I think it should work.
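Something along these lines might be a starting point (an untested sketch, reusing the control_node, joint_state_broadcaster_spawner, and robot_controller_spawner actions from your snippet above; I am not certain whether OnProcessStart fires again after a respawn, so that part needs checking):

from launch import LaunchDescription
from launch.actions import RegisterEventHandler
from launch.event_handlers import OnProcessStart

def generate_launch_description():
    # control_node, joint_state_broadcaster_spawner and robot_controller_spawner
    # are the Node actions from your snippet above.
    # Run the spawners only once the ros2_control_node process has started.
    spawners_on_control_node_start = RegisterEventHandler(
        event_handler=OnProcessStart(
            target_action=control_node,
            on_start=[joint_state_broadcaster_spawner, robot_controller_spawner],
        )
    )
    return LaunchDescription([
        control_node,
        spawners_on_control_node_start,
    ])

Since the spawners exit by design after loading their controllers, re-triggering them whenever the controller_manager comes back up should, in principle, bring you back to the same state as the initial launch.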
Many thanks for the help. I assumed something like this was possible but didn't know how to implement it. I will give it a go now.
Concerning the launch file, I believe you can use event handlers to start the "spawner" when the controller_manager process starts. I haven't tried it myself, but I think it should work.
There are some of those in our example code. Maybe you find this useful: https://github.com/StoglRobotics/ros_team_workspace/blob/master/templates/ros2_control/robot_ros2_control.launch.py#L186
This is really helpful, many thanks!
Just wanted to add another stack trace to the pile. This is also on Humble on Ubuntu 22.04 LTS.
[ros2_control_node-1] terminate called after throwing an instance of 'std::runtime_error'
[ros2_control_node-1] what(): Error in state! Internal state_handle is NULL.
[ros2_control_node-1] Stack trace (most recent call last) in thread 178:
[ros2_control_node-1] #13 Object "", at 0xffffffffffffffff, in
[ros2_control_node-1] #12 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fead00ce9ff, in
[ros2_control_node-1] #11 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fead003cb42, in
[ros2_control_node-1] #10 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fead02cc2b2, in
[ros2_control_node-1] #9 Object "/opt/ros/humble/lib/controller_manager/ros2_control_node", at 0x564dbe0793db, in
[ros2_control_node-1] #8 Object "/opt/ros/humble/lib/libcontroller_manager.so", at 0x7fead069f61c, in controller_manager::ControllerManager::update(rclcpp::Time const&, rclcpp::Duration const&)
[ros2_control_node-1] #7 Object "/opt/ros/humble/lib/librclcpp_lifecycle.so", at 0x7feacfe3d981, in
[ros2_control_node-1] #6 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fead029e517, in __cxa_throw
[ros2_control_node-1] #5 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fead029e2b6, in std::terminate()
[ros2_control_node-1] #4 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fead029e24b, in
[ros2_control_node-1] #3 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fead0292bbd, in
[ros2_control_node-1] #2 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7feacffd07f2, in abort
[ros2_control_node-1] #1 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7feacffea475, in raise
[ros2_control_node-1] #0 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fead003ea7c, in pthread_kill
[ros2_control_node-1] Aborted (Signal sent by tkill() 104 0)
Two updates on this:
- I made a PR to rclcpp to patch the thread safety issue in the lifecycle node state: https://github.com/ros2/rclcpp/pull/2183. This will probably take a while to get merged and released in Humble.
- @EzraBrooks found that this is probably due to a regression that was introduced in #922, which adds a new call to get_state() from within a timer callback. On Humble, get_state() isn't thread-safe, so the addition of this callback caused the following release of ros2_control on Humble to segfault.
I think in the near term #922 should be reverted. While diagnostic info is nice to have, it's not acceptable for the robot control nodes to segfault like this.
@schornakj Thanks for the work and the PR to rclcpp.
Yes, we also noticed that the issue appeared when the diagnostics were added, but we weren't sure that it was the only source of errors.
As you can see in the diagram I posted in the first message, there are three call hierarchies that lead to calling get_state():
- service callbacks (which are mutually exclusive)
- the update method
- the diagnostic timer
We guessed that there is some mechanism that prevents the get_state() method from being called from both the update method and the service callbacks, but we didn't look too much into it.
But, yeah, once you add the diagnostic timer, there is definitely a race with the two other threads.