ros2_control
Possible race condition in controller manager on Humble
Describe the bug
There is a race condition on calls to rclcpp_lifecycle::State::get_state() in the controller manager. The get_state() method is not thread-safe, as explained in this issue. This was fixed in Rolling, but not in Humble, so any call to get_state() can potentially throw a std::runtime_error with the following message:
Error in state! Internal state_handle is NULL.
To Reproduce
It is hard to reproduce because it happens randomly. One of the controllers we implemented called the same get_state() method, which triggered the error and crashed the controller_manager. But basically, any call to get_state() can lead to this issue.
There are three "paths" of calls that can lead to this method being called:
- any service call (since they are mutually exclusive thanks to the service mutex, we can consider them as one path)
- the diagnostic updater
- the update method
We have crafted a node that is a copy of the ros2_control_node, where we purposely call list_controllers_service_cb in a separate thread many times to trigger the crash.
Here is a git diff of the changes that you can apply to this repo to reproduce the crash: reproductible_bug.diff.txt (note that I had to add the .txt extension for GitHub to accept uploading the file).
Expected behavior
No crash. We might need to protect the call to get_state() with yet another mutex, but the best solution would be for rclcpp_lifecycle to backport their fix to Humble.
Environment (please complete the following information):
- OS: Ubuntu 22.04
- Version: Humble
I can verify I am facing the same issue on Humble (Ubuntu 22.04) with the latest packages of this month.
No custom controller, just using joint_trajectory_controller/JointTrajectoryController!
Below is the stack trace after it crashes. It happens very often (almost 50% of the time), but randomly.
[ros2_control_node-7] terminate called after throwing an instance of 'std::runtime_error'
[ros2_control_node-7] what(): Error in state! Internal state_handle is NULL.
[ros2_control_node-7] Stack trace (most recent call last) in thread 167431:
[ros2_control_node-7] #16 Object "", at 0xffffffffffffffff, in
[ros2_control_node-7] #15 Source "../sysdeps/unix/sysv/linux/x86_64/clone.S", line 100, in __clone [0x7f6b1284dbb3]
[ros2_control_node-7] #14 Source "./nptl/pthread_create.c", line 442, in start_thread [0x7f6b127bcb42]
[ros2_control_node-7] #13 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f6b12a4c2b2, in
[ros2_control_node-7] #12 Object "/opt/ros/humble/lib/librclcpp.so", at 0x7f6b12c89909, in rclcpp::executors::MultiThreadedExecutor::run(unsigned long)
[ros2_control_node-7] #11 Object "/opt/ros/humble/lib/librclcpp.so", at 0x7f6b12c82650, in rclcpp::Executor::execute_any_executable(rclcpp::AnyExecutable&)
[ros2_control_node-7] #10 Object "/opt/ros/humble/lib/libcontroller_manager.so", at 0x7f6b12e56c94, in
[ros2_control_node-7] #9 Object "/opt/ros/humble/lib/libcontroller_manager.so", at 0x7f6b12e0f22c, in diagnostic_updater::Updater::update()
[ros2_control_node-7] #8 Object "/opt/ros/humble/lib/libcontroller_manager.so", at 0x7f6b12e23748, in controller_manager::ControllerManager::controller_activity_diagnostic_callback(diagnostic_updater::DiagnosticStatusWrapper&)
[ros2_control_node-7] #7 Object "/opt/ros/humble/lib/librclcpp_lifecycle.so", at 0x7f6b1257f981, in
[ros2_control_node-7] #6 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f6b12a1e517, in __cxa_throw
[ros2_control_node-7] #5 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f6b12a1e2b6, in std::terminate()
[ros2_control_node-7] #4 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f6b12a1e24b, in
[ros2_control_node-7] #3 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f6b12a12bbd, in
[ros2_control_node-7] #2 Source "./stdlib/abort.c", line 79, in abort [0x7f6b127507f2]
[ros2_control_node-7] #1 Source "../sysdeps/posix/raise.c", line 26, in raise [0x7f6b1276a475]
[ros2_control_node-7] #0 | Source "./nptl/pthread_kill.c", line 89, in __pthread_kill_internal
[ros2_control_node-7] | Source "./nptl/pthread_kill.c", line 78, in __pthread_kill_implementation
[ros2_control_node-7] Source "./nptl/pthread_kill.c", line 44, in __pthread_kill [0x7f6b127bea7c]
[ros2_control_node-7] Aborted (Signal sent by tkill() 167250 1000)
[ERROR] [ros2_control_node-7]: process has died [pid 167250, exit code -6, cmd '/opt/ros/humble/lib/controller_manager/ros2_control_node --ros-ar...
Unfortunately, I asked in https://github.com/ros2/rclcpp/pull/1756 and there won't be a backport; it seems it is an ABI-breaking fix.
@DatSpace how did you reproduce this? @roncapat I didn't notice that you asked for the backport, that's a shame. It means users need to protect access to this method, which is neither ideal nor safe.
I just encountered the same issue when soak testing my robot (running the robot non-stop for days). This is the output of the terminal:
[ros2_control_node-1] terminate called after throwing an instance of 'std::runtime_error'
[ros2_control_node-1] what(): Error in state! Internal state_handle is NULL.
[ros2_control_node-1] Stack trace (most recent call last) in thread 44154:
[ros2_control_node-1] #14 Object "/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1", at 0xffffffffffffffff, in
[ros2_control_node-1] #13 Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0x7f96385d1b, in
[ros2_control_node-1] #12 Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0x7f9631d5c7, in
[ros2_control_node-1] #11 Object "/usr/lib/aarch64-linux-gnu/libstdc++.so.6.0.30", at 0x7f965531fb, in
[ros2_control_node-1] #10 Object "/opt/ros/humble/lib/controller_manager/ros2_control_node", at 0x557944384b, in
[ros2_control_node-1] #9 Object "/opt/ros/humble/lib/libcontroller_manager.so", at 0x7f9695e3af, in controller_manager::ControllerManager::update(rclcpp::Time const&, rclcpp::Duration const&)
[ros2_control_node-1] #8 Object "/opt/ros/humble/lib/libdiff_drive_controller.so", at 0x7f904cb64f, in diff_drive_controller::DiffDriveController::update(rclcpp::Time const&, rclcpp::Duration const&)
[ros2_control_node-1] #7 Object "/opt/ros/humble/lib/librclcpp_lifecycle.so", at 0x7f95ffc5bb, in rclcpp_lifecycle::State::id() const
[ros2_control_node-1] #6 Object "/usr/lib/aarch64-linux-gnu/libstdc++.so.6.0.30", at 0x7f96522d23, in __cxa_throw
[ros2_control_node-1] #5 Object "/usr/lib/aarch64-linux-gnu/libstdc++.so.6.0.30", at 0x7f96522a3f, in std::terminate()
[ros2_control_node-1] #4 Object "/usr/lib/aarch64-linux-gnu/libstdc++.so.6.0.30", at 0x7f965229db, in
[ros2_control_node-1] #3 Object "/usr/lib/aarch64-linux-gnu/libstdc++.so.6.0.30", at 0x7f965251fb, in __gnu_cxx::__verbose_terminate_handler()
[ros2_control_node-1] #2 Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0x7f962c712f, in abort
[ros2_control_node-1] #1 Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0x7f962da67b, in raise
[ros2_control_node-1] #0 Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0x7f9631f200, in
[ros2_control_node-1] Aborted (Signal sent by tkill() 43824 0)
Is there any way to fix this? Using Humble 22.04
Quick follow-up question. In my case, respawning the ros2_control_node when it crashes would fix the issue, but because I am using the controller_manager spawner to load the joint_state_broadcaster and the base_controller of my robot, which "die" after loading the controller, respawning the ros2_control_node does not fully solve it. Is there a way to write the launch file in Python so that if the ros2_control_node dies it relaunches along with the two controller_manager spawners?
The relevant section of my launch file is the following:
control_node = Node(
    package="controller_manager",
    executable="ros2_control_node",
    parameters=[robot_description, robot_controllers],
    remappings=[("/diagnostics", DIAGNOSTICS_CONTROL_TOPIC)],
    output={
        "stdout": "screen",
        "stderr": "screen",
    },
    respawn=True,
    respawn_delay=2,
)

joint_state_broadcaster_spawner = Node(
    package="controller_manager",
    executable="spawner",
    arguments=["joint_state_broadcaster",
               "-c",
               "/controller_manager",
               "--controller-manager-timeout",
               "10"],
)

robot_controller_spawner = Node(
    package="controller_manager",
    executable="spawner",
    arguments=["base_controller",
               "-c",
               "/controller_manager",
               "--controller-manager-timeout",
               "10"],
)
@pepeRossRobotics The only way we found to prevent the crash from happening is forking ros2_control and surrounding the place where the problematic method is called with a while/try/catch, but that is not a good way to do it.
Concerning the launch file, I believe you can use event handlers to start the "spawner" when the controller_manager process starts. I haven't tried it myself, but I think it should work.
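Something along these lines might be a starting point (an untested sketch, reusing the control_node, joint_state_broadcaster_spawner, and robot_controller_spawner actions from your snippet above; I am not certain whether OnProcessStart fires again after a respawn, so that part needs checking):

from launch import LaunchDescription
from launch.actions import RegisterEventHandler
from launch.event_handlers import OnProcessStart

def generate_launch_description():
    # control_node, joint_state_broadcaster_spawner and robot_controller_spawner
    # are the Node actions from your snippet above.
    # Run the spawners only once the ros2_control_node process has started.
    spawners_on_control_node_start = RegisterEventHandler(
        event_handler=OnProcessStart(
            target_action=control_node,
            on_start=[joint_state_broadcaster_spawner, robot_controller_spawner],
        )
    )
    return LaunchDescription([
        control_node,
        spawners_on_control_node_start,
    ])

Since the spawners exit by design after loading their controllers, re-triggering them whenever the controller_manager comes back up should, in principle, bring you back to the same state as the initial launch.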
Many thanks for the help. I assumed something like this was possible but didn't know how to implement it. I will give it a go now.
Concerning the launch file, I believe you can use event handlers to start the "spawner" when the controller_manager process starts. I haven't tried it myself, but I think it should work.
There are some of those in our example code. Maybe you find this useful: https://github.com/StoglRobotics/ros_team_workspace/blob/master/templates/ros2_control/robot_ros2_control.launch.py#L186
This is really helpful, many thanks!
Just wanted to add another stack trace to the pile. This is also on Humble on Ubuntu 22.04 LTS.
[ros2_control_node-1] terminate called after throwing an instance of 'std::runtime_error'
[ros2_control_node-1] what(): Error in state! Internal state_handle is NULL.
[ros2_control_node-1] Stack trace (most recent call last) in thread 178:
[ros2_control_node-1] #13 Object "", at 0xffffffffffffffff, in
[ros2_control_node-1] #12 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fead00ce9ff, in
[ros2_control_node-1] #11 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fead003cb42, in
[ros2_control_node-1] #10 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fead02cc2b2, in
[ros2_control_node-1] #9 Object "/opt/ros/humble/lib/controller_manager/ros2_control_node", at 0x564dbe0793db, in
[ros2_control_node-1] #8 Object "/opt/ros/humble/lib/libcontroller_manager.so", at 0x7fead069f61c, in controller_manager::ControllerManager::update(rclcpp::Time const&, rclcpp::Duration const&)
[ros2_control_node-1] #7 Object "/opt/ros/humble/lib/librclcpp_lifecycle.so", at 0x7feacfe3d981, in
[ros2_control_node-1] #6 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fead029e517, in __cxa_throw
[ros2_control_node-1] #5 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fead029e2b6, in std::terminate()
[ros2_control_node-1] #4 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fead029e24b, in
[ros2_control_node-1] #3 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fead0292bbd, in
[ros2_control_node-1] #2 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7feacffd07f2, in abort
[ros2_control_node-1] #1 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7feacffea475, in raise
[ros2_control_node-1] #0 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fead003ea7c, in pthread_kill
[ros2_control_node-1] Aborted (Signal sent by tkill() 104 0)
Two updates on this:
- I made a PR to rclcpp to patch the thread safety issue in the lifecycle node state: https://github.com/ros2/rclcpp/pull/2183. This will probably take a while to get merged and released in Humble.
- @EzraBrooks found that this is probably due to a regression that was introduced in #922, which adds a new call to get_state() from within a timer callback. On Humble, get_state() isn't thread-safe, so the addition of this callback caused the following release of ros2_control on Humble to segfault.
I think in the near term #922 should be reverted. While diagnostic info is nice to have, it's not acceptable for the robot control nodes to segfault like this.
@schornakj Thanks for the work and the PR to rclcpp.
Yes, we also noticed that the issue appeared when the diagnostics were added, but we weren't sure that it was the only source of errors.
As you can see in the diagram I posted in the first message, there are three call hierarchies that lead to calling get_state():
- service callbacks (which are mutually exclusive)
- the update method
- the diagnostic timer
We guessed that there is some mechanism that prevents the get_state() method from being called from both the update method and the service callbacks, but we didn't look too much into it.
But, yeah, once you add the diagnostic timer, there is definitely a race with the two other threads.