navigation2 icon indicating copy to clipboard operation
navigation2 copied to clipboard

rclcpp_action goal handle timeout logic clearing long-running goals

Open AlexeyMerzlyakov opened this issue 2 years ago • 15 comments

This issue describes the flaky issues appeared with test_waypoint_follower system test

Bug report

Required Info:

  • Operating System:
    • Ubuntu 22.04.3
  • ROS2 Version:
    • ROS2 Rolling (built from sources)
  • Version or commit hash:
    • b6f3b028 (and older 822560ce)
  • DDS implementation:
    • Cyclone DDS

Steps to reproduce issue

  1. Enable visualization for Waypoint Follower system test by following patch:
diff --git a/nav2_system_tests/src/waypoint_follower/test_case_py.launch b/nav2_system_tests/src/waypoint_follower/test_case_py.launch
index adeb839c..c7194fe9 100755
--- a/nav2_system_tests/src/waypoint_follower/test_case_py.launch
+++ b/nav2_system_tests/src/waypoint_follower/test_case_py.launch
@@ -40,6 +40,8 @@ def generate_launch_description():
     bringup_dir = get_package_share_directory('nav2_bringup')
     params_file = os.path.join(bringup_dir, 'params/nav2_params.yaml')
 
+    rviz_config_file = os.path.join(bringup_dir, 'rviz', 'nav2_default_view.rviz')
+
     # Replace the `use_astar` setting on the params file
     configured_params = RewrittenYaml(
         source_file=params_file,
@@ -59,6 +61,13 @@ def generate_launch_description():
                  '--minimal_comms', world],
             output='screen'),
 
+        # Launch rviz
+        Node(
+            package='rviz2',
+            executable='rviz2',
+            arguments=['-d', rviz_config_file],
+            output='screen'),
+
         # TODO(orduno) Launch the robot state publisher instead
         #              using a local copy of TB3 urdf file
         Node(
  1. Run test_waypoint_follower manually, or using the attached below stress-test script: run_stress.sh.txt

Expected behavior

Test passes 30 times in a row.

Actual behavior

The following problems were found during the test:

Problem1: No action

When the robot is moving near the obstacle, it could stuck. Sometimes, this causes recovery behaviors that being triggered. If this recovery was triggered (moreover, it was detected that only "Received request to clear entirely the local_costmap" event is enough for that), and robot successfully reached its goal, but the NavigateToPoseNavigator action returns UNKNOWN status instead of SUCCEEDED, which causes main WaypointFollower cycle to hang:

https://github.com/ros-planning/navigation2/assets/60094858/1731f55f-7fe7-491b-9bdf-1a1f84cf701b

Problem2: Incorrect out-of-map status

Sometimes, when running set waypoint outside of map test scenario, missed_waypoint contains INVALID_PATH=103 status instead of expected ComputePathToPose.Goal().GOAL_OUTSIDE_MAP=204 which gives the following error messages:

[bt_navigator-12] [INFO] [1692161747.546793612] [bt_navigator]: Begin navigating from current location to (100.00, 100.00)
[planner_server-10] [WARN] [1692161747.548072894] [planner_server]: GridBased plugin failed to plan from (0.70, 0.19) to (100.00, 100.00): "Goal Coordinates of(100.000000, 100.000000) was outside bounds"
[planner_server-10] [WARN] [1692161747.548158462] [planner_server]: [compute_path_to_pose] [ActionServer] Aborting handle.
[bt_navigator-12] [ERROR] [1692161747.577081372] [bt_navigator]: Goal failed
[bt_navigator-12] [WARN] [1692161747.577271398] [bt_navigator]: [navigate_to_pose] [ActionServer] Aborting handle.
[tester_node-16] [INFO] [1692161747.949221718] [nav2_waypoint_tester]: Goal failed to process all waypoints, missed 1 wps.
[tester_node-16] Traceback (most recent call last):
[tester_node-16]   File "/home/amerzlyakov/nav2_ws/src/navigation2/nav2_system_tests/src/waypoint_follower/tester.py", line 271, in <module>
[tester_node-16]     main()
[tester_node-16]   File "/home/amerzlyakov/nav2_ws/src/navigation2/nav2_system_tests/src/waypoint_follower/tester.py", line 233, in main
[tester_node-16]     assert test.action_result.missed_waypoints[0].error_code == \
[tester_node-16] AssertionError
Problem3: Robot is getting lost after setting initial pose second time

In the stop on failure test with bogous waypoint test scenario there is test.setInitialPose(starting_pose) called second time while robot is not placed on its initial pose. This causes robot to lose its position. In most cases, robot still operating and reaching its goal; but sometimes since its position is getting lost, robot continues its uncontrolled movement between pillars causing testcase to fail by timeout: Screenshot_2023-08-16_09-41-43_cr Screenshot_2023-08-16_09-41-50_cr

AlexeyMerzlyakov avatar Aug 18 '23 15:08 AlexeyMerzlyakov

The Problem3 is being resolved by removing second setInitialPose() in "stop on failure test with bogous waypoint" case, so robot became to move on map correctly:

diff --git a/nav2_system_tests/src/waypoint_follower/tester.py b/nav2_system_tests/src/waypoint_follower/tester.py
index 79604767..5a9a89cc 100755
--- a/nav2_system_tests/src/waypoint_follower/tester.py
+++ b/nav2_system_tests/src/waypoint_follower/tester.py
@@ -236,9 +236,9 @@ def main(argv=sys.argv[1:]):
     # stop on failure test with bogous waypoint
     test.setStopFailureParam(True)
     bwps = [[-0.52, -0.54], [100.0, 100.0], [0.58, 0.52]]
-    starting_pose = [-2.0, -0.5]
+    #starting_pose = [-2.0, -0.5]
     test.setWaypoints(bwps)
-    test.setInitialPose(starting_pose)
+    #test.setInitialPose(starting_pose)
     result = test.run(True, False)
     assert not result
     result = not result

AlexeyMerzlyakov avatar Aug 18 '23 15:08 AlexeyMerzlyakov

The Problem1 and Problem2 are under investigation, I'll write the details as they will be arrived.

AlexeyMerzlyakov avatar Aug 18 '23 15:08 AlexeyMerzlyakov

Analysis shown that Problem2 appears in case when FollowPath BT node returned non-zero INVALID_PATH=103 status for the preemption goals (for some reasons), while next outside-of-map goal scenario adds GOAL_OUTSIDE_MAP=204 status to ComputePathToPose BT node blackboard. BtActionServer<ActionT>::populateErrorCode() routine will select the status with highest priority (INVALID_PATH < GOAL_OUTSIDE_MAP) and will return status from previous run.

So this problem, I consider, appear because of we do not clean BT node statuses between two consecutive bt_->run() runs, and thus keeping errors from previous BT run.

The following patch seems to work for me:

diff --git a/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server.hpp b/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server.hpp
index 79b1df5b..e0fc13e9 100644
--- a/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server.hpp
+++ b/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server.hpp
@@ -196,6 +196,11 @@ protected:
    */
   void populateErrorCode(typename std::shared_ptr<typename ActionT::Result> result);
 
+  /**
+   * @brief Setting BT error codes to success. Used to clean blackboard between different BT runs
+   */
+  void cleanErrorCodes();
+
   // Action name
   std::string action_name_;
 
diff --git a/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server_impl.hpp b/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server_impl.hpp
index 11110b09..a60fa69f 100644
--- a/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server_impl.hpp
+++ b/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server_impl.hpp
@@ -265,6 +272,8 @@ void BtActionServer<ActionT>::executeCallback()
       action_server_->terminate_all(result);
       break;
   }
+
+  cleanErrorCodes();
 }
 
 template<class ActionT>
@@ -291,6 +302,18 @@ void BtActionServer<ActionT>::populateErrorCode(
   }
 }
 
+template<class ActionT>
+void BtActionServer<ActionT>::cleanErrorCodes()
+{
+  if (!blackboard_) {
+    RCLCPP_WARN(logger_, "Blackboard is not created to be cleaned");
+  }
+
+  for (const auto & error_code : error_code_names_) {
+    blackboard_->set<unsigned short>(error_code, 0);
+  }
+}
+
 }  // namespace nav2_behavior_tree
 
 #endif  // NAV2_BEHAVIOR_TREE__BT_ACTION_SERVER_IMPL_HPP_

@SteveMacenski, doesn't this approach break whole BT logic, I hope?

AlexeyMerzlyakov avatar Aug 22 '23 13:08 AlexeyMerzlyakov

The Problem1 root cause is still unclear. Some observations on it:

From BT side, when the robot reached its goal, NavigateToPoseNavigator calls succeeded_current() for its action server:

  switch (rc) {
    case nav2_behavior_tree::BtStatus::SUCCEEDED:
      RCLCPP_INFO(logger_, "Goal succeeded");
      action_server_->succeeded_current(result);
      break;

which should trigger SUCCEEDED status of the goal.

However, from WaypointFollower side, the WaypointFollower::resultCallback() gets UNKNOWN status of this result, which seems really strange:

  switch (result.code) {
    case rclcpp_action::ResultCode::SUCCEEDED:
      current_goal_status_.status = ActionStatus::SUCCEEDED;
      return;
    case rclcpp_action::ResultCode::ABORTED:
      current_goal_status_.status = ActionStatus::FAILED;
      current_goal_status_.error_code = result.result->error_code;
      return;
    case rclcpp_action::ResultCode::CANCELED:
      current_goal_status_.status = ActionStatus::FAILED;
      return;
    case rclcpp_action::ResultCode::UNKNOWN:  // Alexey's debugging
      RCLCPP_INFO(get_logger(), "Got UNKNOWN result from BT");
      current_goal_status_.status = ActionStatus::FAILED;
      return;

From first glance, it might seem that the issue to be in the RCLCPP. However, this problem appears every time when recovery RecoveryFallback switched during the navigation, and, opposite, does not appear during normal navigation subtree operation w/o recoveries. This makes me think, that during switching to recoveries something is happening with nav2_msgs::action::NavigateToPose action server-client chain (e.g. RCLCPP context was changed, new pointer (re)initiated, or something else).

AlexeyMerzlyakov avatar Aug 22 '23 13:08 AlexeyMerzlyakov

So this problem, I consider, appear because of we do not clean BT node statuses between two consecutive bt_->run() runs, and thus keeping errors from previous BT run.

Oh, that's a really good find Alexey! Yes, resetting the blackboard for the error codes is sensible at the end of a task and/or when halted due to cancellation of the navigation action.

+  if (!blackboard_) {
+    RCLCPP_WARN(logger_, "Blackboard is not created to be cleaned");
+  }

You should be able to skip this, that blackboard is part of the factory construction so it should always exist at this point.

However, this problem appears every time when recovery RecoveryFallback switched during the navigation, and, opposite, does not appear during normal navigation subtree operation w/o recoveries.

That is odd. The best that I can think of is that the WPF is sending the Nav2Pose to the BT which itself then sends the appropriate commands to the task servers. When we enter recovery, the task servers are halted when we execute recoveries and then re-tasked once we finish and start navigating again. So the commands to planner/controller/etc are re-issued and thus have a new UUID / goal handle. But we should still be in the same goal handle for Nav2Pose since all that happens internally to the BT which the Nav2Pose action server doesn't track.

Perhaps this is another error code issue or something we're we're propagating the error code from the halted task servers instead of the final, complete end result code? That's the only server-client changes I can see happening.

Note these shenanigans we have to do with checking that the goal ID is right https://github.com/ros-planning/navigation2/blob/main/nav2_waypoint_follower/src/waypoint_follower.cpp#L307-L312

Try changing the debug to an info in the simple action server to make sure that the current handle is actually active / we're returning success on it?

  void succeeded_current(
    typename std::shared_ptr<typename ActionT::Result> result =
    std::make_shared<typename ActionT::Result>())
  {
    std::lock_guard<std::recursive_mutex> lock(update_mutex_);

    if (is_active(current_handle_)) {
      debug_msg("Setting succeed on current goal.");  // Here and check logs
      current_handle_->succeed(result);
      current_handle_.reset();
    }
  }

SteveMacenski avatar Aug 22 '23 17:08 SteveMacenski

Problem1 bugfix:

I've added the checks and got that the goal_id was always the same for BT ActionServer and in resultCallback() for WPF each time during 3-waypoints test scenario phase:

[tester_node-16] [INFO] [1692773504.064704366] [nav2_waypoint_tester]: Received amcl_pose
[tester_node-16] [INFO] [1692773504.718885261] [nav2_waypoint_tester]: Received amcl_pose
[controller_server-8] [INFO] [1692773504.958734862] [controller_server]: Passing new path to controller.
[tester_node-16] [INFO] [1692773505.812664647] [nav2_waypoint_tester]: Received amcl_pose
[controller_server-8] [INFO] [1692773505.914895209] [controller_server]: Reached the goal!
[bt_navigator-12] [INFO] [1692773505.947715042] [bt_navigator]: Goal succeeded
[bt_navigator-12] ALEX: succeeded_current: result.goal_id = c3a927770364788626b154153d323c1
<- this goal_id was send to succeeded_current() by BT
[bt_navigator-12] [INFO] [1692773505.948040788] [bt_navigator]: ALEX: succeeded_current
[waypoint_follower-13] ALEX: WaypointFollower::resultCallback: result.goal_id = c3a927770364788626b154153d323c1
<- the same goal_id received from BT. No issue here

[waypoint_follower-13] [INFO] [1692773506.576328268] [waypoint_follower]: ALEX: got UNKNOWN result from BT
[waypoint_follower-13] [INFO] [1692773506.626355822] [waypoint_follower]: Failed to process waypoint 2, moving to next.
[waypoint_follower-13] [INFO] [1692773506.626475290] [waypoint_follower]: Completed all 3 waypoints requested.
[tester_node-16] [INFO] [1692773506.634029050] [nav2_waypoint_tester]: Goal failed to process all waypoints, missed 1 wps.
[tester_node-16] Traceback (most recent call last):
[tester_node-16]   File "/home/amerzlyakov/nav2_ws/src/navigation2/nav2_system_tests/src/waypoint_follower/tester.py", line 249, in <module>
[tester_node-16]     main()
[tester_node-16]   File "/home/amerzlyakov/nav2_ws/src/navigation2/nav2_system_tests/src/waypoint_follower/tester.py", line 214, in main
[tester_node-16]     assert result
[tester_node-16] AssertionError

So, this made me think that this is related to RCLCPP and check it. After some analysis, it was found the same problem described in https://github.com/ros2/rclcpp/issues/2101. RCLCPP considers the the result status not updated more than 10 seconds as UNKNOWN and returns it back to WPF in resultCallback(). This also explains, why the Problem1 appeared only on recoveries situation.

The problem did solved by increasing the result_timeout for BtActionServer to 30 seconds:

diff --git a/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server_impl.hpp b/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server_impl.hpp
index 11110b09..8782a996 100644
--- a/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server_impl.hpp
+++ b/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server_impl.hpp
@@ -109,12 +111,17 @@ bool BtActionServer<ActionT>::on_configure()
   // Support for handling the topic-based goal pose from rviz
   client_node_ = std::make_shared<rclcpp::Node>("_", options);
 
+  rcl_action_server_options_t server_options = rcl_action_server_get_default_options();
+  server_options.result_timeout.nanoseconds = RCL_S_TO_NS(30);  // 30 seconds timeout to get the result from BT
+
   action_server_ = std::make_shared<ActionServer>(
     node->get_node_base_interface(),
     node->get_node_clock_interface(),
     node->get_node_logging_interface(),
     node->get_node_waitables_interface(),
-    action_name_, std::bind(&BtActionServer<ActionT>::executeCallback, this));
+    action_name_, std::bind(&BtActionServer<ActionT>::executeCallback, this),
+    nullptr, std::chrono::milliseconds(500), false, server_options);
 
   // Get parameters for BT timeouts
   int timeout;

The value of 30 seconds was chosen by numerous experiments on 3-waypoints initial scenario runs and obtaining the timings required for recoveries on 3rd most difficult waypoint: Correct_timeout

(Sure, this timeout might be set as ROS-parameter for bt_navigator, rather than hardcoded above, if needed)


Also, it was found that due to short distances of navigation, the robot operates much more stable on prune_distance = 1.0 instead of default 2.0 meters value (see https://github.com/ros-planning/navigation2/pull/3098 for more details). Local experiments on https://github.com/ros-planning/navigation2/commit/822560ce551afde513deb8be5e5c6a750f8e6fc0 shows that probability of recovery situation in 3-waypoints scenario for waypoint follower system test as follows:

Runs Recoveries Recovery probability, %
Prune 1m 48 5 10%
Prune 2m (default) 46 14 30%

So, prune_distance and forward_prune_distance are considered be shortened for WaypointFollwer system test to 1.0 meter. Actually, the problem that these parameters are not described in nav2_params.yaml and RewrittenYaml can't replace non-existing parameters yet (this is some kind of stuck situation for now). The best solution - is to add new parameters declaring ability API to RewrittenYam, which will also help to avoid boilerplate works synchronizing Costmap Filters system tests YAML's with the main one.


Finally, I'd like to propose to treat UNKNOWN (default case) statuses in WPF main loop as failures, which will force WPF to fail, rather than hang for infinite time.

AlexeyMerzlyakov avatar Aug 23 '23 13:08 AlexeyMerzlyakov

After some analysis, it was found the same problem described in https://github.com/ros2/rclcpp/issues/2101. RCLCPP considers the the result status not updated more than 10 seconds as UNKNOWN and returns it back to WPF in resultCallback()

Oh what the hell... How does this not cause our end-users problems as well for the WPF, the controller server, and the BT task in general? Those are frequently running tasks longer than 10 seconds.

Any time limit for how long a task runs in the action server is too low since we could have navigation requests (or WPF, or controller, etc) that could take hours to complete. Lets set this to absolute max anywhere we send actions within Nav2 and inform users about this problem publicly // comment on that ticket for a default to be maxed out or disabled.

SteveMacenski avatar Aug 24 '23 18:08 SteveMacenski

I think that logic needs to be updated so its not based on a timeout but a timeout after an event (e.g. after the goal has been completed or aborted) so that we don't delete currently running goals. Can you file a ticket on that and tag me and @clalancette ?

SteveMacenski avatar Aug 24 '23 18:08 SteveMacenski

New PR #3785 with fix of the most of issues described in this ticket. The problem with RCLCPP action timeout to be handled in next separate step, I'd suppose.

AlexeyMerzlyakov avatar Aug 30 '23 11:08 AlexeyMerzlyakov

Got it, @pepisg messaged me yesterday on Slack regarding re-adding the 15 minute timeout on the WPF server specifically as a temporary stop gap. Since the controller updates goals each time its preempted and planning occurs quickly, I think its only an issue with the WPF internally to Nav2.

... but is definitely still an issue for users of Nav2's base action servers for Nav2Pose and NavThroughPoses, which bears some thought about a permanent solution in rclcpp. Is that something you're open to working on @AlexeyMerzlyakov ?

Thanks for your diligent work on this and uncovering an rclcpp issue!

SteveMacenski avatar Aug 30 '23 16:08 SteveMacenski

The params have been merged in, so that should be a good medium term solution until a more permanent change can be made in rclcpp itself. As such, I renamed the ticket

SteveMacenski avatar Sep 09 '23 17:09 SteveMacenski

@SteveMacenski @AlexeyMerzlyakov the fix is ready https://github.com/ros2/rcl/pull/1121, it would be appreciated if you can try and review! thanks in advance.

CC: @Barry-Xu-2018

fujitatomoya avatar Dec 06 '23 17:12 fujitatomoya

So much thanks @barry-xu-2018!

SteveMacenski avatar Dec 06 '23 19:12 SteveMacenski

@barry-xu-2018 can this be closed with your recent PR?

SteveMacenski avatar Dec 23 '23 05:12 SteveMacenski

@SteveMacenski

https://github.com/ros2/rcl/pull/1121 isn't merged by now.
Perhaps we need to wait for the merge before closing this issue.

Barry-Xu-2018 avatar Dec 25 '23 00:12 Barry-Xu-2018

@SteveMacenski just FYI, https://github.com/ros2/rcl/pull/1121 has been merged in rolling.

fujitatomoya avatar Jul 29 '24 18:07 fujitatomoya

@fujitatomoya are we safe to remove action_server_result_timeout now? What was the final solution we handed on for my curiosity?

SteveMacenski avatar Jul 31 '24 16:07 SteveMacenski

Closing as complete, leaving the result timeout for legacy but I'll file a ticket for removing this at Kilted

SteveMacenski avatar Sep 10 '24 22:09 SteveMacenski