rclpy icon indicating copy to clipboard operation
rclpy copied to clipboard

KeyError in ActionServer._execute_goal

Open rafael1193 opened this issue 11 months ago • 5 comments

Bug report

Required Info:

  • Operating System:
    • Ubuntu 22.04
  • Installation type:
    • Binaries (.deb package)
  • Version or commit hash:
    • 3.3.12-1jammy.20240217.065750
  • DDS implementation:
    • Fast-RTPS (2.6.7-1jammy.20240217.034748)
  • Client library (if applicable):
    • rclpy

Steps to reproduce issue

  • Run an action server node with the use_sim_time parameter as True
  • Have a node emit clock messages faster than wall clock moves (at x60)
  • Call the action

Expected behavior

Action goal result is set without problem.

Actual behavior

The action server fails because self._result_futures[bytes(goal_uuid)].set_result(result_response) in the ActionServer _execute_goal method raises a KeyError exception:

Traceback (most recent call last):
  File "/home/rbailonr/adauv_ws/install/adauv_bluebox/lib/adauv_bluebox/vehicle.py", line 333, in main
    executor.spin()
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 294, in spin
    self.spin_once()
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 794, in spin_once
    self._spin_once_impl(timeout_sec)
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 791, in _spin_once_impl
    future.result()
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/task.py", line 94, in result
    raise self.exception()
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/task.py", line 239, in __call__
    self._handler.send(None)
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/action/server.py", line 352, in _execute_goal
    self._result_futures[bytes(goal_uuid)].set_result(result_response)
KeyError: b'EM$\xc6\xb3YA:\xa9\xa1_\xd1\xeb\x88\x06\x88'

Additional information

This occurrence of this behavior is inconsistent, but only happens in accelerated time.

It looks like a synchronization issue with ActionServer._execute_expire_goals that removes the goal result future from the _result_futures dictionary before _execute_goal sets its result. Checking not self._result_futures[goal_uuid].done() before removing the expired goal seems to mitigate the issue, but the actual source of the problem may be elsewhere.

rafael1193 avatar Feb 29 '24 15:02 rafael1193

It should not be a bug.
There is a Result caching in action design (Please refer to https://design.ros2.org/articles/actions.html).
Users can decide how long the goal result will be retained. However, it is not guaranteed that the goal result will be consistently retained even if it has not been accessed.

https://github.com/ros2/rclpy/blob/220d714b2b6da81a4abd6a804e3ed4ee8cfd7c3f/rclpy/rclpy/action/server.py#L200

Barry-Xu-2018 avatar Mar 01 '24 02:03 Barry-Xu-2018

Have a node emit clock messages faster than wall clock moves (at x60)

@Barry-Xu-2018 this will make the result expired (removed) 60 times faster in the action server. i think precisely that this situation could happen, but generating KeyError exception is not designed behavior?

i think we can check if the key bytes(goal_uuid) exists in _result_futures dictionary, and, goal_handle is also remove from _goal_handles. probably it can print warning that says "Goal is already expired before completion", what do you think?

fujitatomoya avatar Mar 01 '24 06:03 fujitatomoya

i think precisely that this situation could happen, but generating KeyError exception is not designed behavior?

Yes. Throwing an exception is not a correct behavior in this scenario.

i think we can check if the key bytes(goal_uuid) exists in _result_futures dictionary, and, goal_handle is also remove from _goal_handles. probably it can print warning that says "Goal is already expired before completion", what do you think?

Yes. I'll double-check the code logic to see if there are any other places that may need modification when this issue occurs.

Barry-Xu-2018 avatar Mar 01 '24 09:03 Barry-Xu-2018

https://github.com/ros2/rclpy/blob/220d714b2b6da81a4abd6a804e3ed4ee8cfd7c3f/rclpy/rclpy/action/server.py#L327-L353

https://github.com/ros2/rclpy/blob/220d714b2b6da81a4abd6a804e3ed4ee8cfd7c3f/rclpy/rclpy/action/server.py#L413-L419

These two parts involve asynchronous operations, so to completely avoid the issue, locking (asyncio.Lock) should be added when operating on self._result_futures.

Barry-Xu-2018 avatar Mar 01 '24 09:03 Barry-Xu-2018

Thank you very much for your help.

In hindsight, some expiring timeout message was to be expected but not the action server failing. I have increased result_timeout accordingly and the synchronization problem does not seem to appear anymore.

rafael1193 avatar Mar 04 '24 12:03 rafael1193