rclpy
rclpy copied to clipboard
KeyError in ActionServer._execute_goal
Bug report
Required Info:
- Operating System:
- Ubuntu 22.04
- Installation type:
- Binaries (.deb package)
- Version or commit hash:
- 3.3.12-1jammy.20240217.065750
- DDS implementation:
- Fast-RTPS (2.6.7-1jammy.20240217.034748)
- Client library (if applicable):
- rclpy
Steps to reproduce issue
- Run an action server node with the
use_sim_time
parameter asTrue
- Have a node emit clock messages faster than wall clock moves (at x60)
- Call the action
Expected behavior
Action goal result is set without problem.
Actual behavior
The action server fails because self._result_futures[bytes(goal_uuid)].set_result(result_response)
in the ActionServer _execute_goal method raises a KeyError
exception:
Traceback (most recent call last):
File "/home/rbailonr/adauv_ws/install/adauv_bluebox/lib/adauv_bluebox/vehicle.py", line 333, in main
executor.spin()
File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 294, in spin
self.spin_once()
File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 794, in spin_once
self._spin_once_impl(timeout_sec)
File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 791, in _spin_once_impl
future.result()
File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/task.py", line 94, in result
raise self.exception()
File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/task.py", line 239, in __call__
self._handler.send(None)
File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/action/server.py", line 352, in _execute_goal
self._result_futures[bytes(goal_uuid)].set_result(result_response)
KeyError: b'EM$\xc6\xb3YA:\xa9\xa1_\xd1\xeb\x88\x06\x88'
Additional information
This occurrence of this behavior is inconsistent, but only happens in accelerated time.
It looks like a synchronization issue with ActionServer._execute_expire_goals
that removes the goal result future from the _result_futures
dictionary before _execute_goal
sets its result. Checking not self._result_futures[goal_uuid].done()
before removing the expired goal seems to mitigate the issue, but the actual source of the problem may be elsewhere.
It should not be a bug.
There is a Result caching
in action design (Please refer to https://design.ros2.org/articles/actions.html).
Users can decide how long the goal result will be retained. However, it is not guaranteed that the goal result will be consistently retained even if it has not been accessed.
https://github.com/ros2/rclpy/blob/220d714b2b6da81a4abd6a804e3ed4ee8cfd7c3f/rclpy/rclpy/action/server.py#L200
Have a node emit clock messages faster than wall clock moves (at x60)
@Barry-Xu-2018 this will make the result expired (removed) 60 times faster in the action server.
i think precisely that this situation could happen, but generating KeyError
exception is not designed behavior?
i think we can check if the key bytes(goal_uuid)
exists in _result_futures
dictionary, and, goal_handle
is also remove from _goal_handles
. probably it can print warning that says "Goal is already expired before completion", what do you think?
i think precisely that this situation could happen, but generating KeyError exception is not designed behavior?
Yes. Throwing an exception is not a correct behavior in this scenario.
i think we can check if the key bytes(goal_uuid) exists in _result_futures dictionary, and, goal_handle is also remove from _goal_handles. probably it can print warning that says "Goal is already expired before completion", what do you think?
Yes. I'll double-check the code logic to see if there are any other places that may need modification when this issue occurs.
https://github.com/ros2/rclpy/blob/220d714b2b6da81a4abd6a804e3ed4ee8cfd7c3f/rclpy/rclpy/action/server.py#L327-L353
https://github.com/ros2/rclpy/blob/220d714b2b6da81a4abd6a804e3ed4ee8cfd7c3f/rclpy/rclpy/action/server.py#L413-L419
These two parts involve asynchronous operations, so to completely avoid the issue, locking (asyncio.Lock) should be added when operating on self._result_futures
.
Thank you very much for your help.
In hindsight, some expiring timeout message was to be expected but not the action server failing. I have increased result_timeout
accordingly and the synchronization problem does not seem to appear anymore.