Race condition in `ActionClient.*_async` functions which results in a future that never completes
Bug report
Hi all, thank you for maintaining this wonderful library!
I've been encountering an intermittent bug that has left me scratching my head. Unfortunately, it recently started occurring more often, so I sat down and spent some time tracking down what was going on. Now I have a good idea of the problem, but I need some more expert eyes to decide on a good solution. I'll lay out all my findings and some possible paths forward in this bug report.
Required Info:
- Operating System:
  - Ubuntu 22.04, within a Docker image that is also running Ubuntu 22.04
- Installation type:
  - Binaries, as installed from the ros2 apt repositories
- Version or commit hash:
  - rclpy==3.3.7, python3.10
- DDS implementation:
  - CycloneDDS
- Client library (if applicable):
  - rclpy==3.3.7, python3.10
Steps to reproduce issue
There are multiple places where this race condition manifests itself. I will demonstrate how to reproduce it when calling `send_goal_async`. Take `ActionClient.send_goal_async` and add a `sleep(#)` directly below the line `sequence_number = self._client_handle.send_goal_request(request)`. For example:
```python
def send_goal_async(self, goal, feedback_callback=None, goal_uuid=None):
    """
    ...
    """
    if not isinstance(goal, self._action_type.Goal):
        raise TypeError()

    request = self._action_type.Impl.SendGoalService.Request()
    request.goal_id = (
        self._generate_random_uuid() if goal_uuid is None else goal_uuid
    )
    request.goal = goal
    sequence_number = self._client_handle.send_goal_request(request)
    sleep(5)  # <--- this line causes this race condition nearly 100% of the time
    if sequence_number in self._pending_goal_requests:
        raise RuntimeError(
            "Sequence ({}) conflicts with pending goal request".format(
                sequence_number
            )
        )

    if feedback_callback is not None:
        # TODO(jacobperron): Move conversion function to a general-use package
        goal_uuid = bytes(request.goal_id.uuid)
        self._feedback_callbacks[goal_uuid] = feedback_callback

    future = Future()
    self._pending_goal_requests[sequence_number] = future
    self._goal_sequence_number_to_goal_id[sequence_number] = request.goal_id
    future.add_done_callback(self._remove_pending_goal_request)
    # Add future so executor is aware
    self.add_future(future)

    return future
```
Expected behavior
I would expect `send_goal_async` to run a bit slower than usual, then return a future. That future might complete quite quickly (or already be complete).
Actual behavior
The future never completes, and I see the following line in the logs:
Ignoring unexpected goal response. There may be more than one action server for the action '{ACTION_NAME}'
Additional information
By my sleuthing, it appears that this error is caused by the following important lines being called:

client:Thread1

1. `send_goal_async` is called, and within it:
2. `sequence_number = self._client_handle.send_goal_request(request)`
3. `self._goal_sequence_number_to_goal_id[sequence_number] = request.goal_id`

client:Thread2

1. A response to the goal request is received in the client C bindings, causing `is_ready(...)` to return `True`
2. `execute(...)` is called
3. `if sequence_number in self._goal_sequence_number_to_goal_id:` evaluates to `False`, because client:Thread1 has not reached its step 3
4. The `taken_data` is ignored, and thus the future will never have `set_result` called

server:Thread1

1. A goal request is received, and responded to
Cause Summary
There is a race condition when sending goal requests, because the goal request can complete before the `sequence_number` is tracked in the `self._goal_sequence_number_to_goal_id` dictionary. Thus `execute` is called before the sequence number is tracked, and the `taken_data` is ignored, assumed to be a response from a duplicate action server.
I believe this same race condition affects all usages of the client API, such as `_cancel_goal_async` or `_get_result_async`.
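To make this interleaving concrete, here is a minimal standalone sketch of the same pattern (pure Python, no rclpy; the names and timings below are made up purely for illustration):

```python
import threading
import time

# Stand-ins for the ActionClient's Python-side state (names are illustrative only).
sequence_to_goal_id = {}   # plays the role of _goal_sequence_number_to_goal_id
responses = []             # plays the role of data taken from the C bindings


def send_goal(sequence_number, goal_id):
    # "Send" the request; the lower layer now knows the sequence number...
    responses.append(sequence_number)
    time.sleep(0.1)        # ...and this is the window in which the response can arrive
    # Only now does the Python side record the sequence number.
    sequence_to_goal_id[sequence_number] = goal_id


def execute():
    time.sleep(0.05)       # the response arrives before send_goal() finished its bookkeeping
    sequence_number = responses.pop()
    if sequence_number in sequence_to_goal_id:
        print("Matched response to goal", sequence_to_goal_id[sequence_number])
    else:
        print("Ignoring unexpected goal response")  # the symptom seen in the rclpy logs


threads = [threading.Thread(target=send_goal, args=(1, "goal-1")),
           threading.Thread(target=execute)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Run as-is, the lookup loses the race and the "unexpected response" branch is taken.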
Recommendations
- Put a lock around every usage of `self._client_handle` so that Python-side objects can be edited and kept in sync. That way, if `execute` is called, it will block until `self._client_handle` isn't being used and `_goal_sequence_number_to_goal_id` will be filled in (see the sketch after this list).
- Alternatively, create locks for the various in-memory dictionaries that are used in `execute`:
  - `_goal_handles`
  - `_pending_goal_requests`
  - `_goal_sequence_number_to_goal_id`
  - `_pending_cancel_requests`
  - `_pending_result_requests`
  - `_result_sequence_number_to_goal_id`
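As a sketch of the first recommendation applied to the toy example above (the lock here is illustrative; in rclpy it would have to guard every use of `self._client_handle` as well as the bookkeeping dictionaries), a single shared lock makes the request and its bookkeeping atomic with respect to the response handler:

```python
import threading
import time

client_lock = threading.Lock()  # illustrative stand-in for a lock around self._client_handle
sequence_to_goal_id = {}
responses = []


def send_goal(sequence_number, goal_id):
    with client_lock:
        responses.append(sequence_number)
        time.sleep(0.1)  # the response may already be waiting, but execute() cannot run yet
        sequence_to_goal_id[sequence_number] = goal_id


def execute():
    time.sleep(0.05)
    with client_lock:    # blocks until send_goal() has finished its bookkeeping
        sequence_number = responses.pop()
        if sequence_number in sequence_to_goal_id:
            print("Matched response to goal", sequence_to_goal_id[sequence_number])
        else:
            print("Ignoring unexpected goal response")


threads = [threading.Thread(target=send_goal, args=(1, "goal-1")),
           threading.Thread(target=execute)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the lock in place the same interleaving always resolves to the matched branch, because the response handler cannot run between sending the request and recording the sequence number.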
Hi @apockill, I ran into the same issue. I was wondering whether adding locks around the dictionary `_goal_sequence_number_to_goal_id` will solve the problem or not. If there is a lock, it prevents the data being written and read simultaneously, but it doesn't synchronize the fact that `self._goal_sequence_number_to_goal_id[sequence_number] = request.goal_id` in `send_goal_async` needs to happen before the `if sequence_number in self._goal_sequence_number_to_goal_id` check in `execute`.
I believe that to synchronize that, we would need an event that is set only in `send_goal_async` and waited on in `execute`.
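For what it's worth, a bare-bones sketch of that event idea (pure Python; everything here is hypothetical, and a single event only covers one in-flight request at a time):

```python
import threading

goal_id_recorded = threading.Event()  # hypothetical: would need one per in-flight request
sequence_to_goal_id = {}


def send_goal(sequence_number, goal_id):
    # ... send the request through the C bindings here ...
    sequence_to_goal_id[sequence_number] = goal_id
    goal_id_recorded.set()            # signal that the bookkeeping is done


def handle_response(sequence_number):
    # Wait (with a safety-net timeout) until send_goal() has recorded the mapping.
    if not goal_id_recorded.wait(timeout=1.0):
        raise RuntimeError("goal id was never recorded")
    return sequence_to_goal_id[sequence_number]


threading.Thread(target=send_goal, args=(1, "goal-1")).start()
print(handle_response(1))  # prints "goal-1" even if the response is handled first
```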
Could a lock around the entire `send_goal_async` and `execute` overcome this? I tested that in this diff on my setup and did not see the race condition.
Hi @apockill, do you think this issue is also present in the rclpy `ActionServer` class? I have an action server in Python and I have seen some requests getting dropped randomly.
I haven't personally noticed that, but I can't say I've spent much time looking for race conditions on the server side. What logs have you seen printed when a request gets dropped, and are you sure it's not just a repeat of the client side dropping the ball?
Any thoughts on the best way to resolve this bug? At the moment this affects all users of ActionClients. I can contribute better once there's guidance from maintainers :grin:
It seems like the change I attempted didn't fully fix the problem; I still see the following on some compute platforms:
Ignoring unexpected goal response. There may be more than one action server for the action
Ignoring unexpected result response. There may be more than one action server for the action
I tried looking deeper in `client.py`, and it seems a simple condition/lock won't solve the problem, because it doesn't guarantee that the dictionaries are updated before they are checked in `async def execute`. So I attempted to use a `threading.Event` to synchronize the dictionary writes and reads. However, the goal, result, and cancel cases in `async def execute` each need their own events, to avoid a write for one wrongly triggering the read of another. Trying this now in the existing PR https://github.com/ros2/rclpy/pull/1125.
It still doesn't seem very clean; another possible solution could be to use multiple queues to send the dictionary entries between threads instead of referencing them.
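For reference, a rough sketch of that queue idea (pure Python; all names are made up). The sending thread only enqueues the mapping and never touches the executor's dictionary; note that on its own this removes the shared mutation but does not close the ordering window, unless the reader is willing to block on the queue.

```python
import queue

pending_goal_ids = queue.Queue()  # (sequence_number, goal_id) pairs published by the sender
sequence_to_goal_id = {}          # owned exclusively by the executor thread


def send_goal(sequence_number, goal_id):
    # The sender never mutates the executor's dictionary directly.
    pending_goal_ids.put((sequence_number, goal_id))


def handle_response(sequence_number):
    # Drain everything the sender has published so far before the lookup.
    while True:
        try:
            seq, goal_id = pending_goal_ids.get_nowait()
        except queue.Empty:
            break
        sequence_to_goal_id[seq] = goal_id
    # A miss here would still mean "unexpected response" if the sender hasn't
    # enqueued yet, so the ordering problem needs to be solved separately.
    return sequence_to_goal_id.get(sequence_number)


send_goal(1, "goal-1")
print(handle_response(1))  # prints "goal-1"
```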
I don't fully understand how a lock doesn't fix the issue in this case. Are you certain you were running the code with the lock when you observed this issue? 😅. It guarantees the dictionaries are updated at least in the send_goal_async case, because the execute function can't receive a sequence number until the client has been used in send_goal_async (which is inside a lock). Unless I am misunderstanding this race condition?
I have a similar fix on my end (I subclass ActionClient and lock the entirety of send_goal_async and execute) and it has solved the problem for me.
Hello @apockill! Other than the `sleep()` example, what is your use case where you're seeing this bug? Is it a flaky one?
Hi @adityapande-1995! I gave the sleep example just to show how this race condition is purely caused by flaky timing, and that luck currently prevents most users from seeing it. A much, much smaller wait can also trigger it.
I'm seeing this flakiness happen regularly on my stack. We call actions fairly often (maybe 1-3 actions per second, on average). This bug will reveal itself sometimes minutes after startup, other times within a few hours of runtime.
What mainly concerns me is that it's purely timing luck that this works at all, and the error message is confusing enough that other users might run into this but not know what happened. After all, the symptom is a confusing log message and a future that will never complete.
If we could find a way to protect the relevant Python state tied to `sequence_number`s, I believe that would solve the problem.
> I don't fully understand how a lock doesn't fix the issue in this case. Are you certain you were running the code with the lock when you observed this issue? 😅. It guarantees the dictionaries are updated at least in the send_goal_async case, because the execute function can't receive a sequence number until the client has been used in send_goal_async (which is inside a lock). Unless I am misunderstanding this race condition?
> I have a similar fix on my end (I subclass ActionClient and lock the entirety of send_goal_async and execute) and it has solved the problem for me.
You are right, the lock does seem to solve the issue. Once I fixed the issue caused by the race condition, I had another unrelated bug in my setup which misled me into believing that the issue was not fixed.
@adityapande-1995, adding another viewpoint here: dictionaries like `_goal_sequence_number_to_goal_id` are being edited and read in two threads: the thread the user sends the goal request in, and the thread in which the node spins, which calls `async def execute`. So the race condition here is evident from the code itself.
To others who think they might be running into this issue, here's a quick hack you can use that doesn't require building rclpy:
```python
from threading import RLock
from typing import Any

from rclpy.action import ActionClient
from rclpy.task import Future


class PatchRclpyIssue1123(ActionClient):

    _lock: RLock = None  # type: ignore

    @property
    def _cpp_client_handle_lock(self) -> RLock:
        # Lazily create one re-entrant lock shared by every method below.
        if self._lock is None:
            self._lock = RLock()
        return self._lock

    async def execute(self, *args: Any, **kwargs: Any) -> None:
        with self._cpp_client_handle_lock:
            return await super().execute(*args, **kwargs)  # type: ignore

    def send_goal_async(self, *args: Any, **kwargs: Any) -> Future:
        with self._cpp_client_handle_lock:
            return super().send_goal_async(*args, **kwargs)

    def _cancel_goal_async(self, *args: Any, **kwargs: Any) -> Future:
        with self._cpp_client_handle_lock:
            return super()._cancel_goal_async(*args, **kwargs)

    def _get_result_async(self, *args: Any, **kwargs: Any) -> Future:
        with self._cpp_client_handle_lock:
            return super()._get_result_async(*args, **kwargs)
```
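If it helps, drop-in usage looks something like this (the node name, action type, and action name below are just placeholders):

```python
from rclpy.node import Node

from custom_msgs.action import Home as SomeMsg  # placeholder action type


class PatchedClientNode(Node):

    def __init__(self):
        super().__init__("patched_action_client")
        # Same constructor signature as ActionClient, so it can be swapped in directly.
        # PatchRclpyIssue1123 is the workaround class defined above.
        self._action_client = PatchRclpyIssue1123(self, SomeMsg, "some_action")
```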
On another note, @adityapande-1995: thoughts on a solution where there's some decorator called `uses_cpp_client` which can wrap all of the functions in `ActionClient` that use the underlying C++ action client?
For example:
```python
import functools
from typing import Any


def uses_cpp_client(fn):
    # Wrap a method so the shared lock is held while the C++ client is in use.
    @functools.wraps(fn)
    def wrapper(self, *args: Any, **kwargs: Any) -> Any:
        with self._cpp_client_handle_lock:
            return fn(self, *args, **kwargs)

    return wrapper


class ActionClient(Waitable):
    ...

    @uses_cpp_client
    def send_goal_async(...):
        ...

    @uses_cpp_client
    def _cancel_goal_async(...):
        ...

    @uses_cpp_client
    def _get_result_async(...):
        ...

    @async_uses_cpp_client
    async def execute(...):
        ...
```
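The `async_uses_cpp_client` referenced above isn't spelled out here; a hypothetical version, mirroring the PatchRclpyIssue1123 workaround earlier in this thread (which holds the `RLock` across the awaited call), could look like:

```python
import functools
from typing import Any


def async_uses_cpp_client(fn):
    # Same idea as uses_cpp_client, but for coroutine methods such as execute().
    @functools.wraps(fn)
    async def wrapper(self, *args: Any, **kwargs: Any) -> Any:
        # Holding a plain RLock across the await mirrors the workaround class above;
        # it assumes execute() finishes its work without yielding back to another
        # task that needs the same lock on this thread.
        with self._cpp_client_handle_lock:
            return await fn(self, *args, **kwargs)

    return wrapper
```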
I was looking at the code for service clients, and it also uses a lock to sync the C++ state (`self.__client` calls) with the Python state (`self.__pending_requests`). I think this highlights that the issue on the action client side is just an oversight, and that using locks is the way to go.
I'll submit a PR!
@apockill thanks for being patient on this.
> I think this highlights that the issue on the action client side is just an oversight, and that using locks is the way to go.

I think that is the correct assumption here. `_pending_requests` is protected by `_lock` in the service client.
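For anyone following along, the shape of that pattern, paraphrased rather than copied from the rclpy source (the handle and method names below are invented), is roughly: take the lock, call the handle, and record the sequence number before releasing it.

```python
import threading
from typing import Any, Dict

from rclpy.task import Future


class LockedPendingRequests:
    """Paraphrased sketch of the service-client locking pattern, not the real rclpy code."""

    def __init__(self, handle: Any) -> None:
        self._handle = handle          # invented stand-in for the C-level client handle
        self._lock = threading.Lock()
        self._pending: Dict[int, Future] = {}

    def call_async(self, request: Any) -> Future:
        future = Future()
        with self._lock:
            # The sequence number is recorded before the lock is released, so the
            # response path below can never observe a request it doesn't know about.
            sequence_number = self._handle.send_request(request)
            self._pending[sequence_number] = future
        return future

    def handle_response(self, sequence_number: int, response: Any) -> None:
        with self._lock:
            future = self._pending.pop(sequence_number, None)
        if future is not None:
            future.set_result(response)
```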
Sorry for the delay in looking at this.
I certainly can see how, in theory, we can get into this situation. However, so far I've been unable to replicate the problem as originally stated on the `rolling` branch. What I did was to edit `action/client.py` to add in the 5 second delay. Then I launched `ros2 run action_tutorials_py fibonacci_action_server` in one terminal, and `ros2 run action_tutorials_py fibonacci_action_client` in a second terminal. I do indeed see the 5 second delay on the client side, but I also see the client eventually catch up and receive the output.
Does someone have a relatively simple example that I could use to try to reproduce this?
Hi @clalancette, thank you for looking into this! I've completely forgotten to submit a PR for this.
Do you think that the difference might be in the executor? Come to think of it, I'm guessing this would only occur in a MultiThreadedExecutor case, since a SingleThreadedExecutor wouldn't be able to have a race condition like this.
> Do you think that the difference might be in the executor? Come to think of it, I'm guessing this would only occur in a MultiThreadedExecutor case, since a SingleThreadedExecutor wouldn't be able to have a race condition like this.
Yeah, that very well may be the case. If there is a simple example I can use to try to reproduce it, I'm certainly happy to take a look and review a PR.
Absolutely! Here's an example that replicated it on my end. You might have to switch out `custom_msgs` for some other action that takes no parameters.
client.py
```python
from time import sleep

import rclpy
from rclpy.action import ActionClient
from rclpy.callback_groups import MutuallyExclusiveCallbackGroup
from rclpy.executors import MultiThreadedExecutor
from rclpy.node import Node
from rclpy.task import Future

from custom_msgs.action import Home as SomeMsg


class RaceyActionClient(ActionClient):

    def send_goal_async(self, goal, feedback_callback=None, goal_uuid=None):
        """
        Send a goal and asynchronously get the result.

        The result of the returned Future is set to a ClientGoalHandle when receipt of the goal
        is acknowledged by an action server.

        :param goal: The goal request.
        :type goal: action_type.Goal
        :param feedback_callback: Callback function for feedback associated with the goal.
        :type feedback_callback: function
        :param goal_uuid: Universally unique identifier for the goal.
            If None, then a random UUID is generated.
        :type: unique_identifier_msgs.UUID
        :return: a Future instance to a goal handle that completes when the goal request
            has been accepted or rejected.
        :rtype: :class:`rclpy.task.Future` instance
        :raises: TypeError if the type of the passed goal isn't an instance of
            the Goal type of the provided action when the service was
            constructed.
        """
        if not isinstance(goal, self._action_type.Goal):
            raise TypeError()

        request = self._action_type.Impl.SendGoalService.Request()
        request.goal_id = (
            self._generate_random_uuid() if goal_uuid is None else goal_uuid
        )
        request.goal = goal
        sequence_number = self._client_handle.send_goal_request(request)
        sleep(5)  # <--- this line causes this race condition nearly 100% of the time
        if sequence_number in self._pending_goal_requests:
            raise RuntimeError(
                "Sequence ({}) conflicts with pending goal request".format(
                    sequence_number
                )
            )

        if feedback_callback is not None:
            # TODO(jacobperron): Move conversion function to a general-use package
            goal_uuid = bytes(request.goal_id.uuid)
            self._feedback_callbacks[goal_uuid] = feedback_callback

        future = Future()
        self._pending_goal_requests[sequence_number] = future
        self._goal_sequence_number_to_goal_id[sequence_number] = request.goal_id
        future.add_done_callback(self._remove_pending_goal_request)
        # Add future so executor is aware
        self.add_future(future)

        return future


class SomeActionClient(Node):

    def __init__(self):
        super().__init__("action_client")
        self._action_client = RaceyActionClient(
            self,
            SomeMsg,
            "some_action",
            callback_group=MutuallyExclusiveCallbackGroup(),
        )
        self.call_action_timer = self.create_timer(
            timer_period_sec=10,
            callback=self.send_goal,
            callback_group=MutuallyExclusiveCallbackGroup(),
        )

    def send_goal(self):
        goal_msg = SomeMsg.Goal()
        # self._action_client.wait_for_server()
        result = self._action_client.send_goal(goal_msg)
        print(result)


def main(args=None):
    rclpy.init(args=args)
    action_client = SomeActionClient()
    executor = MultiThreadedExecutor()
    rclpy.spin(action_client, executor)


if __name__ == "__main__":
    main()
```
server.py
```python
import rclpy
from rclpy.action import ActionServer
from rclpy.node import Node

from custom_msgs.action import Home as SomeMsg


class SomeActionServer(Node):

    def __init__(self):
        super().__init__("fibonacci_action_server")
        self._action_server = ActionServer(
            self, SomeMsg, "some_action", self.execute_callback
        )

    def execute_callback(self, goal_handle):
        self.get_logger().info("Executing goal...")
        result = SomeMsg.Result()
        return result


def main(args=None):
    rclpy.init(args=args)
    action_server = SomeActionServer()
    rclpy.spin(action_server)


if __name__ == "__main__":
    main()
```
Replication instructions:
- Run `server.py` in one terminal
- Run `client.py` in another terminal

Observe the `client.py` calls failing. The timer fires every 10 seconds, but the goal only ends up being sent once, since the race condition occurs and locks up a thread:
[WARN] [1691123997.104652764] [action_client]: Ignoring unexpected goal response. There may be more than one action server for the action 'some_action'
Are there any updates on this? At our company, we are also experiencing the same issue.
I am also seeing this when the system is under load. Using the same callback group (Reentrant) did not solve the issue. I will try the other suggested workaround.
Is this supposed to be fixed in the current Humble release or not, and what is the progress? This is a showstopper for the application I have...
Still experiencing this issue on the current Humble release.
In my case, I encountered this in the NavigateToPose action from nav2. If I call this action sequentially (for several poses) and insert a `time.sleep` between them to run some routine after one goal has succeeded, then sometimes this error (`Ignoring unexpected goal response`) occurs.
Looks like the issue is also still present in Iron. I agree with the other posters that this is an absolute showstopper for the use of actions in any industrial system and should finally be prioritized by the ROS 2 core team.
@clalancette, if there is anything that can be done from the user side to help fix this issue, please let us know!
The rclcpp side, by the way, finally seems to be fixed by https://github.com/ros2/rclcpp/pull/2531 (at least in Iron).
+1 we're also encountering this error. We're on Humble.
Also still encountering this error on Humble...