Race condition in `ActionClient.*_async` functions which results in a future that never completes
Bug report
Hi all, thank you for maintaining this wonderful library!
I've been encountering an intermittent bug that has left me scratching my head. Unfortunately, it recently started occurring more often, so I sat down and spent some time tracking down what was going on. Now I have a good idea of the problem, but I need some more expert eyes to decide on a good solution. I'll lay out all my findings and some possible paths forward in this bug report.
Required Info:
- Operating System:
  - Ubuntu 22.04, within a Docker image that is also running Ubuntu 22.04
- Installation type:
  - Binaries, as installed from the ros2 apt repositories
- Version or commit hash:
  - rclpy==3.3.7, python3.10
- DDS implementation:
  - CycloneDDS
- Client library (if applicable):
  - rclpy==3.3.7, python3.10
Steps to reproduce issue
There are multiple places where this race condition manifests itself. I will demonstrate how to reproduce it when calling `send_goal_async`. Take `ActionClient.send_goal_async` and add a `sleep(#)` directly below the line `sequence_number = self._client_handle.send_goal_request(request)`. For example:
```python
def send_goal_async(self, goal, feedback_callback=None, goal_uuid=None):
    """
    ...
    """
    if not isinstance(goal, self._action_type.Goal):
        raise TypeError()

    request = self._action_type.Impl.SendGoalService.Request()
    request.goal_id = (
        self._generate_random_uuid() if goal_uuid is None else goal_uuid
    )
    request.goal = goal
    sequence_number = self._client_handle.send_goal_request(request)
    sleep(5)  # <--- this line causes this race condition nearly 100% of the time
    if sequence_number in self._pending_goal_requests:
        raise RuntimeError(
            "Sequence ({}) conflicts with pending goal request".format(
                sequence_number
            )
        )

    if feedback_callback is not None:
        # TODO(jacobperron): Move conversion function to a general-use package
        goal_uuid = bytes(request.goal_id.uuid)
        self._feedback_callbacks[goal_uuid] = feedback_callback

    future = Future()
    self._pending_goal_requests[sequence_number] = future
    self._goal_sequence_number_to_goal_id[sequence_number] = request.goal_id
    future.add_done_callback(self._remove_pending_goal_request)
    # Add future so executor is aware
    self.add_future(future)

    return future
```
Expected behavior
I would expect `send_goal_async` to run a bit slower than usual, then return a future. That future might complete quite quickly (or already be complete).
Actual behavior
The future never completes, and I see the following line in the logs:
Ignoring unexpected goal response. There may be more than one action server for the action '{ACTION_NAME}'
Additional information
By my sleuthing, it appears that this error is caused by the following important lines being called:

client:Thread1

1. `send_goal_async` is called, and within it:
2. `sequence_number = self._client_handle.send_goal_request(request)`
3. `self._goal_sequence_number_to_goal_id[sequence_number] = request.goal_id`

client:Thread2

1. A response to the goal request is received in the client C bindings, causing `is_ready(...)` to return `True`
2. `execute(...)` is called
3. `if sequence_number in self._goal_sequence_number_to_goal_id:` evaluates to `False`, because client:Thread1 has not reached its step 3
4. The `taken_data` is ignored, and thus the future will never have `set_result` called

server:Thread1

1. A goal request is received, and responded to
Cause Summary
There is a race condition when sending goal requests, because the goal request can complete before the `sequence_number` is tracked in the `self._goal_sequence_number_to_goal_id` dictionary. Thus `execute` is called before the sequence number is tracked, and the `taken_data` is ignored, assumed to be a response from a duplicate action server.
I believe this same race condition affects all usages of the client API, such as `_cancel_goal_async` or `_get_result_async`.
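To make this interleaving concrete, here is a minimal standalone sketch of the same pattern (pure Python, no rclpy; the names and timings below are made up purely for illustration):

```python
import threading
import time

# Stand-ins for the ActionClient's Python-side state (names are illustrative only).
sequence_to_goal_id = {}   # plays the role of _goal_sequence_number_to_goal_id
responses = []             # plays the role of data taken from the C bindings


def send_goal(sequence_number, goal_id):
    # "Send" the request; the lower layer now knows the sequence number...
    responses.append(sequence_number)
    time.sleep(0.1)        # ...and this is the window in which the response can arrive
    # Only now does the Python side record the sequence number.
    sequence_to_goal_id[sequence_number] = goal_id


def execute():
    time.sleep(0.05)       # the response arrives before send_goal() finished its bookkeeping
    sequence_number = responses.pop()
    if sequence_number in sequence_to_goal_id:
        print("Matched response to goal", sequence_to_goal_id[sequence_number])
    else:
        print("Ignoring unexpected goal response")  # the symptom seen in the rclpy logs


threads = [threading.Thread(target=send_goal, args=(1, "goal-1")),
           threading.Thread(target=execute)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Run as-is, the lookup loses the race and the "unexpected response" branch is taken.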
Recommendations
- Put a lock around every usage of `self._client_handle` so that Python-side objects can be edited and kept in sync. That way, if `execute` is called, it will block until `self._client_handle` isn't being used and `_goal_sequence_number_to_goal_id` will be filled in (see the sketch after this list).
- Alternatively, create locks for the various in-memory dictionaries that are used in `execute`:
  - `_goal_handles`
  - `_pending_goal_requests`
  - `_goal_sequence_number_to_goal_id`
  - `_pending_cancel_requests`
  - `_pending_result_requests`
  - `_result_sequence_number_to_goal_id`
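As a sketch of the first recommendation applied to the toy example above (the lock here is illustrative; in rclpy it would have to guard every use of `self._client_handle` as well as the bookkeeping dictionaries), a single shared lock makes the request and its bookkeeping atomic with respect to the response handler:

```python
import threading
import time

client_lock = threading.Lock()  # illustrative stand-in for a lock around self._client_handle
sequence_to_goal_id = {}
responses = []


def send_goal(sequence_number, goal_id):
    with client_lock:
        responses.append(sequence_number)
        time.sleep(0.1)  # the response may already be waiting, but execute() cannot run yet
        sequence_to_goal_id[sequence_number] = goal_id


def execute():
    time.sleep(0.05)
    with client_lock:    # blocks until send_goal() has finished its bookkeeping
        sequence_number = responses.pop()
        if sequence_number in sequence_to_goal_id:
            print("Matched response to goal", sequence_to_goal_id[sequence_number])
        else:
            print("Ignoring unexpected goal response")


threads = [threading.Thread(target=send_goal, args=(1, "goal-1")),
           threading.Thread(target=execute)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the lock in place the same interleaving always resolves to the matched branch, because the response handler cannot run between sending the request and recording the sequence number.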
Hi @apockill, I ran into the same issue. I was wondering whether adding locks around the dictionary `_goal_sequence_number_to_goal_id` will solve the problem or not. If there is a lock, it prevents the data being written and read simultaneously, but it doesn't synchronize the fact that `self._goal_sequence_number_to_goal_id[sequence_number] = request.goal_id` in `send_goal_async` needs to happen before the `if sequence_number in self._goal_sequence_number_to_goal_id` check in `execute`.
I believe that to synchronize that, we would need an event that is set only in `send_goal_async` and waited on in `execute`.
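For what it's worth, a bare-bones sketch of that event idea (pure Python; everything here is hypothetical, and a single event only covers one in-flight request at a time):

```python
import threading

goal_id_recorded = threading.Event()  # hypothetical: would need one per in-flight request
sequence_to_goal_id = {}


def send_goal(sequence_number, goal_id):
    # ... send the request through the C bindings here ...
    sequence_to_goal_id[sequence_number] = goal_id
    goal_id_recorded.set()            # signal that the bookkeeping is done


def handle_response(sequence_number):
    # Wait (with a safety-net timeout) until send_goal() has recorded the mapping.
    if not goal_id_recorded.wait(timeout=1.0):
        raise RuntimeError("goal id was never recorded")
    return sequence_to_goal_id[sequence_number]


threading.Thread(target=send_goal, args=(1, "goal-1")).start()
print(handle_response(1))  # prints "goal-1" even if the response is handled first
```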
Could a lock around the entire `send_goal_async` and `execute` overcome this? I tested that in this diff on my setup and did not see the race condition.
Hi @apockill, do you think this issue is also present in the rclpy `ActionServer` class? I have an action server in Python and I have seen some requests getting dropped randomly.
I haven't personally noticed that, but I can't say I've spent much time looking for race conditions on the server side. What logs have you seen printed when a request gets dropped, and are you sure it's not just a repeat of the client side dropping the ball?
Any thoughts on the best way to resolve this bug? At the moment this affects all users of ActionClients. I can contribute better once there's guidance from maintainers :grin:
It seems like the change I attempted didn't fully fix the problem; I still see the following on some compute platforms:
Ignoring unexpected goal response. There may be more than one action server for the action
Ignoring unexpected result response. There may be more than one action server for the action
I tried looking deeper in `client.py`, and it seems a simple condition/lock won't solve the problem, because it doesn't guarantee that the dictionaries are updated before they are checked in `async def execute`. So I attempted to use a `threading.Event` to synchronize the dictionary writes and reads. However, the goal, result, and cancel cases in `async def execute` each need their own events, to avoid a write for one wrongly triggering the read of another. Trying this now in the existing PR https://github.com/ros2/rclpy/pull/1125.
It still doesn't seem very clean; another possible solution could be to use multiple queues to send the dictionary entries between threads instead of referencing them.
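For reference, a rough sketch of that queue idea (pure Python; all names are made up). The sending thread only enqueues the mapping and never touches the executor's dictionary; note that on its own this removes the shared mutation but does not close the ordering window, unless the reader is willing to block on the queue.

```python
import queue

pending_goal_ids = queue.Queue()  # (sequence_number, goal_id) pairs published by the sender
sequence_to_goal_id = {}          # owned exclusively by the executor thread


def send_goal(sequence_number, goal_id):
    # The sender never mutates the executor's dictionary directly.
    pending_goal_ids.put((sequence_number, goal_id))


def handle_response(sequence_number):
    # Drain everything the sender has published so far before the lookup.
    while True:
        try:
            seq, goal_id = pending_goal_ids.get_nowait()
        except queue.Empty:
            break
        sequence_to_goal_id[seq] = goal_id
    # A miss here would still mean "unexpected response" if the sender hasn't
    # enqueued yet, so the ordering problem needs to be solved separately.
    return sequence_to_goal_id.get(sequence_number)


send_goal(1, "goal-1")
print(handle_response(1))  # prints "goal-1"
```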
I don't fully understand how a lock doesn't fix the issue in this case. Are you certain you were running the code with the lock when you observed this issue? 😅. It guarantees the dictionaries are updated at least in the send_goal_async case, because the execute function can't receive a sequence number until the client has been used in send_goal_async (which is inside a lock). Unless I am misunderstanding this race condition?
I have a similar fix on my end (I subclass ActionClient and lock the entirety of send_goal_async and execute) and it has solved the problem for me.
Hello @apockill! Other than the `sleep()` example, what is your use case where you're seeing this bug? Is it a flaky one?
Hi @adityapande-1995! I gave the sleep example just to show how this race condition is purely caused by flaky timing, and that luck currently prevents most users from seeing it. A much, much smaller wait can also trigger it.
I'm seeing this flakiness happen regularly on my stack. We call actions fairly often (maybe 1-3 actions per second, on average). This bug will reveal itself sometimes minutes after startup, other times within a few hours of runtime.
What mainly concerns me is that it's purely timing luck that this works at all, and the error message is confusing enough that other users might run into this but not know what happened. After all, the symptom is a confusing log message and a future that will never complete.
If we could find a way to protect the relevant Python state tied to `sequence_number`s, I believe that would solve the problem.
> I don't fully understand how a lock doesn't fix the issue in this case. Are you certain you were running the code with the lock when you observed this issue? 😅. It guarantees the dictionaries are updated at least in the send_goal_async case, because the execute function can't receive a sequence number until the client has been used in send_goal_async (which is inside a lock). Unless I am misunderstanding this race condition?
> I have a similar fix on my end (I subclass ActionClient and lock the entirety of send_goal_async and execute) and it has solved the problem for me.
You are right, the lock does seem to solve the issue. Once I fixed the issue caused by the race condition, I had another unrelated bug in my setup which misled me into believing that the issue was not fixed.
@adityapande-1995, adding another viewpoint here: dictionaries like `_goal_sequence_number_to_goal_id` are being edited and read in two threads: the thread the user sends the goal request in, and the thread in which the node spins, which calls `async def execute`. So the race condition here is evident from the code itself.
To others who think they might be running into this issue, here's a quick hack you can use that doesn't require building rclpy:
```python
from threading import RLock
from typing import Any

from rclpy.action import ActionClient
from rclpy.task import Future


class PatchRclpyIssue1123(ActionClient):

    _lock: RLock = None  # type: ignore

    @property
    def _cpp_client_handle_lock(self) -> RLock:
        # Lazily create one re-entrant lock shared by every method below.
        if self._lock is None:
            self._lock = RLock()
        return self._lock

    async def execute(self, *args: Any, **kwargs: Any) -> None:
        with self._cpp_client_handle_lock:
            return await super().execute(*args, **kwargs)  # type: ignore

    def send_goal_async(self, *args: Any, **kwargs: Any) -> Future:
        with self._cpp_client_handle_lock:
            return super().send_goal_async(*args, **kwargs)

    def _cancel_goal_async(self, *args: Any, **kwargs: Any) -> Future:
        with self._cpp_client_handle_lock:
            return super()._cancel_goal_async(*args, **kwargs)

    def _get_result_async(self, *args: Any, **kwargs: Any) -> Future:
        with self._cpp_client_handle_lock:
            return super()._get_result_async(*args, **kwargs)
```
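If it helps, drop-in usage looks something like this (the node name, action type, and action name below are just placeholders):

```python
from rclpy.node import Node

from custom_msgs.action import Home as SomeMsg  # placeholder action type


class PatchedClientNode(Node):

    def __init__(self):
        super().__init__("patched_action_client")
        # Same constructor signature as ActionClient, so it can be swapped in directly.
        # PatchRclpyIssue1123 is the workaround class defined above.
        self._action_client = PatchRclpyIssue1123(self, SomeMsg, "some_action")
```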
On another note, @adityapande-1995: thoughts on a solution where there's some decorator called `uses_cpp_client` which can wrap all of the functions in `ActionClient` that use the underlying C++ action client?
For example:
```python
import functools
from typing import Any


def uses_cpp_client(fn):
    # Wrap a method so the shared lock is held while the C++ client is in use.
    @functools.wraps(fn)
    def wrapper(self, *args: Any, **kwargs: Any) -> Any:
        with self._cpp_client_handle_lock:
            return fn(self, *args, **kwargs)

    return wrapper


class ActionClient(Waitable):
    ...

    @uses_cpp_client
    def send_goal_async(...):
        ...

    @uses_cpp_client
    def _cancel_goal_async(...):
        ...

    @uses_cpp_client
    def _get_result_async(...):
        ...

    @async_uses_cpp_client
    async def execute(...):
        ...
```
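The `async_uses_cpp_client` referenced above isn't spelled out here; a hypothetical version, mirroring the PatchRclpyIssue1123 workaround earlier in this thread (which holds the `RLock` across the awaited call), could look like:

```python
import functools
from typing import Any


def async_uses_cpp_client(fn):
    # Same idea as uses_cpp_client, but for coroutine methods such as execute().
    @functools.wraps(fn)
    async def wrapper(self, *args: Any, **kwargs: Any) -> Any:
        # Holding a plain RLock across the await mirrors the workaround class above;
        # it assumes execute() finishes its work without yielding back to another
        # task that needs the same lock on this thread.
        with self._cpp_client_handle_lock:
            return await fn(self, *args, **kwargs)

    return wrapper
```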
I was looking at the code for service clients, and it also uses a lock to sync the C++ state (`self.__client` calls) with the Python state (`self.__pending_requests`). I think this highlights that the issue on the action client side is just an oversight, and that using locks is the way to go.
I'll submit a PR!
@apockill thanks for being patient on this.
> I think this highlights that the issue on the action client side is just an oversight, and that using locks is the way to go.

I think that is the correct assumption here. `_pending_requests` is protected by `_lock` in the service client.
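For anyone following along, the shape of that pattern, paraphrased rather than copied from the rclpy source (the handle and method names below are invented), is roughly: take the lock, call the handle, and record the sequence number before releasing it.

```python
import threading
from typing import Any, Dict

from rclpy.task import Future


class LockedPendingRequests:
    """Paraphrased sketch of the service-client locking pattern, not the real rclpy code."""

    def __init__(self, handle: Any) -> None:
        self._handle = handle          # invented stand-in for the C-level client handle
        self._lock = threading.Lock()
        self._pending: Dict[int, Future] = {}

    def call_async(self, request: Any) -> Future:
        future = Future()
        with self._lock:
            # The sequence number is recorded before the lock is released, so the
            # response path below can never observe a request it doesn't know about.
            sequence_number = self._handle.send_request(request)
            self._pending[sequence_number] = future
        return future

    def handle_response(self, sequence_number: int, response: Any) -> None:
        with self._lock:
            future = self._pending.pop(sequence_number, None)
        if future is not None:
            future.set_result(response)
```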
Sorry for the delay in looking at this.
I certainly can see how, in theory, we can get into this situation. However, so far I've been unable to replicate the problem as originally stated on the `rolling` branch. What I did was to edit `action/client.py` to add in the 5 second delay. Then I launched `ros2 run action_tutorials_py fibonacci_action_server` in one terminal, and `ros2 run action_tutorials_py fibonacci_action_client` in a second terminal. I do indeed see the 5 second delay on the client side, but I also see the client eventually catch up and receive the output.
Does someone have a relatively simple example that I could use to try to reproduce this?
Hi @clalancette, thank you for looking into this! I've completely forgotten to submit a PR for this.
Do you think that the difference might be in the executor? Come to think of it, I'm guessing this would only occur in a MultiThreadedExecutor case, since a SingleThreadedExecutor wouldn't be able to have a race condition like this.
> Do you think that the difference might be in the executor? Come to think of it, I'm guessing this would only occur in a MultiThreadedExecutor case, since a SingleThreadedExecutor wouldn't be able to have a race condition like this.
Yeah, that very well may be the case. If there is a simple example I can use to try to reproduce it, I'm certainly happy to take a look and review a PR.
Absolutely! Here's an example that replicated it on my end. You might have to switch out `custom_msgs` for some other action that takes no parameters.
client.py
```python
from time import sleep

import rclpy
from rclpy.action import ActionClient
from rclpy.callback_groups import MutuallyExclusiveCallbackGroup
from rclpy.executors import MultiThreadedExecutor
from rclpy.node import Node
from rclpy.task import Future

from custom_msgs.action import Home as SomeMsg


class RaceyActionClient(ActionClient):

    def send_goal_async(self, goal, feedback_callback=None, goal_uuid=None):
        """
        Send a goal and asynchronously get the result.

        The result of the returned Future is set to a ClientGoalHandle when receipt of the goal
        is acknowledged by an action server.

        :param goal: The goal request.
        :type goal: action_type.Goal
        :param feedback_callback: Callback function for feedback associated with the goal.
        :type feedback_callback: function
        :param goal_uuid: Universally unique identifier for the goal.
            If None, then a random UUID is generated.
        :type: unique_identifier_msgs.UUID
        :return: a Future instance to a goal handle that completes when the goal request
            has been accepted or rejected.
        :rtype: :class:`rclpy.task.Future` instance
        :raises: TypeError if the type of the passed goal isn't an instance of
            the Goal type of the provided action when the service was
            constructed.
        """
        if not isinstance(goal, self._action_type.Goal):
            raise TypeError()

        request = self._action_type.Impl.SendGoalService.Request()
        request.goal_id = (
            self._generate_random_uuid() if goal_uuid is None else goal_uuid
        )
        request.goal = goal
        sequence_number = self._client_handle.send_goal_request(request)
        sleep(5)  # <--- this line causes this race condition nearly 100% of the time
        if sequence_number in self._pending_goal_requests:
            raise RuntimeError(
                "Sequence ({}) conflicts with pending goal request".format(
                    sequence_number
                )
            )

        if feedback_callback is not None:
            # TODO(jacobperron): Move conversion function to a general-use package
            goal_uuid = bytes(request.goal_id.uuid)
            self._feedback_callbacks[goal_uuid] = feedback_callback

        future = Future()
        self._pending_goal_requests[sequence_number] = future
        self._goal_sequence_number_to_goal_id[sequence_number] = request.goal_id
        future.add_done_callback(self._remove_pending_goal_request)
        # Add future so executor is aware
        self.add_future(future)

        return future


class SomeActionClient(Node):

    def __init__(self):
        super().__init__("action_client")
        self._action_client = RaceyActionClient(
            self,
            SomeMsg,
            "some_action",
            callback_group=MutuallyExclusiveCallbackGroup(),
        )
        self.call_action_timer = self.create_timer(
            timer_period_sec=10,
            callback=self.send_goal,
            callback_group=MutuallyExclusiveCallbackGroup(),
        )

    def send_goal(self):
        goal_msg = SomeMsg.Goal()
        # self._action_client.wait_for_server()
        result = self._action_client.send_goal(goal_msg)
        print(result)


def main(args=None):
    rclpy.init(args=args)
    action_client = SomeActionClient()
    executor = MultiThreadedExecutor()
    rclpy.spin(action_client, executor)


if __name__ == "__main__":
    main()
```
server.py
```python
import rclpy
from rclpy.action import ActionServer
from rclpy.node import Node

from custom_msgs.action import Home as SomeMsg


class SomeActionServer(Node):

    def __init__(self):
        super().__init__("fibonacci_action_server")
        self._action_server = ActionServer(
            self, SomeMsg, "some_action", self.execute_callback
        )

    def execute_callback(self, goal_handle):
        self.get_logger().info("Executing goal...")
        result = SomeMsg.Result()
        return result


def main(args=None):
    rclpy.init(args=args)
    action_server = SomeActionServer()
    rclpy.spin(action_server)


if __name__ == "__main__":
    main()
```
Replication instructions:
- Run `server.py` in one terminal
- Run `client.py` in another terminal

Observe the `client.py` calls failing. The timer fires every 10 seconds, but the goal only ends up being sent once, since the race condition occurs and locks up a thread:
[WARN] [1691123997.104652764] [action_client]: Ignoring unexpected goal response. There may be more than one action server for the action 'some_action'
Are there any updates on this? At our company, we are also experiencing the same issue.
I am also seeing this when the system is under load. Using the same callback group (Reentrant) did not solve the issue. I will try the other suggested workaround.
Is this supposed to be fixed in the current Humble release or not, and what is the progress? This is a showstopper for the application I have...
Still experiencing this issue on the current Humble release.
In my case, I encountered this in the NavigateToPose action from nav2. If I call this action sequentially (for several poses) and insert a `time.sleep` between them to run some routine after one goal has succeeded, then sometimes this error (`Ignoring unexpected goal response`) occurs.
Looks like the issue is also still present in Iron. I agree with the other posters that this is an absolute showstopper for the use of actions in any industrial system and should finally be prioritized by the ROS 2 core team.
@clalancette, if there is anything that can be done from the user side to help fix this issue, please let us know!
The rclcpp side, by the way, finally seems to be fixed by https://github.com/ros2/rclcpp/pull/2531 (at least in Iron).
+1 we're also encountering this error. We're on Humble.
Also still encountering this error on Humble...