
Add Async Inference

Open fracapuano opened this issue 6 months ago • 4 comments

What this does

This PR adds support for async inference.

  • async inference: decoupling action prediction from action execution, aimed at improving adaptability to the environment by moving past the sequential-inference paradigm.

This PR uses gRPC as a communication protocol (why: ~5x faster than HTTP).

TLDR

A robot_client periodically sends observations to a policy_server and reads back actions, which it enqueues and consumes. Critically, the actions read on the robot_client are obtained by querying inference on the remote server.
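As a mental model, here is a minimal, self-contained sketch of that loop. All names, constants, and the fake in-process "server" are hypothetical stand-ins (not the actual lerobot API), and chunk overlap/aggregation is deliberately omitted:

```python
import queue
import threading
import time

action_queue: "queue.Queue[int]" = queue.Queue()
CHUNK_SIZE, THRESHOLD = 10, 4  # ask for new actions once <= 4 remain

def fake_policy_server(obs: int) -> list[int]:
    """Stand-in for the remote PolicyServer: observation -> chunk of actions."""
    time.sleep(0.2)  # simulated inference latency
    return list(range(obs, obs + CHUNK_SIZE))

def fetch_chunk(obs: int) -> None:
    """Background request: enqueue the chunk once the 'server' replies."""
    for action in fake_policy_server(obs):
        action_queue.put(action)

request = threading.Thread(target=fetch_chunk, args=(0,))
request.start()
step = 0
while step < 30:
    # send a new "observation" early, while actions are still being executed
    if action_queue.qsize() <= THRESHOLD and not request.is_alive():
        request = threading.Thread(target=fetch_chunk, args=(step,))
        request.start()
    if not action_queue.empty():
        print(f"step {step}: executing action {action_queue.get()}")
        step += 1
    time.sleep(1 / 30)  # environment_dt
```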

How it was tested

Tested on a real-world SO100 arm, using both a dummy and an actual policy.

Dummy policy:

  • Rather than predicting a series of actions for a given observation, it simply streams actions from a pre-recorded dataset

Actual policy:

  • A (severely undertrained, steps=2) ACT policy, used mainly to assess the complete inference pipeline
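For intuition, a dummy policy of this kind could look like the sketch below (hypothetical, not the PR's implementation): it ignores the observation and replays pre-recorded actions, which isolates the transport and queueing pipeline from model quality.

```python
class DummyPolicy:
    """Replays actions from a pre-recorded dataset instead of predicting them."""

    def __init__(self, recorded_actions: list[list[float]], chunk_size: int = 10):
        self.actions = recorded_actions
        self.chunk_size = chunk_size
        self.cursor = 0

    def predict_chunk(self, observation) -> list[list[float]]:
        # the observation is ignored on purpose: only the pipeline is under test
        chunk = self.actions[self.cursor : self.cursor + self.chunk_size]
        self.cursor += self.chunk_size
        return chunk

policy = DummyPolicy([[0.0, 0.1]] * 100)
print(len(policy.predict_chunk(observation=None)))  # 10
```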

Demo (video)

https://x.com/_fracapuano/status/1920183316335956053

How to checkout & try? (for the reviewer)

Install the required dependencies to test this feature out:

pip install grpcio grpcio-tools

Launch the policy server:

python lerobot/scripts/server/policy_server.py

Once the server is live, launch the robot client, specifying the task with a text instruction:

python lerobot/scripts/server/robot_client.py --task 'fold my t-shirt'

cc @Cadene @mshukor, as we discussed this feature 🤗

fracapuano · Jun 03 '25 16:06

A few comments

This PR introduces support for async inference. Essentially, this decouples (1) action execution from (2) action prediction. This is particularly relevant considering the tendency of current models ([1], [2], [3]) to output chunks of actions ($a_{t:t+n}$) rather than single actions ($a_t$) given an observation ($o_t$).

How do I know it works

This demo video shows action execution from a VLA-like policy. Critically, there is ~no moment where the robot waits for new actions to perform, resulting in an absence of lag at execution time (something the current sequential inference stack inevitably exhibits [3]).

Demo thread: https://x.com/_fracapuano/status/1920183316335956053

Async vs. Sequential

Action execution is decoupled from action prediction by hosting the action execution process on a client (RobotClient) that receives chunks of actions predicted by a remote PolicyServer. Compared to the standard sequential approach (send observation, compute actions, act, send observation, ...), this decoupled approach (1) removes lags and (2) enables more adaptive policies. The following table compares the two approaches:

| Sequential inference | Asynchronous inference |
| --- | --- |
| (diagram: sync) | (diagram: async_inf) |
| Computes actions, acts, and then recomputes actions, leaving the robot idle at runtime while actions are computed, thus inducing lags and a lack of responsiveness while executing a chunk of actions. | Sends an intermediate observation to trigger action generation, avoiding action-prediction lags at runtime. From a scientific perspective, it relies on the hypothesis of a good world model more than sequential inference does. |

Solving both of these issues (lags, and a lack of responsiveness while going through an action chunk) requires separating action computation from action execution.
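To make the cost of the sequential loop concrete, here is a tiny, self-contained timing sketch (the toy predict function and all constants are hypothetical): the robot idles for the full inference latency before every chunk.

```python
import time

ENV_DT, LATENCY, CHUNK = 1 / 30, 0.2, 10

def predict(obs: float) -> list[float]:
    """Toy stand-in for policy inference; the robot idles during this call."""
    time.sleep(LATENCY)
    return [obs] * CHUNK

start, obs = time.perf_counter(), 0.0
for _ in range(3):
    for _action in predict(obs):  # blocking inference, then open-loop execution
        time.sleep(ENV_DT)        # "acting" for one environment step
    obs += 1.0
elapsed = time.perf_counter() - start
print(f"3 chunks took {elapsed:.2f}s vs {3 * CHUNK * ENV_DT:.2f}s of pure execution")
```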

fracapuano · Jun 03 '25 16:06

Analyzing Async Inference

Separating action execution from action computation presents various benefits. For starters, policies produce chunks of actions defined as:

$$\pi: \mathcal O \mapsto \tilde{\mathcal A}, \ \text{s.t.} \ \begin{pmatrix} a_{t} \\ a_{t+1} \\ \vdots \\ a_{t+n} \end{pmatrix} = \pi(o_t)$$

Under the assumption of a good-enough implicit world model in the policy, it follows that it is reasonable to expect $\pi(o_k)_j \simeq \pi(o_{k+q})_j$, where $q \in \mathbb Z$ is the number of timesteps between the policies' outputs, and $j$ ranges over the overlapping indices, i.e. the indices for which both $\pi(o_k)$ and $\pi(o_{k+q})$ contain predicted actions.
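In code, the overlapping indices $j$ are simply the timesteps predicted by both chunks. A toy illustration (plain dicts instead of lerobot's actual data structures):

```python
CHUNK = 5

def predict(obs_step: int) -> dict[int, float]:
    """Toy policy: an observation at step t yields actions for steps t..t+n-1."""
    return {t: float(t) for t in range(obs_step, obs_step + CHUNK)}

chunk_k = predict(10)   # pi(o_10): actions for steps 10..14
chunk_kq = predict(12)  # pi(o_{k+q}) with q = 2: actions for steps 12..16
overlap = sorted(chunk_k.keys() & chunk_kq.keys())
print(overlap)  # [12, 13, 14]: the indices j on which both chunks predict
```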

Therefore, one can effectively start computing the action chunk for future observations while still stepping through the current one. Visually,

(diagram: async_inference drawio)

In such a configuration, there are three main moments (highlighted in red).

  1. Initialization: when the client spawns, it sends the initial observation to the remote server for action prediction. Because the client has not yet received any actions, the robot stays still until a chunk of $n$ actions arrives, after inference_latency seconds (assuming communication times are negligible with respect to the inference time and the environment's discrete dt). When the actions are received from the server, they are enqueued, and the robot starts performing them by stepping through the queue sequentially.
  2. Sending another observation: when the action queue drops to a critical fraction of the maximal action chunk size, the client sends a new observation to the server, querying for new actions. While the server computes the new actions, the client keeps stepping through the queue.
  3. Receiving incoming actions: when new actions arrive, they are integrated with those remaining in the queue. The aggregation function produces aggregated actions on the timesteps where the current queue and the incoming chunk overlap. The robot then steps through this aggregated queue until it reaches 2. again (one possible aggregation rule is sketched right below).
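A minimal sketch of one possible aggregation rule (the PR's actual aggregation function may differ; the convex blend here is just an example):

```python
def aggregate(queued: dict[int, float], incoming: dict[int, float],
              w: float = 0.5) -> dict[int, float]:
    """Blend queued and incoming actions on overlapping timesteps."""
    merged = dict(incoming)
    for t in queued.keys() & incoming.keys():
        merged[t] = w * queued[t] + (1 - w) * incoming[t]  # convex blend
    for t in queued.keys() - incoming.keys():
        merged[t] = queued[t]  # near-term actions only the old queue covers
    return merged

queued = {12: 0.0, 13: 0.0, 14: 0.0}             # actions left in the queue
incoming = {12: 1.0, 13: 1.0, 14: 1.0, 15: 1.0}  # fresh chunk from the server
print(aggregate(queued, incoming))  # {12: 0.5, 13: 0.5, 14: 0.5, 15: 1.0}
```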

Interestingly, the key factors of this process are (1) the inverse frequency at which the robot's environment evolves (environment_dt in the visualization and code), (2) the inference time on the policy server's side and (3) how often observations are sent to the remote server for inference. For starters, the ratio $c=\tfrac{\texttt{environment\_dt}}{\texttt{inference\_latency}}$ between (1) and (2) directly influences the behavior of the robot client (for instance, environment_dt = 1/30 s and inference_latency = 0.5 s give $c \approx 0.07$):

  • When $c \to 0$, the environment evolves much faster than new actions can be computed. As a consequence, queues are consumed much faster than they are refilled, and even with this async inference stack one inevitably converges to sequential-like behavior.
  • Whenever $c \geq 1$, new chunks are computed at least as fast as the environment steps. This means the overlap between current and incoming actions is always at least action_chunk_size - 1, and the size of the action queue stays close to action_chunk_size.

However, both of these regimes for $c$ are rather speculative and analytical, as one typically has no control over inference_latency. While this async inference stack makes it possible to align inference_latency with environment_dt by serving PolicyServers on accelerated platforms, for which inference is typically much faster than CPU-based inference, we want to avoid relying on hyper-specialized hardware platforms to shape runtime behavior.

To drive $c \to 1$, one can also directly control the other key parameter of this design: the critical queue size $g=\tfrac{k}{n}$ (with $k$ the queue-length threshold and $n$ the chunk size) at which new observations are sent. The visualization below compares three behaviors, varying only the critical queue size $g$ (expressed as a fraction of the maximal queue size).

| $g=0.0$ | $g=0.7$ | $g=1.0$ |
| --- | --- | --- |
| (screenshot) | (screenshot) | (screenshot) |

Interestingly, $g=0$ reproduces the behavior of sequential-like inference. Because the action queue is completely exhausted before a new observation is sent to the server, there is a gap in the robot's acting equal to the time it takes the server to receive the observation, compute the new action chunk, and send it back to the robot client. During this time, the queue is empty and the robot is unable to act.

On the other side of the spectrum, setting $g=1.0$ sends an observation to the remote server at every timestep, which results in an almost-always-full queue with substantial overlap between consecutive chunks. Because $c < 1$, the queue is not exactly full. However, running inference at every step can be costly, which is why letting the client go through a fraction $1-g$ of its current chunk before sending a new observation can be a more compute-efficient strategy.

Critically, among the three configurations presented, $g=0$ relies the least on the world-modeling ability of the policy, while $g=1$ relies on it heavily to prevent incorrect or erratic behavior. Intermediate values $g \in (0,1)$ strike a (still not fully characterized) balance between these two extremes.

Lastly, it is worth noting that this code has been tested for deployment over the internet, so one can host a policy server on a GPU-enabled machine and act with minimal lag on a given robot client :)

fracapuano · Jun 03 '25 16:06

Filtering out observations

Not all observations need to be processed. Some can be safely discarded, especially when the robot is stuck. Otherwise, the robot could spiral: stuck in an action-fetching loop without the possibility to act, and therefore unable to change the observation and exit the impasse.

  • fbe8b6a3ba7731d9b599c9d494b5271c4f126ba0 fixes this
  • Observations sent by the robot client are now checked on the policy server side to make sure they should indeed be processed.
  • When a sent observation does not satisfy the processing conditions, it is not enqueued for processing on the policy server (i.e., the client does not receive new actions)
  • If the queue on the robot client becomes empty, observations are marked with a must_go flag. This triggers processing on the policy server regardless of the check, preventing the robot from being idle

This means the action queues behave slightly differently from what is illustrated above: the robot might be left with no actions in the queue (at which point it would query for a new action chunk via must_go) if the observations it sends are too similar to previous ones.


For now, observations are compared exclusively in terms of similarity in the joint space of the two robots.
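A minimal sketch of such a check (the threshold, names, and signature are hypothetical, not the PR's actual code):

```python
import math

def should_process(joints: list[float], last_processed: "list[float] | None",
                   must_go: bool, eps: float = 1e-2) -> bool:
    """Process observations that moved enough in joint space, or must_go ones."""
    if must_go or last_processed is None:
        return True  # an empty client queue forces processing
    return math.dist(joints, last_processed) > eps  # L2 joint-space distance

print(should_process([0.10, 0.20], [0.10, 0.20], must_go=False))  # False: too similar
print(should_process([0.10, 0.20], [0.10, 0.20], must_go=True))   # True: queue is empty
```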

fracapuano · Jun 03 '25 16:06


Thanks for providing async inference! The scripts (client & policy server) seem to be for real robots only. Could you provide async inference for simulation tasks? Hoping for an early reply, thanks in advance ~

JuilieZ · Jun 10 '25 09:06

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

“3. Receiving incoming actions: When new incoming actions are received, they are integrated with the remaining actions in the queue. The aggregation function allows to obtain aggregate actions on overlapping timesteps between the current action queue and the incoming one. Then, the robot steps through this aggregate queue until it reaches 2. again.”

I have a question about the action chunk fusion approach: if an action chunk has 30 steps, inference is triggered at step 15, and inference takes ~5 steps (so the new chunk is received at step 20), does the method directly merge the new chunk with the old chunk's remaining actions? The concern is that the new chunk is generated from step-15 observations, but the robot arm moves during inference, so its state at step 20 differs from step 15. Directly executing the new chunk might cause jitter due to this state mismatch. Would it make sense to discard the first 5 steps of the new chunk before proceeding, to align with the current arm state? Appreciate your thoughts on this.

zttiannnn · Jul 04 '25 10:07

Hey @zttiannnn thank you for your question.

Whenever an incoming chunk is received, the actions (each labeled with its timestep, as instances of TimedAction) are inspected and filtered out whenever their timestep index is not ahead of the current timestep. That is to say, in your example, the client would loop through the actions generated from the step-15 observation, remove action15, action16, ..., action20, and compute the next aggregated chunk from action21 onwards (but only on the overlap with the previous chunk).
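In toy form (with TimedAction simplified to a (timestep, action) tuple), that filtering amounts to:

```python
current_step = 20
incoming = [(t, float(t)) for t in range(15, 45)]  # chunk from the obs at step 15
fresh = [(t, a) for t, a in incoming if t > current_step]  # drop stale actions
print(fresh[0])  # (21, 21.0): aggregation then starts from action21 onwards
```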

The following visual should help: (diagram: from_client_perspective)

fracapuano · Jul 04 '25 16:07

I have more or less checked the code and added comments. I have also provided fixes for some of the comments: https://github.com/huggingface/lerobot/pull/1441

helper2424 · Jul 08 '25 18:07

Hi all, sorry to comment on a closed MR, but I don't understand why the communication between robot and policy was implemented through gRPC here. In other parts of the codebase you used zmq/http-based communication. I was just wondering if there was a specific reason to use gRPC here instead.

mgiac-hexagon · Aug 11 '25 11:08

Hey @mgiac-hexagon 👋 Thank you for asking ⭐ The rationale behind using gRPC was that we wanted fast communication (the robot is literally waiting for a list of actions to execute, so we want to limit bottlenecks). Some people within the team pointed to gRPC as a particularly fast way of communicating, and I found it fast enough for our use cases early in the dev cycle, hence the decision to use it.

Personally, I wasn't familiar with gRPC at all when I started working on this, so I know it can indeed be a bit daunting at first, but I think the code's structure (mostly thanks to @imstevenpmwork) is clear enough now that you should be able to make sense of it relatively easily. If that's not the case, by all means please do feel free to open an issue and tag me 🤗

fracapuano · Aug 24 '25 06:08