
Add asynchronous rollout + reward stage to PPOTrainer

Open dvmazur opened this issue 8 months ago • 12 comments

When training on code tasks, the reward stage can take quite a long time, as it requires compiling the model's output and running many test cases. In some setups we have seen the reward stage take almost as long as rollout (140s vs. 180s, respectively). Yet it is possible to hide some of the reward stage's latency by overlapping it with the rollout stage.

Overlapping the rollout and reward stages is possible because 1) vLLM can asynchronously return the first trajectory before it finishes the rest, and 2) rollout is GPU-bound while verification is CPU-bound. This feature could be implemented by leveraging vLLM's AsyncLLMEngine API. I've looked into how this could be done in veRL, and it seems the feature would require changes to the DataProto, BaseRollout and *RewardManager APIs.
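For illustration, here is a minimal sketch of the proposed overlap, assuming vLLM's `AsyncLLMEngine` streaming API. `compute_reward` is a hypothetical CPU-bound verifier (e.g. one that compiles the output and runs test cases) and is not part of verl or vLLM:

```python
import asyncio
import uuid

from vllm import AsyncLLMEngine, SamplingParams

# Hypothetical engine construction, e.g.:
# engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model="..."))


async def rollout_with_overlapped_reward(engine: AsyncLLMEngine,
                                         prompts: list[str],
                                         compute_reward):
    sampling_params = SamplingParams(max_tokens=512)

    async def one_request(prompt: str):
        # engine.generate streams partial outputs; the last item yielded is
        # the finished trajectory.
        final = None
        async for out in engine.generate(prompt, sampling_params, str(uuid.uuid4())):
            final = out
        text = final.outputs[0].text
        # Run the CPU-bound verifier in a worker thread so it overlaps with
        # GPU-bound generation of the remaining requests (verifiers that shell
        # out to run test cases release the GIL while they wait).
        reward = await asyncio.to_thread(compute_reward, prompt, text)
        return text, reward

    # Each trajectory enters the reward stage as soon as it finishes,
    # instead of waiting for the whole batch to complete.
    return await asyncio.gather(*(one_request(p) for p in prompts))
```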

Would it be possible to implement something like this in veRL? If you are interested in a feature like this, but don't have the bandwidth, I could help out myself.

dvmazur avatar Apr 08 '25 17:04 dvmazur

This has been implemented in https://github.com/agentica-project/verl-pipeline, which was used to train the DeepCoder-14B model.

They report a 2.5x speedup in code RL training.


sunjin-k avatar Apr 10 '25 02:04 sunjin-k

Great! Didn't know about this project! Are there any plans to add this functionality to veRL?

dvmazur avatar Apr 10 '25 14:04 dvmazur

I think they mention that this is only done for the 1.5B model, not the 14B yet. Would definitely love to see this merged into verl.

faresobeid avatar Apr 10 '25 21:04 faresobeid

> I think they mention that this is only done for the 1.5B model, not the 14B yet

Yeah, it appears you are right. At least they provide the comparison for the 1.5B models. I'll keep this issue open then. Certainly seems like a good feature to have

dvmazur avatar Apr 10 '25 21:04 dvmazur

@faresobeid @sunjin-k @dvmazur It seems that their implementation is based on an earlier version of vLLM (before v0.8.2). As of vllm==0.8.2 and 0.8.3, the model executor runs in background processes launched by AsyncLLM, and we can no longer access the model weights from AsyncLLMEngine.

SparkJiao avatar Apr 11 '25 04:04 SparkJiao

@youkaichao could you comment on why the latest version of vLLM limits weight handles?

eric-haibin-lin avatar Apr 12 '25 05:04 eric-haibin-lin

And yes, I agree that using the async LLM engine sounds like a promising approach overall. We can use a compatible version of vLLM to develop the feature while waiting for compatibility patches from vLLM main.

eric-haibin-lin avatar Apr 12 '25 05:04 eric-haibin-lin

> why the latest version of vLLM limits weight handles

what do weight handles mean?

youkaichao avatar Apr 13 '25 08:04 youkaichao

> why the latest version of vLLM limits weight handles

> what do weight handles mean?

I think it refers to access to the model executor (the `llm` object). In the implementation above, we can update the weights directly: https://github.com/agentica-project/verl-pipeline/blob/master/verl/workers/sharding_manager/fsdp_vllm.py#L99-L102.

I think that since vllm 0.8.2, the model executor runs in background processes, so we cannot access the weights in-process anymore.
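For context, here is a hypothetical simplification of the in-process pattern the linked code relies on (the exact attribute chain varies across vLLM versions, and `sync_weights_in_process` is an illustrative name, not a verl API):

```python
import torch


def sync_weights_in_process(llm, updated_params: dict[str, torch.Tensor]) -> None:
    # Reach through the engine to the live model object. This only works
    # while the model executor lives in the same process as the caller.
    model = llm.llm_engine.model_executor.driver_worker.model_runner.model
    # load_weights is the per-model weight-loading hook in vLLM.
    model.load_weights(updated_params.items())
```

With AsyncLLM in vLLM >= 0.8.2 the engine core runs in a separate process, so there is no in-process model object left to reach, and weight updates have to go through an explicit RPC or collective instead.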

SparkJiao avatar Apr 14 '25 02:04 SparkJiao

I have tried the asynchronous rollout approach in the verl-pipeline, but currently I’m facing an issue: when rollout_wg and actor_wg are separated, updating the vLLM parameters relies on Ray's communication plane instead of high-performance communication operators like NCCL. This increases the communication load on the driver process (since all data is distributed through the driver), which results in worse performance compared to the HybridEngine. In actual tests, communication takes 80 seconds, accounting for 20% of a 400-second step. This issue might require Ray to support GPU-level communication features in order to be resolved.
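To illustrate the difference, here is a hedged sketch of the two update paths; the `load_weights` actor method is illustrative, not a verl API:

```python
import ray
import torch
import torch.distributed as dist


def update_via_ray_driver(rollout_workers, state_dict: dict[str, torch.Tensor]):
    # Driver-mediated path: every tensor is copied to CPU, serialized into
    # Ray's object store, and pulled by each rollout worker.
    ref = ray.put({name: t.cpu() for name, t in state_dict.items()})
    ray.get([w.load_weights.remote(ref) for w in rollout_workers])


def update_via_nccl(state_dict: dict[str, torch.Tensor], src_rank: int = 0):
    # Collective path: each tensor is broadcast GPU-to-GPU over NCCL,
    # never routing the data through the driver process.
    for t in state_dict.values():
        dist.broadcast(t, src=src_rank)
```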

Wraythh avatar Apr 15 '25 07:04 Wraythh

Hi! Any updates here? Maybe we should use SGLang instead of vLLM for the DeepCoder-style pipeline requested here?

Inf1delis avatar May 12 '25 21:05 Inf1delis

Hi! I noticed veRL now supports async rollouts for SGLang. I'm sure it would be possible to implement agentica's pipeline using it. Would you be interested in a feature like this? If so, I could come up with a design / MVP.

dvmazur avatar May 27 '25 11:05 dvmazur

> Hi! I noticed veRL now supports async rollouts for SGLang. I'm sure it would be possible to implement agentica's pipeline using it. Would you be interested in a feature like this? If so, I could come up with a design / MVP.

That would be great! Although I'm not sure: how would you want to leverage SGLang's async rollout?

eric-haibin-lin avatar Jun 04 '25 22:06 eric-haibin-lin