Add asynchronous rollout + reward stage to PPOTrainer
When training on code tasks, the reward stage can take quite a long time, as it requires compiling the model's output and running a large number of test cases. In some setups we have seen the reward stage take almost as long as rollout (140s vs. 180s respectively). Yet it is possible to hide some of the reward stage's latency by overlapping it with the rollout stage.
Overlapping the rollout and reward stages is possible because 1) vLLM can asynchronously return the first trajectory before it finishes the rest, and 2) rollout is GPU-bound while verification is CPU-bound. This feature could be implemented by leveraging vLLM's AsyncLLMEngine API. I've looked into how this could be done in veRL, and it seems the feature would require changes to the DataProto, BaseRollout and *RewardManager APIs.
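As a rough sketch of the idea (not veRL's actual API: `compute_reward` is a hypothetical stand-in for compiling and running test cases, and exact AsyncLLMEngine signatures vary across vLLM versions), something like:

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams


def compute_reward(text: str) -> float:
    # Hypothetical verifier: compile the model's output and run test cases.
    # CPU-bound, so it can run in worker processes while the GPU generates.
    return float(bool(text))


async def rollout_with_overlapped_rewards(prompts: list[str], model: str):
    engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model=model))
    sampling = SamplingParams(max_tokens=512)
    loop = asyncio.get_running_loop()
    pool = ProcessPoolExecutor()

    async def handle(i: int, prompt: str) -> tuple[str, float]:
        final = None
        # generate() is an async generator; the last yielded output is the
        # finished trajectory, available before other requests complete.
        async for out in engine.generate(prompt, sampling, request_id=str(i)):
            final = out
        text = final.outputs[0].text
        # Kick off verification immediately instead of waiting for the batch,
        # hiding reward latency behind the remaining rollouts.
        reward = await loop.run_in_executor(pool, compute_reward, text)
        return text, reward

    return await asyncio.gather(*(handle(i, p) for i, p in enumerate(prompts)))
```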
Would it be possible to implement something like this in veRL? If you are interested in a feature like this, but don't have the bandwidth, I could help out myself.
Has been implemented in https://github.com/agentica-project/verl-pipeline, which was used to train the DeepCoder-14B model.
They saw a 2.5x speedup in code RL training.
Great! Didn't know about this project! Are there any plans to add this functionality to veRL?
I think they mention that this is only done for the 1.5B, not the 14B, yet. Would definitely love to see this merged into verl.
> I think they mention that this is only done for the 1.5B, not the 14B, yet
Yeah, it appears you are right; at least, they only provide the comparison for the 1.5B models. I'll keep this issue open then. It certainly seems like a good feature to have.
@faresobeid @sunjin-k @dvmazur It seems that their implementation is based on an earlier version of vLLM (before v0.8.2). As of vllm==0.8.2 and 0.8.3, the model executor runs in background processes launched by AsyncLLM, so we can no longer access the model weights from AsyncLLMEngine.
@youkaichao could you comment on why the latest version of vLLM limits weight handles?
And yes, I agree that using the async LLM engine sounds like a promising approach overall. We can use a compatible version of vLLM to develop the feature while waiting for compatibility patches from vLLM main.
> why the latest version of vLLM limits weight handles
what do weight handles mean?
> why the latest version of vLLM limits weight handles
> what do weight handles mean?
I think it refers to access to the model executor (the LLM). In the implementation above, the weights can be updated directly: https://github.com/agentica-project/verl-pipeline/blob/master/verl/workers/sharding_manager/fsdp_vllm.py#L99-L102.
I think that since vllm 0.8.2, the model executor runs in background processes, so we can no longer access the weights.
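For context, the pre-0.8.2 pattern looked roughly like the following (a sketch only; the exact attribute chain varies across vLLM versions, and `sync_weights_into_vllm` is a hypothetical helper, not verl-pipeline's actual function):

```python
import torch


def sync_weights_into_vllm(inference_engine, actor_weights: dict[str, torch.Tensor]) -> None:
    # Before vLLM 0.8.2 the model executor lived in the driver process,
    # so the trainer could reach the model runner's nn.Module directly
    # and push updated actor weights into it in place.
    model = inference_engine.llm_engine.model_executor.driver_worker.model_runner.model
    model.load_weights(actor_weights.items())
```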
I have tried the asynchronous rollout approach in verl-pipeline, but I'm currently facing an issue: when rollout_wg and actor_wg are separated, updating the vLLM parameters relies on Ray's communication plane instead of high-performance communication operators like NCCL. This increases the communication load on the driver process (since all data is distributed through the driver), which results in worse performance than the HybridEngine. In actual tests, communication takes 80 seconds, accounting for 20% of a 400-second step. Resolving this issue might require Ray to support GPU-level communication.
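To illustrate the data path described above (a toy sketch, not verl-pipeline's code; `ActorWorker` and `RolloutWorker` are made-up names):

```python
import ray
import torch


@ray.remote(num_gpus=1)
class ActorWorker:
    def get_weights(self) -> dict[str, torch.Tensor]:
        # Tensors leave the GPU and are serialized into Ray's object store.
        return {"layer.weight": torch.zeros(1024, 1024)}


@ray.remote(num_gpus=1)
class RolloutWorker:
    def set_weights(self, weights: dict[str, torch.Tensor]) -> None:
        # Deserialized from the object store, then copied back onto the GPU;
        # no direct GPU-to-GPU (NCCL) transfer between the two worker groups.
        for tensor in weights.values():
            tensor.cuda()


ray.init()
actor, rollout = ActorWorker.remote(), RolloutWorker.remote()
ray.get(rollout.set_weights.remote(actor.get_weights.remote()))
```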
Hi! Any updates here? Maybe we should use SGLang instead of vllm for the requested pipeline from DeepCoder?
Hi! I noticed veRL now supports async rollout for SGLang. I'm sure it would be possible to implement agentica's pipeline using it. Would you be interested in a feature like this? I could come up with a design / MVP if you are interested.
> Hi! I noticed veRL now supports async rollout for SGLang. I'm sure it would be possible to implement agentica's pipeline using it. Would you be interested in a feature like this? I could come up with a design / MVP if you are interested.
That would be great! Although I am not sure how you would want to leverage SGLang's async rollout.