Support megatron 0.6 in veRL
Description
I am opening this PR in the hope of adding Megatron 0.6 support to veRL (although I noticed that the veRL paper already seems to use Megatron 0.6 as the test version). From my naive perspective, I see two possible approaches (a rough sketch of both follows the list):
- Communicating at the parameter level.
- Creating a MemoryBuffer in veRL that is fully aligned with the ParamAndGradBuffer in Megatron 0.6, and then performing broadcast and other communication operations based on this buffer.
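For concreteness, here is a minimal sketch contrasting the two approaches, assuming a `torch.distributed` process group for the pipeline ranks and a hypothetical flat tensor aligned with Megatron 0.6's ParamAndGradBuffer layout; the names are illustrative, not the actual verl or Megatron API.

```python
import torch.distributed as dist


def sync_param_level(params, src_rank, pp_group):
    """Approach 1: broadcast each parameter tensor individually."""
    for p in params:
        # One collective per parameter: simple, but many small broadcasts.
        dist.broadcast(p.data, src=src_rank, group=pp_group)


def sync_buffer_level(flat_buffer, src_rank, pp_group):
    """Approach 2: broadcast one contiguous buffer whose layout mirrors
    Megatron 0.6's ParamAndGradBuffer; parameters then view into it."""
    # A single large collective; the receiving side needs a verl MemoryBuffer
    # whose offsets exactly match the Megatron buffer layout.
    dist.broadcast(flat_buffer, src=src_rank, group=pp_group)
```

The buffer-level variant trades extra layout bookkeeping for far fewer (and larger) collectives.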
In the current draft, when self._pp_rank == pp_rank, the code directly uses the buffer defined in Megatron 0.6 (without even checking whether use_distributed_optimizer is set) and communicates at the parameter level during parameter synchronization, which of course incurs some performance overhead.
At the very least, this approach seems feasible.
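To make the draft's behavior concrete, below is a minimal sketch of this parameter-level synchronization, assuming a hypothetical param_meta list of (name, shape, dtype) known on every rank and a pipeline-parallel process group; it is not the actual verl code.

```python
import torch
import torch.distributed as dist


def broadcast_params_from_pp_rank(module, param_meta, pp_rank, self_pp_rank,
                                  src_global_rank, pp_group):
    """Broadcast the parameters owned by `pp_rank` to every rank in `pp_group`."""
    owned = dict(module.named_parameters()) if self_pp_rank == pp_rank else {}
    synced = {}
    for name, shape, dtype in param_meta:
        if self_pp_rank == pp_rank:
            # Owning rank: use the parameter backed by Megatron 0.6's own
            # buffer directly, without copying into a verl-side MemoryBuffer.
            tensor = owned[name].data
        else:
            # Other ranks: allocate a placeholder tensor to receive into.
            tensor = torch.empty(shape, dtype=dtype, device="cuda")
        dist.broadcast(tensor, src=src_global_rank, group=pp_group)
        synced[name] = tensor
    return synced
```

Because each parameter is broadcast as a separate collective, this loop is where the overhead mentioned above comes from; aligning a verl MemoryBuffer with ParamAndGradBuffer would replace it with a single broadcast per buffer.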
Testing
PPO training deepseek-llm-7b-chat on GSM8K
- convergence
- performance
PPO training deepseek-coder-6.7b-instruct on GSM8K + MATH
- convergence
- performance
The curves demonstrate that this PR does not negatively impact training convergence. Performance shows different trends: an improvement in one case and a degradation in the other. The reasons for these differences require further analysis.
This looks really nice. We'll take some time to check how to align the two buffers to accelerate the resharding process.
Another question is whether there are any remaining issues in MCore 0.6. If not, we may no longer need to patch upstream Megatron.
When will this feature be merged into the main branch?
Hi @Chendong98, mcore has been upgraded to v0.11 in this PR: https://github.com/volcengine/verl/pull/392. Your contribution is acknowledged. Feel free to contact us if anything is wrong. Thanks!