
Make DAPO rollout faster and more efficient (Refactor ShardingManager)

Open sanghyun-son opened this issue 7 months ago • 4 comments

Thank you for sharing the great codebase.

While experimenting with DAPO, I observed that model resharding/offloading occurs multiple times when filter_groups is enabled. This happens because the current ShardingManager context reverts all sharded/offloaded models at __exit__, which is inefficient, especially with large models (e.g., 70B), when the context is re-entered multiple times without any model updates.

To address this, I refactored the lifecycle of ShardingManager into three separate functions: enter, rollout, and exit. This allows the model to be sharded/offloaded and later reverted only once, rather than at every rollout step. __enter__ and __exit__ operate the same as before.

To minimize interface changes, I kept some dummy arguments (e.g., the dummy inputs to setup_generate_sequences_efficient and teardown_generate_sequences_efficient), which can be revisited later. Feedback on the approach or implementation details is highly appreciated.
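To illustrate the idea, here is a minimal sketch of the lifecycle split. The class and method bodies below are hypothetical stand-ins, not verl's actual implementation; only the enter/rollout/exit naming follows the PR description. The counters make the saving visible: with the old per-rollout context, resharding happens once per generation attempt, while the split lifecycle reshards once around all of them.

```python
class ShardingManagerSketch:
    """Illustrative stand-in for verl's ShardingManager (not the real class).

    The original context manager reshards weights on __enter__ and reverts
    them on __exit__ for every rollout; the proposal splits the lifecycle so
    the expensive steps run once around many rollouts.
    """

    def __init__(self):
        self.reshard_count = 0  # times weights were gathered for rollout
        self.revert_count = 0   # times weights were offloaded back

    # --- original interface: reshard/revert on every `with` block ---
    def __enter__(self):
        self.enter()
        return self

    def __exit__(self, *exc):
        self.exit()

    # --- proposed split lifecycle ---
    def enter(self):
        self.reshard_count += 1  # stand-in for gathering/resharding weights

    def rollout(self, prompts):
        # stand-in for generate_sequences; weights are assumed resharded
        return [p + " -> generated" for p in prompts]

    def exit(self):
        self.revert_count += 1  # stand-in for offloading/reverting weights


# Old pattern: with filter_groups enabled, each extra generation batch
# re-enters the context and pays the reshard/revert cost again.
old = ShardingManagerSketch()
for _ in range(4):  # e.g. retries driven by num_gen_batches
    with old:
        old.rollout(["prompt"])
print(old.reshard_count)  # 4

# New pattern: shard once, roll out many times, revert once.
new = ShardingManagerSketch()
new.enter()
for _ in range(4):
    new.rollout(["prompt"])
new.exit()
print(new.reshard_count)  # 1
```

This is safe precisely because no model update happens between the repeated rollouts, so the resharded weights stay valid across them.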

Note: This PR is not yet ready to be merged. Pending tasks include:

  • Add support for Megatron (currently FSDP-only)
  • Test & benchmark performance improvements
  • Polish the implementation for readability and maintainability

Please let me know if I overlooked anything. Thanks in advance!

sanghyun-son avatar Apr 07 '25 12:04 sanghyun-son

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Apr 07 '25 12:04 CLAassistant

Verified the execution in my environment and observed approximately 20% speedup in rollout generation using the Qwen32B model (350s → 300s).

sanghyun-son avatar Apr 09 '25 13:04 sanghyun-son

> Verified the execution in my environment and observed approximately 20% speedup in rollout generation using the Qwen32B model (350s → 300s).

Excellent! But I don't understand: is it okay that the second rollout step generates sequences without calling the enter function? It seems to be missing a model weight sync? Can you explain this to me? I'm a newbie; looking forward to your answer.

PrometheusComing avatar May 27 '25 13:05 PrometheusComing

> Verified the execution in my environment and observed approximately 20% speedup in rollout generation using the Qwen32B model (350s → 300s).
>
> Excellent! But I don't understand: is it okay that the second rollout step generates sequences without calling the enter function? It seems to be missing a model weight sync? Can you explain this to me? I'm a newbie; looking forward to your answer.

OK, I understand the reason now, haha. It applies to num_gen_batches scenarios.

PrometheusComing avatar May 27 '25 13:05 PrometheusComing