[roadmap] verl Q3 development

Open eric-haibin-lin opened this issue 5 months ago • 2 comments

Past roadmap discussions for reference: https://github.com/volcengine/verl/issues/710 https://github.com/volcengine/verl/issues/22

The top priority for verl in Q3 is to make it a modular, foundational library that the community can extend: a starting point, not the destination.

composable model engines

Finish up https://github.com/volcengine/verl/discussions/1560 so that the parallelism strategy is implemented at the engine level, without exposing details to the worker (role) level. The FSDP/Megatron engines are expected to be created and run in a standalone fashion, and to be reused across different roles.

  • [x] fsdp actor, critic, ref (focus on fsdp2)
  • [ ] megatron actor, critic, ref
  • [ ] torchtitan integration (call for contribution)
  • [ ] switch all recipe/examples from fsdp1 to fsdp2 by default (and remove ill-maintained ones)

Work in progress interface for comments https://github.com/volcengine/verl/pull/1977
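To make the engine/role split above concrete, here is a minimal sketch. It is not the interface proposed in the linked PR: `ModelEngine`, `ToyEngine`, `RoleWorker` and their method names are hypothetical, and the "engine" is a one-parameter toy model rather than a real FSDP or Megatron backend. The point is only that the role-level worker calls the same two methods regardless of which strategy sits behind them.

```python
from dataclasses import dataclass, field
from typing import Protocol


class ModelEngine(Protocol):
    """Hypothetical engine interface (an assumption, not verl's actual API)."""

    def forward_backward(self, batch: list[float]) -> float: ...

    def optimizer_step(self) -> None: ...


@dataclass
class ToyEngine:
    """Stand-in for an FSDP- or Megatron-backed engine.

    A real engine would set up sharding and parallelism internally;
    here `strategy` is just a label and the model is a single scalar.
    """

    strategy: str
    weight: float = 0.0
    lr: float = 0.1
    _grad: float = field(default=0.0, repr=False)

    def forward_backward(self, batch: list[float]) -> float:
        # Mean squared error between the scalar weight and each target.
        loss = sum((self.weight - x) ** 2 for x in batch) / len(batch)
        self._grad = sum(2 * (self.weight - x) for x in batch) / len(batch)
        return loss

    def optimizer_step(self) -> None:
        # Plain SGD update, then clear the stored gradient.
        self.weight -= self.lr * self._grad
        self._grad = 0.0


class RoleWorker:
    """Role-level worker (actor, critic, ref) that only sees the engine interface."""

    def __init__(self, role: str, engine: ModelEngine):
        self.role = role
        self.engine = engine

    def update(self, batch: list[float]) -> float:
        loss = self.engine.forward_backward(batch)
        self.engine.optimizer_step()
        return loss


# The same worker class is reused for different roles and strategies.
actor = RoleWorker("actor", ToyEngine(strategy="fsdp2"))
critic = RoleWorker("critic", ToyEngine(strategy="megatron"))
```

With this split, switching a role from one strategy to another would only change which engine is constructed, not the worker code.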

rollout workers

  • [ ] optimize server mode rollout performance
  • [ ] modular rollout workers: VllmRolloutWorker and SGLangRolloutWorker, exposing the same APIs
  • [ ] support model with random init weight
  • [ ] weight resharding: optimize tp x dp dispatch, and support receiving weight from separate resource groups
  • [ ] Agent RL infrastructure https://github.com/volcengine/verl/issues/2618
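As a rough illustration of "exposing the same APIs", the sketch below defines one abstract rollout interface with two interchangeable backends. The method names `generate` and `update_weights`, the helper `rollout_step`, and the stub classes are all assumptions for illustration. The stubs do not call vLLM or SGLang; they only return tagged strings so the example runs on its own.

```python
from abc import ABC, abstractmethod


class RolloutWorker(ABC):
    """Hypothetical common interface for rollout backends."""

    @abstractmethod
    def update_weights(self, version: int) -> None:
        """Receive new policy weights (represented here by a version number)."""

    @abstractmethod
    def generate(self, prompts: list[str]) -> list[str]:
        """Return one completion per prompt."""


class _StubWorker(RolloutWorker):
    backend = "stub"

    def __init__(self) -> None:
        self.weight_version = 0

    def update_weights(self, version: int) -> None:
        self.weight_version = version

    def generate(self, prompts: list[str]) -> list[str]:
        # A real backend would run inference; the stub only tags the prompt.
        return [f"{p} -> {self.backend}@v{self.weight_version}" for p in prompts]


class StubVllmWorker(_StubWorker):
    backend = "vllm"  # label only; the vLLM library is not used


class StubSGLangWorker(_StubWorker):
    backend = "sglang"  # label only; the SGLang library is not used


def rollout_step(worker: RolloutWorker, prompts: list[str], version: int) -> list[str]:
    """Trainer-side code that depends only on the shared interface."""
    worker.update_weights(version)
    return worker.generate(prompts)
```

Because `rollout_step` is written against `RolloutWorker`, either backend can be passed in without changing the caller.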

Additional ongoing efforts:

  • https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/issues/131
  • https://github.com/volcengine/verl/issues/1882

async & disaggregated architecture

  • [x] one-step off async pipeline (WIP: https://github.com/volcengine/verl/pull/2231), further performance optimization & profiling needed
  • [ ] streaming/partial rollout (WIP: https://github.com/volcengine/verl/pull/2200)
  • [ ] performance tuning, and reference throughput benchmark across [model type, model size, seqlen, hardware, num accelerators, worker role] to achieve better disaggregated resource allocation
  • [ ] fully-async pipeline
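One way to picture the one-step-off idea, under the assumption that it means overlapping generation of the next batch with training on the current one: the batch for step t+1 is generated with the weights available when step t starts, so each batch is at most one policy version stale. The sketch below is not the implementation in the linked PR; `rollout`, `train`, and `one_step_off_loop` are hypothetical names, and a thread pool stands in for separate rollout resources.

```python
from concurrent.futures import ThreadPoolExecutor


def rollout(policy_version: int, batch_id: int) -> dict:
    """Pretend to generate a batch; records which policy version produced it."""
    return {"batch_id": batch_id, "policy_version": policy_version}


def train(policy_version: int, batch: dict) -> int:
    """Pretend to run one training step; returns the new policy version."""
    return policy_version + 1


def one_step_off_loop(num_steps: int) -> list[tuple[int, int, int]]:
    """Return (step, version that generated the batch, trainer version) per step."""
    log = []
    version = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Prefetch the first batch with the initial weights.
        pending = pool.submit(rollout, version, 0)
        for step in range(num_steps):
            batch = pending.result()
            if step + 1 < num_steps:
                # Start generating the next batch before training on this one,
                # using the weights that exist right now.
                pending = pool.submit(rollout, version, step + 1)
            log.append((step, batch["policy_version"], version))
            version = train(version, batch)
    return log


if __name__ == "__main__":
    for step, gen_version, train_version in one_step_off_loop(4):
        print(step, gen_version, train_version)
```

A fully asynchronous pipeline would relax the bound of one version of staleness, which is why it appears as a separate item in the list above.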

multi-turn, data, config infra

  • [ ] better message infra for multi-turn messages, dense reward @SwordFaith
  • [ ] better dataset schema for train & rollout. We need documentation too. TRL's documentation is good https://huggingface.co/docs/trl/en/dataset_formats @SwordFaith
  • [ ] use tensordict and nested-tensor to remove padding and replace DataProto
  • [ ] replace OmegaConf with read-only dataclasses for verl internal config passing, which also makes unit testing easier https://github.com/volcengine/verl/pull/2379 https://github.com/volcengine/verl/pull/2147/files
  • [ ] P1: distributed data pool from https://arxiv.org/pdf/2507.01663v1 https://github.com/volcengine/verl/issues/2539
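For the read-only config item in the list above, a minimal sketch of the general technique using frozen dataclasses from the standard library. The class names, field names, and the `from_dict` helper are hypothetical and are not taken from the linked PRs; the input dict stands in for whatever a YAML or OmegaConf loader would produce.

```python
from dataclasses import (
    FrozenInstanceError,
    dataclass,
    field,
    fields,
    is_dataclass,
    replace,
)


@dataclass(frozen=True)
class OptimConfig:
    lr: float = 1e-6
    weight_decay: float = 0.01


@dataclass(frozen=True)
class ActorConfig:
    strategy: str = "fsdp2"
    micro_batch_size: int = 8
    optim: OptimConfig = field(default_factory=OptimConfig)


def from_dict(cls, raw: dict):
    """Build a frozen config from a plain dict, rejecting unknown keys."""
    known = {f.name: f for f in fields(cls)}
    unknown = set(raw) - set(known)
    if unknown:
        raise KeyError(f"unknown keys for {cls.__name__}: {sorted(unknown)}")
    kwargs = {}
    for name, value in raw.items():
        field_type = known[name].type
        # Recurse into nested config sections.
        if is_dataclass(field_type) and isinstance(value, dict):
            value = from_dict(field_type, value)
        kwargs[name] = value
    return cls(**kwargs)


cfg = from_dict(ActorConfig, {"strategy": "megatron", "optim": {"lr": 1e-5}})

try:
    cfg.strategy = "fsdp"  # frozen dataclasses reject in-place mutation
except FrozenInstanceError:
    pass

# Changes are made by building a modified copy instead.
test_cfg = replace(cfg, micro_batch_size=1)
```

In unit tests, a config like this can be constructed directly with keyword arguments, with no config file or resolver involved.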

streamline new model workflow

  • [ ] document the workflow for adding a new HF model to verl. With the latest vLLM, there is no longer any need to add the weight loader mentioned in https://verl.readthedocs.io/en/latest/advance/fsdp_extension.html
  • [ ] better abstraction and registration system for multi-modal models. Currently, different multi-modal models have inconsistent config attributes (e.g. rope), freeze/unfreeze setup, input/output processing... (ideally this would be handled at the Hugging Face transformers level, but that is not sufficient right now cc @NielsRogge) (RFC needed)
  • [ ] verl needs a documentation page about the latest status of model support and per model related features (lora, sequence parallelism, megatron, etc)

high quality recipes and end2end optimizations

  • [x] retool recipe (code is ready, going through reviews)
  • [ ] SOTA multimodal vlm RL recipe (call for contribution)
  • [ ] enhance DAPO recipe with larger models, and provide scripts with high training throughput (many perf knobs are not turned on in the current script)
  • we welcome more recipes from the community, please open an RFC if you're interested in contributing before opening any PR for recipes https://github.com/volcengine/verl/issues/2136

Additional existing ongoing features:

  • https://github.com/volcengine/verl/issues/1033
  • https://github.com/volcengine/verl/discussions/2171

Many roadmap tasks in this doc were initiated by @vermouth1992 and @SwordFaith, and credit goes to them.

eric-haibin-lin avatar Jul 07 '25 00:07 eric-haibin-lin

Please let me know which task I can start with, and I will take it up. Do we have a community meeting, Slack, or some other medium for communication?

bhks avatar Oct 16 '25 05:10 bhks

The code is very good. Can you support the latest rollout PP?

mpj1234 avatar Dec 01 '25 08:12 mpj1234