verl next steps
Introduction
We have observed wide adoption of verl across RL research papers and production systems over the last 6 months. Most of them build on top of verl to develop advanced, customized features such as asynchronous RL, advanced data sampling, replay policies, and multi-modal RL spanning audio, image, and video. We acknowledge that it is impossible for verl to absorb all of these great features. So, instead of building verl as a monolithic repo, we would like to make verl a composable and customizable library that people can easily build on top of (e.g., import core components and start their own repo). Moreover, we believe that a multi-backend system with well-defined APIs is essential in the long term for easy integration of different training and inference systems. In light of these observations, we propose the following approach.
Approach
An RL library can be decomposed into 5 parts: 1) Rollout Engine; 2) Model Engine; 3) Weight Transfer Engine; 4) Agent Loop; and 5) Data Transfer System. Each component acts as a service whose backend is opaque to the RL system. The RL system is a single controller that orchestrates the data flow among these services and adds customized, algorithm-specific components such as a replay buffer.
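As a rough illustration, here is a minimal sketch of the single-controller view; the interfaces and names below are hypothetical, not verl's actual API:

from typing import Any, Protocol

class RolloutEngine(Protocol):
    def generate(self, prompts: list[Any]) -> list[Any]: ...

class ModelEngine(Protocol):
    def train_step(self, batch: list[Any]) -> dict: ...

class WeightTransferEngine(Protocol):
    def sync(self, src: ModelEngine, dst: RolloutEngine) -> None: ...

def controller_step(data, agent_loop, model, rollout, transfer, replay_buffer):
    # One RL iteration: the controller only moves data between services
    # and adds algorithm-specific pieces such as the replay buffer.
    trajectories = agent_loop.run(data.next_batch(), rollout)  # agent loop + rollout
    replay_buffer.extend(trajectories)                         # algorithm-specific
    metrics = model.train_step(replay_buffer.sample())         # model engine
    transfer.sync(model, rollout)                              # weight transfer
    return metrics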
Rollout Engine
Recent RL infrastructures adopt native server-mode rollouts (e.g., slime). This makes it easy to integrate new inference-engine features and provides a clean abstraction between the rollout engine and the rest of the RL system. It also makes multi-backend rollout support easy, since the common abstraction over the various inference backends simply becomes an HTTP endpoint. verl is migrating to this design.
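For illustration, a rollout call under this design reduces to a plain HTTP request; the endpoint URL, model name, and port below are placeholders:

import requests

def rollout(prompt: str, base_url: str = "http://localhost:8000/v1") -> str:
    # The backend (vLLM, SGLang, ...) is hidden behind an OpenAI-compatible
    # endpoint, so the RL system never links against engine internals.
    resp = requests.post(
        f"{base_url}/completions",
        json={"model": "policy", "prompt": prompt, "max_tokens": 256},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]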
Model Engine
Similar to the rollout engine, verl will also turn the model engine into a service. To do so, a well-defined interface is necessary. It is also important to make the model engine extensible to frontier model architectures such as Qwen3-Omni, VLA, and even diffusion models. With a well-defined model engine, high-level trainers (e.g., SFT/DPO/RM) and workers (e.g., Actor/Critic) can reuse most of the code by simply changing the loss function and the dataloader. Currently, verl supports the FSDP backend and the Megatron-Core (mcore) backend. We found them inadequate for the latest MoE models such as Qwen3-Next and Qwen3-VL-MoE, and will start to integrate more backends once the model engine is ready. Please refer to https://github.com/vermouth1992/verl/blob/chi/dev/roadmap/docs/README_model_engine.md for more details.
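To make the reuse concrete, here is a hedged sketch; the engine interface is illustrative, not the one defined in the linked README:

from typing import Callable, Iterable

def fit(engine, dataloader: Iterable, loss_fn: Callable):
    # Shared training loop: backend-specific details (FSDP, mcore, ...)
    # live inside `engine`; trainers differ only in loss_fn and dataloader.
    for batch in dataloader:
        outputs = engine.forward(batch)
        loss = loss_fn(outputs, batch)  # the only trainer-specific math
        engine.backward(loss)
        engine.optimizer_step()

# e.g., SFT vs. RM training would differ only in the two arguments:
#   fit(engine, sft_loader, cross_entropy_loss)
#   fit(engine, preference_loader, bradley_terry_loss)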
Weight Transfer Engine
verl will gradually abandon the design that places the rollout and the model in the same process and move towards a separate-process design. The model engine and rollout engine have to expose APIs such that weight transfer can be performed by a backend-agnostic engine supporting both CUDA IPC and NCCL. verl will integrate the design illustrated in https://github.com/MoonshotAI/checkpoint-engine.
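As a rough sketch of what such a backend-agnostic API could look like (the class and method names are assumptions, not the checkpoint-engine or verl API):

class WeightTransferEngine:
    # Hypothetical backend-agnostic transfer: CUDA IPC for co-located
    # trainer/rollout GPUs, NCCL for cross-process/cross-node transfer.
    def __init__(self, backend: str = "nccl"):
        assert backend in ("nccl", "cuda_ipc")
        self.backend = backend

    def transfer(self, named_tensors, rollout):
        for name, tensor in named_tensors:
            if self.backend == "cuda_ipc":
                rollout.import_ipc_handle(name, tensor)  # share device memory
            else:
                rollout.broadcast_weight(name, tensor)   # collective send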
Agent Loop
Agent frameworks (e.g., SWE-Agent) typically work in the OpenAI Gym style, which is basically a simple loop:
import env_lib

env = env_lib.make(env_id)
obs = env.reset()  # initial observation, i.e., the prompt
done = False
while not done:
    response = llm.call(obs)
    action = extract_action(response)
    next_obs, reward, done, info = env.step(action)  # tool call or env interaction
    obs = next_obs
The Agent Loop in verl is the interface that connects agent/env frameworks to the RL training framework. Most agent frameworks interact with the LLM through a standard OpenAI-compatible server, string in and string out. This causes issues in RL training because the tokenizer decode + encode round-trip is not invertible. Thus, in their customized Agent Loop, users have to handle:
- token-in, token-out generation (see the sketch below)
- converting each trajectory into a format that the trainer can consume
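Here is a hedged sketch of one such loop step; the function and parameter names are illustrative, not verl's actual interface:

def agent_loop_step(rollout, tokenizer, env, obs_ids):
    # 1) token in, token out: no string round-trip on the training path
    response_ids = rollout.generate(prompt_token_ids=obs_ids)
    # 2) decode only for the environment/tool side, never for training
    action = extract_action(tokenizer.decode(response_ids))
    next_obs, reward, done, info = env.step(action)
    # 3) record the token-level trajectory for the trainer
    sample = {"prompt_ids": obs_ids, "response_ids": response_ids, "reward": reward}
    return next_obs, done, sample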
We will publish detailed instructions on how to extend to a new environment/agent framework shortly.
Data Transfer System
One of the biggest challenges of the single-controller programming abstraction is the data transfer overhead between services. To tackle this, the community has proposed a TransferQueue prototype. The core idea is to decouple data management from data storage, so that services only pass data references while the actual data is fetched directly point-to-point. This enables sample-level data routing across the entire post-training system, preserving the flexibility of a single-controller architecture while minimizing data transfer overhead. verl will adopt this design once the prototype is ready.
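Illustratively, the API below is a guess at the spirit of that design, not the prototype's actual interface:

from dataclasses import dataclass

@dataclass
class DataRef:
    sample_id: int
    owner: str  # address of the service currently holding the payload

def consume_batch(queue, fetch, n=32):
    # The controller only routes lightweight references; payloads move
    # point-to-point from the owning service to the consumer.
    refs: list[DataRef] = queue.get(n)
    return [fetch(ref.owner, ref.sample_id) for ref in refs]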
Let's embrace a guiding principle from DeepMind's early RL engineering:
Be a library, not a framework.
This philosophy empowers innovation by providing flexible tools, not rigid structures. In that spirit, let's build on verl-core instead of forking the entire verl repo.
It seems that slime already incorporates most of the designs mentioned above?
Hi VERL community,
I'm exploring the potential of integrating the VERL rollout engine with an advanced LLM request scheduler called EPP. I believe its specialized scheduling optimizations for LLM workloads could bring significant benefits to RL rollout workloads.
To give some brief context, EPP is a smart scheduler that handles:
- Prefix-cache aware scheduling
- Load-aware scheduling
- Disaggregated serving
- [Coming soon] Data-parallel aware scheduling for MoE models
I see that the VERL architecture is moving from a monolithic repo towards a more composable and customizable model. In parallel, the EPP community is planning to evolve EPP into a standalone (kubeless) component, which would make it consumable outside of Kubernetes.
I'm curious to learn about the VERL community's initial thoughts on enabling the rollout engine to integrate with external inference systems through well-defined APIs. Any high-level feedback or pointers on this would be greatly appreciated as I investigate the feasibility of plugging EPP into the VERL stack. Thanks!
@capri-xiyue, the current design of verl allows rollout servers to be registered with a router, or you can change the destination to any other EPP-like system. Once your scheduler knows the IPs and ports of these rollout servers, you can easily plug in these feature-rich inference optimizations.
Thanks for the insightful post, @vermouth1992. Do you have further details on this? Is the best way to integrate gym-like environments currently to hack them into the SGLang rollouts with the Interaction system?
@bhks Are there any existing docs or examples regarding how to do such plugin configuration in verl?
@capri-xiyue This pull request has a diagram that may be helpful: https://github.com/volcengine/verl/pull/3456
You can schedule or run your agents outside verl by implementing AgentLoopManager and then registering all of the LLM engines with your smart router. The router address is what goes to the agents; the agents talk to the router, and the router performs prefix-aware, disaggregated, or other algorithm-based routing to the targets.
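For illustration only (Router here is a stand-in for an EPP-like scheduler, not a real verl or EPP API):

class Router:
    def __init__(self, address: str):
        self.address = address
        self.targets: list[str] = []

    def register(self, url: str) -> None:
        self.targets.append(url)  # router load-balances across targets

router = Router("http://router:9000")
for url in ["http://10.0.0.1:8000", "http://10.0.0.2:8000"]:
    router.register(url)  # rollout servers' IP:port
# Agents are configured with router.address only; per request, the router
# picks a target using prefix-aware, disaggregated, or other policies.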
Hope it's helpful.