[RFC] Reward Loop
Reward Loop has been implemented in the current main branch under verl/experimental/reward, and it refactors almost the entire reward computation pipeline. This issue explains Reward Loop and is open to suggestions from the community.
Motivation
Specifically, Reward Loop will deprecate some legacy implementations.
- For the reward models (RM) scenario:
- Deprecate the legacy FSDP/Megatron RM implementation (both GenRM and DisRM).
- As an alternative, reward models will be launched as multiple vLLM/SGLang servers plus one router that handles incoming RM requests with load-balancing strategies.
- For the non-RM reward scenario:
- Deprecate the legacy RewardManager, which accepts a full batch and computes the reward for each sample sequentially, resulting in low efficiency.
- As a substitute, Reward Loop will launch multiple CPU-only workers and process samples asynchronously. This can be integrated with Agent Loop: once a sample has been rolled out, its reward can be computed immediately, without waiting for the full batch to complete.
- Flexible and user-friendly customized reward functions
- Many current algorithms and tasks require flexible reward function design, such as (1) integrating rule-based functions with feedback from reward models, or (2) combining multiple reward models. This cannot be handled gracefully in the current implementation.
- Reward Loop aims to provide flexibility for user-customized functions. This changes the interface of the reward function by passing additional arguments such as reward_router_address.
API Design
RewardModelManager
This is for the reward model scenario. RewardModelManager will launch multiple vLLM/SGLang workers and one router, and expose only reward_router_address.
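As a rough illustration, a client of RewardModelManager only ever sees the router address; the number of vLLM/SGLang replicas and the load-balancing strategy stay behind it. The sketch below assumes the router fronts an OpenAI-compatible endpoint (as vLLM/SGLang servers typically expose); the exact path, payload, and model name are assumptions, not the verl API.

```python
# Hypothetical sketch: querying the reward router exposed by RewardModelManager.
# Assumes the router fronts OpenAI-compatible vLLM/SGLang servers; the exact
# endpoint and payload in verl may differ.
import requests

def query_genrm(reward_router_address: str, judge_prompt: str) -> str:
    """Send one judging request to the reward router and return the raw text."""
    resp = requests.post(
        f"http://{reward_router_address}/v1/chat/completions",
        json={
            "model": "reward-model",  # placeholder model name
            "messages": [{"role": "user", "content": judge_prompt}],
            "temperature": 0.0,
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```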
RewardManagerWorker
This is a remote Ray class to process incoming reward computation requests.
- It will launch a reward manager such as DAPO, Naive, etc.
- For the RM scenario, it will pass the router address to the user-customized reward function, e.g., compute_score(..., reward_router_address).
- For the non-RM scenario, the behavior is the same as the legacy RewardManager, only changing the legacy batch mode to a sample-wise async mode integrated with the agent loop rollout.
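For example, a user-customized reward function could combine a rule-based check with a GenRM verdict obtained through the router. The argument names below follow the common verl compute_score convention plus the new reward_router_address, and the query_genrm helper is the hypothetical one from the earlier sketch; treat the exact signature as an assumption.

```python
# Illustrative sketch of a customized reward function receiving the router address.
# The argument names follow the common verl compute_score convention; the exact
# signature passed by RewardManagerWorker is an assumption here.
def compute_score(data_source, solution_str, ground_truth, extra_info=None,
                  reward_router_address=None):
    # Rule-based component: exact-match check against the ground truth.
    rule_score = 1.0 if solution_str.strip() == str(ground_truth).strip() else 0.0

    # Model-based component: ask the GenRM behind the router to grade the answer.
    # query_genrm is the hypothetical helper from the earlier router sketch.
    genrm_score = 0.0
    if reward_router_address is not None:
        verdict = query_genrm(
            reward_router_address,
            f"Question meta: {extra_info}\nAnswer: {solution_str}\n"
            f"Reference: {ground_truth}\nReply with 1 if correct, else 0.",
        )
        genrm_score = 1.0 if verdict.strip().startswith("1") else 0.0

    # Combine the two sources; the weighting is up to the user.
    return 0.5 * rule_score + 0.5 * genrm_score
```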
[class name to be filled] (in progress)
For the colocate mode (where the rollout server and reward server are colocated in the same resource pool), we need an extra class to:
- Run in the single controller and expose the method compute_rm_scores(batch: DataProto) -> DataProto.
- Initialize with multiple RewardManagerWorker instances to handle incoming batches and compute rewards.
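A rough shape for this class might look like the following sketch. Since the implementation is still in progress, the class name, the Ray usage, and the chunk/concat details are placeholders rather than the final verl API.

```python
# Hypothetical sketch of the colocate-mode class described above: it lives in the
# single controller, holds several RewardManagerWorker actors, and splits a batch
# across them. Names and chunking logic are placeholders, not the final verl API.
import ray
from verl import DataProto  # assumed import path

class ColocateRewardLoopManager:
    def __init__(self, worker_cls, num_workers: int, config):
        # RewardManagerWorker is a remote Ray class per the RFC; worker_cls is
        # assumed to be that remote class.
        self.workers = [worker_cls.remote(config) for _ in range(num_workers)]

    def compute_rm_scores(self, batch: DataProto) -> DataProto:
        # Split the batch into one chunk per worker and compute rewards in parallel.
        chunks = batch.chunk(len(self.workers))
        futures = [w.compute_rm_scores.remote(c) for w, c in zip(self.workers, chunks)]
        results = ray.get(futures)
        # Concatenate per-chunk results back into a single DataProto.
        return DataProto.concat(results)
```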
Also, we note that some class names may be confusing (RewardModelManager, RewardManagerWorker, XXXRewardLoopManager), and we are open to suggestions.
@wuxibin89 @vermouth1992 @PeterSH6 👀
Can we still customize the reward manager in the original way?
You can hard-code enable_async_reward = False in verl/experimental/agent_loop.py to disable the reward loop.
BTW, reward loop will become the default, and the old reward manager will be deprecated.
In my view, the reward loop covers all features of the legacy reward manager, and users do not need to change their code. If you have any use cases that the reward loop cannot support, please feel free to bring them up.
UPD: You can add reward_model.use_reward_loop=False to customize the reward manager as before.
Can you share some scripts/examples for using GenRM in the reward loop?
Please refer to recipe/fapo/run_fapo_7b.sh for example usage and this doc for more user instructions and architecture designs.
Hi, I'm having the same issue as #4346 with RewardManagerLoop when I use a custom reward manager. Is there any example of how to define a custom RewardManagerLoop? run_fapo_7b.sh doesn't use RewardManagerLoop, and the doc only explains the implementation without giving examples.
Thanks for the great feature.
```python
enable_async_reward = (
    self.reward_router_address is not None and self.config.reward_model.enable_resource_pool
) or not self.config.reward_model.enable
```
Does this disallow using a reward vLLM server in colocate mode?
@yumikim381 Hi, relevant docs will be added soon. You can also refer to the examples in verl/experimental/reward/reward_loop/*.py, where you should inherit from the RewardLoopManagerBase class and implement the run_single method.
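Until the docs land, a minimal custom manager might look roughly like the sketch below. The base class name and the run_single method come from the comment above; the import path, the async signature, and the sample field names are assumptions.

```python
# Hypothetical sketch of a custom reward loop manager. Only the base class name and
# the run_single method come from the maintainer's comment; everything else
# (import path, sample fields, return type) is an assumption for illustration.
from verl.experimental.reward.reward_loop import RewardLoopManagerBase  # assumed path

class MyRewardLoopManager(RewardLoopManagerBase):
    async def run_single(self, sample):
        """Compute the reward for one rolled-out sample asynchronously."""
        response = sample["response_str"]       # assumed field name
        ground_truth = sample["ground_truth"]   # assumed field name
        score = 1.0 if response.strip() == ground_truth.strip() else 0.0
        return {"score": score}
```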
@Liang-Qiu Thanks for your attention. Colocate mode will allow using the reward loop DisRM in https://github.com/volcengine/verl/pull/4466. BTW, in colocate mode, reward workers are launched in the single controller rather than in the agent loop worker.
The current Reward Loop implementation appears to launch multiple workers for parallel reward computation, but I'm using an external Generate Reward Model service with limited concurrency capacity. I cannot find any configuration parameter or mechanism to control the maximum number of concurrent requests sent to the external RM service. Without proper concurrency limiting, this could overwhelm our external service and cause request failures or degraded performance.
Question: Is there a built-in way to configure the maximum concurrent requests to external reward model services? If not, could this feature be added to prevent overloading external RM APIs?
@nantenT You can use the rate limit reward manager by setting reward_model.reward_manager=limited, with implementation details in verl/experimental/reward/reward_loop/limited.py.
You can directly modify the concurrency parameters in https://github.com/volcengine/verl/blob/01ab536cdfa9ffdd36d9b8996448c6b680fbe695/verl/experimental/reward/reward_loop/limited.py#L264-L307 (as a built-in way).
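If you would rather cap concurrency in your own reward code instead of relying on the built-in limited manager, a generic pattern is to wrap the external RM call in an asyncio.Semaphore, as in the sketch below (call_external_rm is a placeholder for whatever async client you use; this is not the verl built-in).

```python
# Generic concurrency-limiting pattern (not the verl built-in): bound the number of
# in-flight requests to an external reward model service with a semaphore.
import asyncio

MAX_CONCURRENT_RM_REQUESTS = 8  # tune to what the external service can handle
_rm_semaphore = asyncio.Semaphore(MAX_CONCURRENT_RM_REQUESTS)

async def rate_limited_rm_call(call_external_rm, prompt: str) -> float:
    """Run call_external_rm(prompt) with at most MAX_CONCURRENT_RM_REQUESTS in flight."""
    async with _rm_semaphore:
        return await call_external_rm(prompt)
```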
@yyDing1 A quick question: if I hard-code enable_async_reward = (self.reward_router_address is not None and self.config.reward_model.enable_resource_pool) or not self.config.reward_model.enable to enable_async_reward = False, will reward computation fall back to verl/verl/workers/reward_manager/naive.py?
You can directly set reward_model.use_reward_loop=False to use the batch reward manager.
Why is it called rewardLoop, by the way? I don't see any sort of loop in the reward manager though.
@paipeng-quiver Good question. The name reward loop is mainly motivated by the design goal and abstraction.
- The reward loop is designed for hybrid reward scenarios, where users may compose multiple reward sources, e.g., different discriminative RMs (DisRM), generative RMs (GenRM), and rule-based rewards, and define how they are grouped and combined. In this setting, reward computation is not a one-shot function call (as in the previous reward manager) but a user-defined interaction pattern between the policy outputs and multiple reward components; this is what the "loop" refers to.
- Reward Loop follows the design philosophy of Agent Loop, i.e., rewards are computed asynchronously. Even if the reward manager itself does not expose an explicit loop, the overall reward computation still proceeds in a cycle of questions → agent loop → reward loop → rollout with reward score.
So the term rewardLoop is intended to capture the idea that reward computation is an ongoing, asynchronous process rather than a single static one-function pass.