
[RFC] Reward Loop

yyDing1 opened this issue 1 month ago

Reward Loop has been implemented in the current main branch under verl/experimental/reward, and will refactor nearly the entire reward computation pipeline. This issue explains the Reward Loop design and is open to suggestions from the community.

Motivation

Specifically, Reward Loop will deprecate some legacy implementations.

  • For the reward model (RM) scenario:
    • Deprecate the legacy FSDP/Megatron RM implementations (both GenRM and DisRM).
    • As an alternative, reward models will be launched as multiple vLLM/SGLang servers behind a single router that handles incoming RM requests with load-balancing strategies.
  • For the non-RM reward scenario:
    • Deprecate the legacy RewardManager, which accepts a full batch and computes the reward for each sample sequentially, resulting in low efficiency.
    • As a substitute, Reward Loop launches multiple CPU-only workers and processes samples asynchronously. This integrates with AgentLoop: as soon as a sample finishes rollout, its reward can be computed directly, without waiting for the full batch to complete.
  • Flexible and user-friendly custom reward functions
    • Many current algorithms and tasks require flexible reward design, such as (1) combining rule-based functions with feedback from reward models, or (2) composing multiple reward models. The current implementation cannot handle this gracefully.
    • Reward Loop aims to provide this flexibility for user-customized functions. It changes the reward function interface by passing additional arguments such as reward_router_address (see the sketch after this list).
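
To make the intended interface concrete, below is a minimal sketch of such a hybrid reward function. The positional arguments, the OpenAI-compatible endpoint on the router, and the score parsing are assumptions based on this RFC, not the final API.

import requests

def compute_score(data_source, solution_str, ground_truth, extra_info=None,
                  reward_router_address=None):
    # (1) rule-based component, e.g. exact match against the ground truth
    rule_score = 1.0 if solution_str.strip() == str(ground_truth).strip() else 0.0

    # (2) optional model-based component queried through the RM router
    rm_score = 0.0
    if reward_router_address is not None:
        resp = requests.post(
            f"http://{reward_router_address}/v1/chat/completions",  # assumed OpenAI-compatible endpoint
            json={
                "model": "reward-model",  # placeholder model name
                "messages": [{
                    "role": "user",
                    "content": f"Rate this answer from 0 to 1:\n{solution_str}",
                }],
            },
            timeout=60,
        )
        try:
            rm_score = float(resp.json()["choices"][0]["message"]["content"].strip())
        except (KeyError, ValueError):
            rm_score = 0.0  # judge-output parsing is task-specific

    # (3) combine the reward sources however the task requires
    return 0.5 * rule_score + 0.5 * rm_score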

API Design

RewardModelManager

This is for the reward model scenario. RewardModelManager launches multiple vLLM/SGLang workers plus one router, and exposes only reward_router_address.
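
The routing and load-balancing strategy is not pinned down in this RFC; purely as an illustration of the idea, a round-robin router over the launched server addresses could look like this (all names below are hypothetical):

import itertools

class SimpleRMRouter:
    # Illustrative round-robin router; the real RewardModelManager router may differ.
    def __init__(self, server_addresses):
        # e.g. addresses of the launched vLLM/SGLang RM servers
        self._cycle = itertools.cycle(server_addresses)

    def next_server(self) -> str:
        # each incoming RM request is forwarded to the next server in turn
        return next(self._cycle)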

RewardManagerWorker

This is a remote Ray class to process incoming reward computation requests.

  • It launches a reward manager such as DAPO, Naive, etc.
  • For the RM scenario, it passes the router address to the user-customized reward function, e.g. compute_score(..., reward_router_address).
  • For the non-RM scenario, the behavior is the same as the legacy RewardManager, except that the legacy batch mode becomes a sample-wise async mode (see the sketch below).
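
As a rough illustration of the sample-wise async mode (the worker interface here is a placeholder, not the actual code in verl/experimental/reward):

import asyncio

async def score_sample(sample, compute_score_fn):
    # run the (possibly blocking) reward function off the event loop
    return await asyncio.to_thread(compute_score_fn, sample)

async def score_samples_async(samples, compute_score_fn):
    # each sample is scored as soon as it is available,
    # instead of waiting for the full batch to finish rollout
    tasks = [asyncio.create_task(score_sample(s, compute_score_fn)) for s in samples]
    return await asyncio.gather(*tasks)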

Integrated with agent loop rollout

[class name to be filled] (in progress)

For the colocate mode (where the rollout server and reward server are colocated in the same resource pool), we need an extra class to:

  • Run in the single controller and expose the method compute_rm_scores(batch: DataProto) -> DataProto
  • Initialize with multiple RewardManagerWorker instances to handle incoming batches and compute rewards (a rough sketch follows this list)
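
A rough sketch of what this class could look like; the class name is deliberately left open above, and the DataProto chunk/concat helpers and the worker method name are assumptions:

import ray
from verl.protocol import DataProto  # verl's batch container

class ColocateRewardLoopManager:  # placeholder name; the RFC leaves the name open
    def __init__(self, num_workers: int):
        # RewardManagerWorker is the remote Ray class described above (import path omitted)
        self.workers = [RewardManagerWorker.remote() for _ in range(num_workers)]

    def compute_rm_scores(self, batch: DataProto) -> DataProto:
        # runs in the single controller: split the batch, fan out to workers, gather results
        chunks = batch.chunk(len(self.workers))
        futures = [w.compute_rm_scores.remote(c) for w, c in zip(self.workers, chunks)]
        return DataProto.concat(ray.get(futures))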

Also, we note that some class names may be confusing (RewardModelManager, RewardManagerWorker, XXXRewardLoopManager), and we are open to suggestions.

yyDing1 avatar Nov 27 '25 04:11 yyDing1

@wuxibin89 @vermouth1992 @PeterSH6 👀

yyDing1 avatar Nov 27 '25 04:11 yyDing1

So can we still customize the reward manager the way we did before?

weixiaolong94-hub avatar Nov 27 '25 13:11 weixiaolong94-hub

You can hard-code enable_async_reward = False in verl/experimental/agent_loop.py to disable the reward loop.

BTW, reward loop will become the default, and the old reward manager will be deprecated.

In my view, the reward loop can cover all features of the legacy reward manager, and users do not need to change their code. If you have any use cases that the reward loop cannot support, please feel free to bring them up.


UPD: You can add reward_model.use_reward_loop=False to customize the reward manager as before.

yyDing1 avatar Nov 29 '25 17:11 yyDing1

Can you share some scripts/examples for using genRM in the reward loop?

edc3000 avatar Dec 08 '25 02:12 edc3000

Please refer to recipe/fapo/run_fapo_7b.sh for example usage and this doc for more user instructions and architecture designs.

yyDing1 avatar Dec 08 '25 05:12 yyDing1

Hi, I'm having the same issue as #4346 with RewardManagerLoop when I use a custom reward manager. Is there any example of how to define a custom RewardManagerLoop? run_fapo_7b.sh doesn't use RewardManagerLoop, and the doc only explains the implementation without giving examples.

yumikim381 avatar Dec 09 '25 15:12 yumikim381

Thanks for the great feature.

enable_async_reward = (
    self.reward_router_address is not None and self.config.reward_model.enable_resource_pool
) or not self.config.reward_model.enable

Does this disallow using a reward vLLM server in colocate mode?

Liang-Qiu avatar Dec 09 '25 18:12 Liang-Qiu

@yumikim381 Hi, relevant docs will be added soon. You can also refer to the examples in verl/experimental/reward/reward_loop/*.py, where you should inherit from the RewardLoopManagerBase class and implement the run_single method.
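
Until then, a minimal sketch of such a subclass might look like this (the import path, the run_single signature, whether it is async, and the sample fields are all assumptions, not verified against the code):

from verl.experimental.reward.reward_loop import RewardLoopManagerBase  # assumed import path

class MyRewardLoopManager(RewardLoopManagerBase):
    async def run_single(self, sample):
        # compute the reward for one sample; field names below are placeholders
        solution = sample["response"]
        ground_truth = sample["ground_truth"]
        return 1.0 if solution.strip() == str(ground_truth).strip() else 0.0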

@Liang-Qiu Thanks for your attention. Colocate mode will allow using the reward loop DisRM in https://github.com/volcengine/verl/pull/4466. BTW, in colocate mode, reward workers are launched in the single controller rather than in the agent loop worker.

yyDing1 avatar Dec 10 '25 04:12 yyDing1

The current Reward Loop implementation appears to launch multiple workers for parallel reward computation, but I'm using an external Generate Reward Model service with limited concurrency capacity. I cannot find any configuration parameter or mechanism to control the maximum number of concurrent requests sent to the external RM service. Without proper concurrency limiting, this could overwhelm our external service and cause request failures or degraded performance.

Question: Is there a built-in way to configure the maximum concurrent requests to external reward model services? If not, could this feature be added to prevent overloading external RM APIs?

nantenT avatar Dec 12 '25 06:12 nantenT

@nantenT You can use the rate limit reward manager by setting reward_model.reward_manager=limited, with implementation details in verl/experimental/reward/reward_loop/limited.py.

You can directly modify the concurrency parameters in https://github.com/volcengine/verl/blob/01ab536cdfa9ffdd36d9b8996448c6b680fbe695/verl/experimental/reward/reward_loop/limited.py#L264-L307 (as a built-in way).
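
If the built-in limited manager does not fit your setup, the general pattern for capping in-flight requests to an external RM service is a semaphore around each call; this is a generic sketch, not the implementation in limited.py:

import asyncio

MAX_CONCURRENT_RM_REQUESTS = 8  # tune to what the external RM service can handle
_rm_semaphore = asyncio.Semaphore(MAX_CONCURRENT_RM_REQUESTS)

async def query_rm_with_limit(query_fn, *args, **kwargs):
    # at most MAX_CONCURRENT_RM_REQUESTS calls to query_fn are in flight at once
    async with _rm_semaphore:
        return await query_fn(*args, **kwargs)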

yyDing1 avatar Dec 12 '25 07:12 yyDing1

@yyDing1 Quick question: if I hard-code enable_async_reward = (self.reward_router_address is not None and self.config.reward_model.enable_resource_pool) or not self.config.reward_model.enable to enable_async_reward = False, will the reward computation fall back to verl/verl/workers/reward_manager/naive.py?

SupreCyk avatar Dec 17 '25 09:12 SupreCyk

You can directly set reward_model.use_reward_loop=False to use the batch reward manager.

yyDing1 avatar Dec 17 '25 13:12 yyDing1

Why is it called rewardLoop, by the way? I don't see any sort of loop in the reward manager though.

paipeng-quiver avatar Dec 21 '25 15:12 paipeng-quiver

@paipeng-quiver Good question. The name reward loop is mainly motivated by the design goal and abstraction.

  • The reward loop is designed for hybrid reward scenarios, where users may compose multiple reward sources, e.g., different discriminative RMs (DisRM), generative RMs (GenRM), and rule-based rewards, and define how they are grouped and combined. In this setting, reward computation is not a one-shot function call (as in the previous reward implementation) but a user-defined interaction pattern between the policy outputs and multiple reward components; this is where the "loop" comes in.
  • Reward Loop follows the design philosophy of Agent Loop, i.e., rewards are computed asynchronously. Even if the reward manager itself does not expose an explicit loop, the overall reward computation still proceeds in a cycle of questions → agent loop → reward loop → rollout with reward score.

So the term rewardLoop is intended to capture the idea that reward computation is an ongoing, asynchronous process rather than a single static one-function pass.

yyDing1 avatar Dec 24 '25 12:12 yyDing1