Async pipeline for generate and compute_score
With async rollout, we've separated the pipeline during the generate stage. However, we must wait for the batch to complete before moving to the reward stage. When compute_reward_async is enabled, reward calculation can run in parallel with old_log_prob and value computation. In practice, reward calculation is often slower than old_log_prob and value computation. This creates GPU idle time before computing advantages, as shown in the figure below:
To start reward calculation sooner and avoid GPU idle time, it’s better to integrate compute_score into the generate pipeline:
- compute_score will be executed by a Ray actor.
- The reward manager gets Ray futures from compute_score, then computes reward_tensor and reward_extra_info from the scores.
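A minimal sketch of this future-based flow. In verl the score tasks would be dispatched to a Ray actor; here Python's `concurrent.futures` stands in for Ray so the sketch is self-contained, and `compute_score` is a toy exact-match scorer, not the real reward function.

```python
from concurrent.futures import ThreadPoolExecutor

def compute_score(solution: str, ground_truth: str) -> float:
    # Toy exact-match scorer; a real code-RL scorer would run unit tests etc.
    return 1.0 if solution.strip() == ground_truth.strip() else 0.0

samples = [("42", "42"), ("41", "42")]

with ThreadPoolExecutor(max_workers=4) as executor:
    # As each rollout finishes generating, submit its score task immediately,
    # so scoring overlaps with the rest of the batch's generation.
    futures = [executor.submit(compute_score, s, gt) for s, gt in samples]
    # The reward manager later resolves the futures to build reward_tensor:
    scores = [f.result() for f in futures]

print(scores)  # [1.0, 0.0]
```

The key point is that score computation is launched per sample as soon as its rollout completes, while the reward manager only blocks on the futures when assembling the final reward tensor.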
We have already implemented this feature, please check the reward_model.launch_reward_fn_async=True argument.
Besides setting reward_model.launch_reward_fn_async=True, do we need to define our own compute_score function or decorate it with Ray? Looking at nvtop in the attached picture, the GPU is idle more than half of the time.
Also, if I have my own function that computes a batch of scores at once (leveraging multi-core CPUs) instead of one score at a time, how could I incorporate it into the framework?
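For illustration, a batch-level scorer that fans work out over CPU cores might look like the sketch below. `batch_compute_score` and `score_one` are hypothetical user functions, not verl APIs; a custom reward manager could call such a function once per batch instead of invoking a per-sample scorer.

```python
from concurrent.futures import ProcessPoolExecutor

def score_one(pair):
    # Toy per-sample scorer (exact match); CPU-bound work goes here.
    solution, ground_truth = pair
    return 1.0 if solution.strip() == ground_truth.strip() else 0.0

def batch_compute_score(solutions, ground_truths, max_workers=4):
    # Spread the batch across worker processes to use multiple CPU cores.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(score_one, zip(solutions, ground_truths)))

if __name__ == "__main__":
    print(batch_compute_score(["42", "41", "40"], ["42", "42", "40"]))
```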
Yes, my picture in the first post assumes reward_model.launch_reward_fn_async=True. In the current implementation, reward is a batched task that runs asynchronously with old_log_prob and value computation.
In a code RL job, the reward takes longer than old_log_prob and value computation, which results in GPU idle time.
The idea in this issue is to move compute_score into the generate pipeline, out of the batched reward step.
Probably the best way to do this is to use the async chat scheduler and collect the reward results at the end.
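The scheduler idea can be sketched with `asyncio`: each rollout launches its score computation the moment generation finishes, and the batch only gathers rewards once at the end. All names here (`generate`, `rollout_and_score`) are illustrative stand-ins, not the actual async chat scheduler API.

```python
import asyncio

async def generate(prompt: str) -> str:
    await asyncio.sleep(0.01)          # stand-in for rollout latency
    return prompt + " -> answer"

def compute_score(response: str) -> float:
    # Toy scorer standing in for the real (slow) reward function.
    return float("answer" in response)

async def rollout_and_score(prompt: str) -> float:
    response = await generate(prompt)
    # Score in a worker thread so it overlaps with the other rollouts:
    return await asyncio.to_thread(compute_score, response)

async def main():
    prompts = ["p0", "p1", "p2"]
    # Reward results are collected only once, at the end of the batch:
    return await asyncio.gather(*(rollout_and_score(p) for p in prompts))

print(asyncio.run(main()))  # [1.0, 1.0, 1.0]
```

Because scoring starts per sample rather than after the whole batch, the slow reward function no longer serializes behind generation.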
For batch rewards, we have implemented a batch reward manager, please check that.
@mertunsall I have set reward_model.launch_reward_fn_async=True and reward_model.reward_manager=prime, but rewards are only calculated after all rollouts have completed. Even though reward calculation is asynchronous, GPU utilization drops to zero for extended periods. Are there any suggested modifications or examples to achieve simultaneous rollout and reward calculation?