
Async pipeline in generate and compute score

Open chenhaiq opened this issue 7 months ago • 4 comments

With async rollout, we've separated the pipeline during the generate stage. However, we must wait for the batch to complete before moving to the reward stage. When compute_reward_async is enabled, reward calculation can run in parallel with old_log_prob and value computation. In practice, reward calculation is often slower than old_log_prob and value computation. This creates GPU idle time before computing advantages, as shown in the figure below:

[Image]

To start reward calculation sooner and avoid GPU idle time, it’s better to integrate compute_score into the generate pipeline:

  1. compute_score will be executed by a Ray actor.
  2. The reward manager gets Ray futures from compute_score, then calculates reward_tensor and reward_extra_info from the scores.
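The futures-based flow in step 2 can be sketched with the standard library. The proposal uses a Ray actor; `concurrent.futures` is used here only as a self-contained stand-in, and `compute_score` below is a placeholder scorer, not verl's API:

```python
# Sketch of the proposed pipeline: scoring is submitted as soon as each
# sample finishes generating, and the reward manager only blocks at the
# end when it assembles the reward tensor. (Stand-in for a Ray actor.)
from concurrent.futures import ThreadPoolExecutor

def compute_score(response: str, ground_truth: str) -> float:
    # Placeholder scoring logic: exact-match reward.
    return float(response.strip() == ground_truth.strip())

executor = ThreadPoolExecutor(max_workers=4)
samples = [("42", "42"), ("41", "42")]

# Generation side: launch scoring per finished sample instead of
# waiting for the whole batch to complete.
futures = [executor.submit(compute_score, resp, gt) for resp, gt in samples]

# Reward-manager side: gather the scores only when the batch is needed.
scores = [f.result() for f in futures]
print(scores)  # [1.0, 0.0]
```

With Ray the shape is the same: `compute_score.remote(...)` returns an object ref immediately, and `ray.get` collects the results later, so scoring overlaps with the remaining generation.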

chenhaiq avatar May 19 '25 12:05 chenhaiq

We have already implemented this feature, please check reward_model.launch_reward_fn_async=True argument
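For reference, the flag is passed as a Hydra-style override on the trainer command line. The entry point shown is the usual PPO trainer; the rest of the command is omitted, so treat this as a fragment, not a complete invocation:

```shell
# Enable async reward computation so the reward function runs in
# parallel with old_log_prob and value computation.
python3 -m verl.trainer.main_ppo \
    reward_model.launch_reward_fn_async=True
    # plus your usual data/model/trainer overrides
```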

mertunsall avatar May 23 '25 22:05 mertunsall

We have already implemented this feature, please check reward_model.launch_reward_fn_async=True argument

Besides setting reward_model.launch_reward_fn_async=True, do we need to define our own compute_score function or decorate it with Ray? Looking at nvtop (screenshots below), the GPU is idle more than half of the time.

Also, if I have my own function that computes a batch of scores at once (leveraging multi-core CPUs) instead of one at a time, how could I incorporate it into the framework?
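One way to structure such a batched scorer is to have the reward function accept the whole batch and fan it out over a process pool. How this hooks into verl's reward manager is not shown here, and the function names are hypothetical; the fan-out itself is plain stdlib:

```python
# Hypothetical batched scorer: fan a batch of (response, ground_truth)
# pairs out over worker processes to use multiple CPU cores.
from multiprocessing import Pool

def score_one(pair):
    response, ground_truth = pair
    # Placeholder scoring logic: exact match.
    return float(response.strip() == ground_truth.strip())

def compute_score_batch(responses, ground_truths, workers=4):
    with Pool(processes=workers) as pool:
        return pool.map(score_one, list(zip(responses, ground_truths)))

if __name__ == "__main__":
    print(compute_score_batch(["42", "41"], ["42", "42"]))  # [1.0, 0.0]
```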

[Image]

[Image]

HorHang avatar May 29 '25 02:05 HorHang

We have already implemented this feature, please check reward_model.launch_reward_fn_async=True argument

Yes, the picture in my first post assumes reward_model.launch_reward_fn_async=True. In the current implementation, reward is a batch task that runs asynchronously with old_log_prob and value computation.

In code RL jobs, reward computation takes longer than old_log_prob and value computation, which results in GPU idle time.

The idea in this issue is to move compute_score out of the batched reward step and into the generate pipeline.

chenhaiq avatar May 29 '25 03:05 chenhaiq

Probably the best way to do this is to use the async chat scheduler and collect the reward results at the end.

For batch rewards, we have implemented a batch reward manager, please check that.
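If I read the comment above correctly, the batch reward manager is selected the same way as the other managers, via a config override. This is a sketch: verify the exact manager name and the custom-reward keys against your verl version, and the module path here is a hypothetical placeholder:

```shell
# Select the batch reward manager and point verl at a batched scoring
# function (path/name values are illustrative placeholders).
python3 -m verl.trainer.main_ppo \
    reward_model.reward_manager=batch \
    custom_reward_function.path=my_rewards.py \
    custom_reward_function.name=compute_score_batch
```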

mertunsall avatar May 29 '25 07:05 mertunsall

@mertunsall I have set reward_model.launch_reward_fn_async=True and reward_model.reward_manager=prime, but rewards are calculated after all rollouts completed. Even though reward calculation is asynchronous, this causes GPU utilization to be zero for extended periods of time. Are there any suggested modifications or examples to achieve simultaneous rollout and reward calculation?

edc3000 avatar Sep 25 '25 07:09 edc3000