Yuyang Ding comments

Results: 16 comments by Yuyang Ding

[trainer] fix: PPO reward model resource pool and worker creation with reward loop

The current implementation of the reward loop actually creates rollout servers colocated with the resources of `worker_group`, without invoking any methods of `worker_group` itself, so it may not cause errors....

[trainer] fix: PPO reward model resource pool and worker creation with reward loop

Got it. The vLLM replica does call methods from the worker class, and this was missed in the previous CI tests.

dapo reward_manager

This part has been moved to `verl/experimental/reward/reward_loop/dapo.py`.

dapo reward_manager

To correct a potential misunderstanding: the implementation of the DAPO reward manager has always remained under the `verl/` directory, i.e., `verl/workers/reward_manager/dapo.py`.

[rollout] fix: resource pool name in standalone mode

Relevant CI has been added in https://github.com/volcengine/verl/blob/main/.github/workflows/reward_model_vllm.yml and https://github.com/volcengine/verl/blob/main/.github/workflows/reward_model_sglang.yml.

[algo] feat: Add RateLimitedRewardLoopManager with three-layer rate limiting for API-based rewards

LGTM @wuxibin89

About SFT reproduction

We have released the SFT reproduction materials [here](https://drive.google.com/drive/folders/1kg7YDRk8jK4_Bo19jJpZtdAQMBoucppW). Unfortunately, the checkpoint files for the flan-t5-xxl and llama models were lost during transfer due to their large sizes and unstable transmission....

About model ckpt

Unfortunately, the checkpoint files for the flan-t5-xxl and llama models were lost during transfer due to their large sizes and unstable transmission. We welcome replication efforts, and we also plan...

Unable to replicate the results of Qwen2.5-Math-7B-Instruct

You can use the script [here](https://github.com/yyDing1/SCAN-PRM/blob/main/src/eval_prm/main_bon.py) to reproduce the results (adapted from the Qwen eval). It also supports majority voting and integration of a process reward model. Our results: Qwen2.5-Math-7B-Ins Greedy: 47.1...
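
For readers unfamiliar with the selection strategies mentioned in this comment, here is a minimal sketch of majority voting, best-of-N selection, and reward-weighted voting over sampled answers. It is not taken from the linked script: the function names, the example answers, and the scores are all hypothetical, and the scores merely stand in for whatever a process reward model would assign to each response.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Return the most frequent final answer among the sampled responses."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers, scores):
    """Return the answer of the single response with the highest score."""
    best_index = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best_index]


def weighted_vote(answers, scores):
    """Sum the scores of responses sharing an answer; return the top answer."""
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    return max(totals, key=totals.get)


# Hypothetical data: four sampled final answers and made-up reward scores.
answers = ["12", "15", "12", "7"]
scores = [0.50, 0.80, 0.45, 0.10]

print(majority_vote(answers))
print(best_of_n(answers, scores))
print(weighted_vote(answers, scores))
```

In this toy data the three strategies can disagree: best-of-N follows the single highest-scored response, while the two voting variants favor the answer that several responses agree on.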

[RFC] Reward Loop

@wuxibin89 @vermouth1992 @PeterSH6 👀