Yuyang Ding
The current implementation of the reward loop actually creates rollout servers colocated with the resources of `worker_group`, without invoking any methods of `worker_group` itself, so it may not cause errors....
Got it. The vLLM replica does call methods from the worker class, and this was missed in the previous CI tests.
This part has been moved to `verl/experimental/reward/reward_loop/dapo.py`.
To correct a potential misunderstanding: the implementation of the DAPO reward manager has always remained under the `verl/` directory, i.e., `verl/workers/reward_manager/dapo.py`.
Relevant CI has been added in https://github.com/volcengine/verl/blob/main/.github/workflows/reward_model_vllm.yml and https://github.com/volcengine/verl/blob/main/.github/workflows/reward_model_sglang.yml
LGTM @wuxibin89
We have released the SFT reproduction materials [here](https://drive.google.com/drive/folders/1kg7YDRk8jK4_Bo19jJpZtdAQMBoucppW). Unfortunately, the checkpoint files for the flan-t5-xxl and llama models were lost during transfer due to their large sizes and unstable transmission....
Unfortunately, the checkpoint files for the flan-t5-xxl and llama models were lost during transfer due to their large sizes and unstable transmission. We welcome replication efforts, and we also plan...
You can use the script [here](https://github.com/yyDing1/SCAN-PRM/blob/main/src/eval_prm/main_bon.py) to reproduce the results (adapted from Qwen eval). It also supports majority voting and the integration of a process reward model. Our results: Qwen2.5-Math-7B-Ins Greedy: 47.1...
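The selection strategies mentioned above (majority voting and picking answers with a process reward model) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `main_bon.py`:
- The function names are hypothetical.
- Collapsing a solution's per-step PRM scores into one score by taking their minimum is an assumption. Other aggregations, such as last step or product, are also used in practice.

```python
from collections import Counter, defaultdict
from typing import Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent final answer; ties go to the answer seen first."""
    # Counter.most_common keeps first-encountered order among equal counts.
    return Counter(answers).most_common(1)[0][0]


def prm_score(step_scores: Sequence[float]) -> float:
    """Hypothetical aggregation: a solution is only as good as its weakest step."""
    return min(step_scores)


def best_of_n(answers: Sequence[str], step_scores: Sequence[Sequence[float]]) -> str:
    """Pick the single candidate whose aggregated PRM score is highest."""
    best = max(range(len(answers)), key=lambda i: prm_score(step_scores[i]))
    return answers[best]


def weighted_majority_vote(answers: Sequence[str], step_scores: Sequence[Sequence[float]]) -> str:
    """Sum aggregated PRM scores per distinct answer and return the top answer."""
    totals: dict[str, float] = defaultdict(float)
    for ans, scores in zip(answers, step_scores):
        totals[ans] += prm_score(scores)
    return max(totals, key=totals.get)


# Toy data: five sampled solutions, each with per-step PRM scores.
answers = ["12", "15", "12", "15", "15"]
step_scores = [[0.9, 0.8], [0.4, 0.3], [0.95, 0.9], [0.5, 0.6], [0.2, 0.7]]
print(majority_vote(answers))                        # 15
print(best_of_n(answers, step_scores))               # 12
print(weighted_majority_vote(answers, step_scores))  # 12
```

In this toy example, plain majority voting and PRM-based selection disagree. "15" wins on frequency, but its supporting solutions receive low step scores, so both PRM-based strategies prefer "12".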
@wuxibin89 @vermouth1992 @PeterSH6 👀