Request for script to reproduce naive GRPO baseline

Open naajeehxe opened this issue 4 months ago • 1 comment

Hello, thank you for sharing this great project! 🙏

I would like to reproduce and test the naive GRPO baseline (without the second-pass self-reward) under the same environment and settings as in your work.

Is there an existing script for running naive GRPO, or could you kindly suggest the simplest way to set it up?

Thank you in advance for your help!

Sep 08 '25 05:09 naajeehxe

Refer to the script. The main changes are to the reward function, which should reward only the final answer, and to the prompt template, which should be a plain CoT prompt.
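As a rough illustration of those two changes, here is a minimal sketch, not the repository's actual code. The names `COT_PROMPT`, `extract_answer`, and `final_answer_reward`, the `<answer>` tag convention, and the exact-match scoring are all assumptions made for this example; adapt them to whatever the training script expects.

```python
import re
from typing import Optional

# Hypothetical CoT prompt template (an assumption, not the project's own wording):
# the model reasons freely, then puts only the final answer inside <answer> tags.
COT_PROMPT = (
    "{question}\n"
    "Think step by step, then give the final answer inside <answer></answer> tags."
)

# Matches the content of every <answer>...</answer> span, across newlines.
_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL | re.IGNORECASE)


def extract_answer(completion: str) -> Optional[str]:
    """Return the content of the last <answer> span, or None if there is none."""
    matches = _ANSWER_RE.findall(completion)
    return matches[-1].strip() if matches else None


def final_answer_reward(completion: str, ground_truth: str) -> float:
    """Reward only the final answer: 1.0 on a case-insensitive exact match, else 0.0.

    No second-pass self-reward term is included, which is what makes this
    the "naive" baseline in this sketch.
    """
    predicted = extract_answer(completion)
    if predicted is None:
        return 0.0
    return 1.0 if predicted.lower() == ground_truth.strip().lower() else 0.0
```

For example, `final_answer_reward("First I compare the options. <answer> B </answer>", "b")` scores a match, while a completion with no `<answer>` tags scores zero. Exact string matching is the simplest choice; tasks with numeric or free-form answers would likely need a more tolerant comparison.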

Sep 08 '25 23:09 zli12321