Vision-SR1
Request for script to reproduce naive GRPO baseline
Hello, thank you for sharing this great project! 🙏
I would like to reproduce and test the naive GRPO baseline (without the second-pass self-reward) under the same environment and settings as in your work.
Is there an existing script for running naive GRPO, or could you kindly suggest the simplest way to set it up?
Thank you in advance for your help!
Refer to the script. The main changes are the reward function and the prompt template: the reward function only rewards the final answer, and the prompt is a CoT prompt.
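As a rough illustration of the two changes mentioned in the reply, here is a minimal sketch of an answer-only reward paired with a CoT prompt. It is not the repository's actual code: the names `COT_PROMPT`, `extract_boxed`, and `answer_only_reward`, the `<think>` tags, the `\boxed{}` answer format, and the exact-match scoring are all assumptions made for the example.

```python
import re

# Hypothetical CoT prompt template; the repository's real wording may differ.
COT_PROMPT = (
    "{question}\n"
    "Think step by step inside <think> </think> tags, "
    "then give the final answer inside \\boxed{{}}."
)


def extract_boxed(text):
    """Return the content of the last \\boxed{...} in text, or None.

    Assumption: answers contain no nested braces.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def answer_only_reward(response, ground_truth):
    """Score only the final answer; the reasoning trace is ignored.

    Assumption: case-insensitive exact match against the ground truth.
    """
    pred = extract_boxed(response)
    if pred is None:
        return 0.0  # no parsable final answer
    return 1.0 if pred.lower() == ground_truth.strip().lower() else 0.0
```

With this kind of reward, every sampled response in a GRPO group is scored with a single pass, with no second-pass self-reward term added.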