trl
[GRPO] initial GRPO trainer
Implements GRPO (Group Relative Policy Optimization) as described in the DeepSeekMath paper: https://arxiv.org/pdf/2402.03300
Still a work in progress:
- Iterative reward model training will be added later.
- Only outcome supervision is enabled so far; process supervision will be implemented later.
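
For reviewers unfamiliar with outcome supervision in GRPO: each prompt gets a group of sampled completions, each completion receives a single scalar reward, and that reward is normalized against the group's mean and standard deviation to give the advantage for every token in the completion. The snippet below is a minimal standalone sketch of that normalization, not the code in this PR. The function name `group_relative_advantages`, the `eps` term, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize one group's rewards into advantages (hypothetical helper).

    rewards: scalar rewards, one per completion sampled for the same prompt.
    eps: small constant (an assumption) so a group whose rewards are all
         equal does not divide by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; the sample std is another option
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt, scored by an outcome reward.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions that score above the group mean get a positive advantage and those below get a negative one, so no separate value model is needed as a baseline.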
closes #2103