trl

[GRPO] initial GRPO trainer

Open saisurbehera opened this issue 6 months ago • 3 comments

Implementation of the DeepSeekMath GRPO: https://arxiv.org/pdf/2402.03300

This is still a work in progress:

  • Iterative reward model training has not been added yet.
  • Only outcome supervision is enabled so far; process supervision will be implemented later.
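
For readers unfamiliar with outcome supervision in GRPO, here is a minimal sketch of the group-relative advantage described in the DeepSeekMath paper: each sampled completion for a prompt gets one scalar reward, the rewards are normalized by the group's mean and standard deviation, and that normalized value is used as the advantage for every token of the completion. This is not the code in this PR. The function names, the `eps` term, and the choice of population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one group of completions for a single prompt.

    `eps` is a hypothetical safeguard against a zero standard deviation
    (all completions scored the same); it is not taken from the paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantages, lengths):
    """Outcome supervision: every token of completion i shares advantage i."""
    return [[a] * n for a, n in zip(advantages, lengths)]


# Four sampled completions for one prompt, scored by a reward model.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
token_advantages = broadcast_to_tokens(advantages, lengths=[3, 2, 4, 1])
```

Process supervision would differ in that rewards are assigned per reasoning step rather than once per completion, so the broadcast step above would no longer apply as written.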

closes #2103

saisurbehera • Aug 21 '24 02:08