trl


[GRPO] initial GRPO trainer

Open saisurbehera opened this issue 6 months ago • 3 comments

Implements GRPO (Group Relative Policy Optimization) as described in the DeepSeekMath paper: https://arxiv.org/pdf/2402.03300

Still a work in progress:

Iterative reward model training will be added.
Only outcome supervision is enabled so far; process supervision will be implemented later.
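
For readers unfamiliar with outcome supervision in GRPO, here is a minimal sketch of the group-relative advantage described in the DeepSeekMath paper: each sampled completion in a group gets one scalar reward, the rewards are normalized by the group mean and standard deviation, and the resulting value is shared by every token of that completion. This is not the code in this PR. The function names, the `eps` term, and the choice of population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the scalar rewards of one group of sampled completions.

    Hypothetical helper: `eps` and the use of the population standard
    deviation are assumptions, not details taken from the PR.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantages, completion_lengths):
    """Outcome supervision: every token of a completion shares its advantage."""
    return [[a] * n for a, n in zip(advantages, completion_lengths)]


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]  # one reward per completion in the group
    advantages = group_relative_advantages(rewards)
    print(broadcast_to_tokens(advantages, [2, 3, 1, 2]))
```

Process supervision would instead assign a reward per reasoning step, so the per-token advantages within one completion would no longer all be equal.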

closes #2103

Aug 21 '24 02:08 saisurbehera