trl
Why not create an example of using PPO to train a summarization model?
Thank you for your excellent work. I have a question: I saw that you provided a script for training the reward model (RM) from the OpenAI paper 'Learning to Summarize from Human Feedback'. Why haven't you gone on to train a policy with PPO on that dataset?
Lack of time and people - feel free to try it!