stanford_alpaca

Align human preferences with Alpaca without reinforcement learning

Open · GanjinZero opened this issue on Apr 11, 2023 · 0 comments

We propose a new learning paradigm named RRHF (Rank Responses to Align Human Feedback), which does not need reinforcement learning and performs on par with PPO at aligning models with human preferences. Using RRHF, we fine-tune Alpaca into Wombat on various ChatGPT responses in under 2 hours of training.

https://github.com/GanjinZero/RRHF
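
For readers who want a feel for the idea, below is a minimal, illustrative sketch of an RRHF-style objective: a pairwise ranking loss over length-normalized log-probabilities of candidate responses, plus a standard SFT loss on the best-scored candidate. The function names (`sequence_logprob`, `rrhf_loss`), tensor layout, and the assumption that logits are already aligned with labels are ours for illustration; the official implementation in the linked repo may differ in details.

```python
# Illustrative sketch of an RRHF-style loss (assumes PyTorch and a causal LM
# whose logits are already aligned with the label tokens; real training code
# would shift logits/labels by one position).
import torch
import torch.nn.functional as F

def sequence_logprob(logits, labels, mask):
    """Length-normalized log-probability of each candidate response.

    logits: (k, T, vocab)  model outputs for k candidate responses
    labels: (k, T)         token ids (prompt/padding positions masked out)
    mask:   (k, T)         1 for response tokens, 0 elsewhere
    """
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # (k, T)
    return (token_logp * mask).sum(-1) / mask.sum(-1).clamp(min=1)  # (k,)

def rrhf_loss(logits, labels, mask, rewards):
    """Pairwise ranking loss over candidates + SFT loss on the best candidate."""
    p = sequence_logprob(logits, labels, mask)                       # (k,)
    # gap[i, j] = p_j - p_i; hinge it wherever reward_i > reward_j, so that a
    # lower-reward response scoring higher than a better one is penalized.
    gap = p.unsqueeze(0) - p.unsqueeze(1)
    higher = rewards.unsqueeze(1) > rewards.unsqueeze(0)             # (k, k)
    rank_loss = torch.clamp(gap, min=0)[higher].sum()

    # SFT term: cross-entropy on the highest-reward candidate's response tokens.
    best = rewards.argmax()
    keep = mask[best].bool()
    sft_loss = F.cross_entropy(logits[best][keep], labels[best][keep])
    return rank_loss + sft_loss
```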

GanjinZero · Apr 11, 2023, 11:04