stanford_alpaca

Align human preferences with Alpaca without reinforcement learning

Open · GanjinZero opened this issue on Apr 11, 2023 · 0 comments

We propose a new learning paradigm named RRHF (Rank Responses to Align Human Feedback), which does not need reinforcement learning and performs on par with PPO at aligning models with human preferences. Using RRHF, we fine-tune Alpaca into Wombat on various ChatGPT responses in under 2 hours of training.

https://github.com/GanjinZero/RRHF
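
For readers who want a feel for the idea, below is a minimal, illustrative sketch of an RRHF-style objective: a pairwise ranking loss over length-normalized log-probabilities of candidate responses, plus a standard SFT loss on the best-scored candidate. The function names (`sequence_logprob`, `rrhf_loss`), tensor layout, and the assumption that logits are already aligned with labels are ours for illustration; the official implementation in the linked repo may differ in details.

```python
# Illustrative sketch of an RRHF-style loss (assumes PyTorch and a causal LM
# whose logits are already aligned with the label tokens; real training code
# would shift logits/labels by one position).
import torch
import torch.nn.functional as F

def sequence_logprob(logits, labels, mask):
    """Length-normalized log-probability of each candidate response.

    logits: (k, T, vocab)  model outputs for k candidate responses
    labels: (k, T)         token ids (prompt/padding positions masked out)
    mask:   (k, T)         1 for response tokens, 0 elsewhere
    """
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # (k, T)
    return (token_logp * mask).sum(-1) / mask.sum(-1).clamp(min=1)  # (k,)

def rrhf_loss(logits, labels, mask, rewards):
    """Pairwise ranking loss over candidates + SFT loss on the best candidate."""
    p = sequence_logprob(logits, labels, mask)                       # (k,)
    # gap[i, j] = p_j - p_i; hinge it wherever reward_i > reward_j, so that a
    # lower-reward response scoring higher than a better one is penalized.
    gap = p.unsqueeze(0) - p.unsqueeze(1)
    higher = rewards.unsqueeze(1) > rewards.unsqueeze(0)             # (k, k)
    rank_loss = torch.clamp(gap, min=0)[higher].sum()

    # SFT term: cross-entropy on the highest-reward candidate's response tokens.
    best = rewards.argmax()
    keep = mask[best].bool()
    sft_loss = F.cross_entropy(logits[best][keep], labels[best][keep])
    return rank_loss + sft_loss
```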

GanjinZero · Apr 11, 2023, 11:04