stanford_alpaca
Align Alpaca with human preferences without reinforcement learning
We propose a new learning paradigm named RRHF (Rank Responses to Align Human Feedback), which does not require reinforcement learning and performs on par with PPO at aligning language models with human preferences. Using RRHF, we fine-tune Alpaca into Wombat by ranking various ChatGPT responses, within 2 hours of training.
https://github.com/GanjinZero/RRHF
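
To illustrate the idea, below is a minimal PyTorch sketch of an RRHF-style objective: each candidate response is scored by its length-normalized log-probability under the model, and a hinge ranking loss penalizes pairs where a lower-reward response outscores a higher-reward one (the fine-tuning cross-entropy on the best response is added separately). Function names, tensor shapes, and the `rewards` source are illustrative assumptions, not taken from the repo; see the link above for the actual implementation.

```python
import torch

def length_normalized_logprobs(logits, labels, mask):
    """Score each candidate response by its mean token log-probability.

    logits: (k, seq, vocab) model logits for k candidate responses
    labels: (k, seq) target token ids
    mask:   (k, seq) 1.0 for response tokens, 0.0 for prompt/padding
    """
    logprobs = torch.log_softmax(logits, dim=-1)
    token_lp = logprobs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # (k, seq)
    return (token_lp * mask).sum(-1) / mask.sum(-1).clamp(min=1.0)    # (k,)

def rrhf_rank_loss(scores, rewards):
    """Hinge ranking loss over all response pairs.

    scores:  (k,) length-normalized log-probs of the k responses
    rewards: (k,) external reward for each response (assumed given,
             e.g. scores assigned by ChatGPT)
    """
    diff = scores.unsqueeze(1) - scores.unsqueeze(0)    # diff[i, j] = s_i - s_j
    worse = rewards.unsqueeze(1) < rewards.unsqueeze(0) # pairs with r_i < r_j
    return torch.relu(diff[worse]).sum()                # penalize mis-ordered pairs

# Toy usage with random tensors (4 candidate responses, 16 tokens, vocab 100):
logits = torch.randn(4, 16, 100)
labels = torch.randint(0, 100, (4, 16))
mask = torch.ones(4, 16)
rewards = torch.tensor([0.1, 0.9, 0.5, 0.3])

scores = length_normalized_logprobs(logits, labels, mask)
loss = rrhf_rank_loss(scores, rewards)  # add SFT loss on the best response in practice
```

Because the loss only compares scalar scores of already-sampled responses, training reduces to ordinary supervised fine-tuning plus a ranking term, which is what allows alignment without an RL loop such as PPO.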