DeepSpeedExamples
Add DPO support for DeepSpeed-Chat
Given the advantages of DPO (Direct Preference Optimization), described by its authors as "stable, performant, and computationally lightweight, eliminating the need for fitting a reward model, sampling from the LM during fine-tuning, or performing significant hyperparameter tuning", this PR adds DPO support to DeepSpeed-Chat.
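For context, the objective behind the quoted claims can be sketched as follows. This is a minimal, framework-free illustration of the standard DPO loss from the paper, not code from this PR; the function and argument names are hypothetical.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single preference pair (illustrative sketch).

    Each argument is the summed token log-probability of the chosen or
    rejected response under the trainable policy or the frozen reference
    model. beta scales the implicit KL constraint toward the reference.
    """
    # Log-ratio of policy to reference for each response
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # DPO logit: how much more the policy prefers the chosen response
    # over the rejected one, relative to the reference model
    logits = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(logits)), computed stably as log1p(exp(-logits))
    return math.log1p(math.exp(-logits))
```

Because the reward is defined implicitly by these log-ratios, no separate reward model or on-policy sampling is needed during fine-tuning, which is the "computationally lightweight" property the description quotes.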
Accidentally closed the PR. Sorry :(