deita
Questions about performance improvement in Open LLM leaderboard
Hi! First of all, thank you for sharing your wonderful work!
I have been searching for efficient ways of mining instructions for instruction-tuning LLMs. While reading the manuscript and investigating the open-sourced 6k and 10k datasets you provide, I could not intuitively understand why the SFT (6k) + DPO (10k) training recipe improves performance on multiple-choice question-answering tasks such as ARC-Challenge and MMLU.
In the dataset, the instances are conversations between humans and GPT that contain no obvious clues about how to solve multiple-choice QA problems.
Do you have any ideas why it worked?
Hi, thanks for your interest!
This question is indeed interesting. We have a couple of speculations that might shed some light:
- Our top-performing model, trained with SFT (6k) followed by DPO (10k), starts from an intermediate SFT checkpoint rather than a fully converged one, and that checkpoint serves as the basis for further DPO training. Our hypothesis is that an overly optimized SFT stage can impair the inherent capabilities of the LLM. Therefore, taking a sub-optimal SFT checkpoint and then running DPO, which is specifically designed for alignment, appears to improve both academic benchmarks such as the OpenLLM leaderboard and alignment capabilities. A similar observation is reported for Zephyr [1, 2].
- We observe that some questions the model answers incorrectly can be rectified through multiple sampling attempts, using strategies like majority voting or re-ranking. This indicates that the model has the potential to answer correctly but struggles to do so consistently. Reinforcement learning techniques such as DPO can adjust the model's output preferences, increasing the likelihood of producing the correct answer in a single attempt [3, Section 5]; two small sketches of these mechanisms follow this list.
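To make the sampling idea in the second point concrete, here is a minimal, self-contained sketch of majority voting over multiple sampled answers. The `samples` list and the `majority_vote` helper are hypothetical; in practice each entry would be the answer letter extracted from one temperature-sampled generation of the model.

```python
from collections import Counter

def majority_vote(samples: list[str]) -> str:
    """Pick the answer that occurs most often among the sampled generations."""
    # Light normalisation so that "B", " b", and "B." count as the same choice.
    normalised = [s.strip().rstrip(".").upper() for s in samples]
    answer, _count = Counter(normalised).most_common(1)[0]
    return answer

# Hypothetical example: five sampled answers for one ARC-style question.
samples = ["B", "C", "B", "b.", "B"]
print(majority_vote(samples))  # -> "B"
```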
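And to illustrate, at the loss level, how DPO shifts probability mass toward the preferred response, below is a minimal sketch of the standard DPO objective written directly in PyTorch. This is not deita's actual training code; the function name and the toy per-sequence log-probabilities are chosen purely for illustration.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: increase the policy's log-prob margin between
    the chosen and rejected responses relative to the reference model."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # -log sigmoid(beta * (policy_margin - ref_margin)), averaged over the batch.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy per-sequence log-probabilities for a batch of two preference pairs.
policy_chosen = torch.tensor([-12.0, -15.0])
policy_rejected = torch.tensor([-13.0, -14.0])
ref_chosen = torch.tensor([-12.5, -14.8])
ref_rejected = torch.tensor([-12.8, -14.2])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```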
References
- [1] Zephyr: Direct Distillation of LM Alignment https://arxiv.org/abs/2310.16944
- [2] Reinforcement Learning for Language Models https://gist.github.com/yoavg/6bff0fecd65950898eba1bb321cfbd81
- [3] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models https://arxiv.org/abs/2402.03300
Thanks for sharing your insights and thoughts on my question!
I also agree with the second point, that the model has the potential to answer correctly but does not do so consistently. However, I still have a hard time interpreting exactly what DPO enhances through preference alignment.
A potential explanation might be the presence of STEM-related samples within the UltraFeedback dataset.