
Questions about performance improvement on the Open LLM Leaderboard

minstar opened this issue 11 months ago · 3 comments

Hi, first of all, thank you for sharing your wonderful work!

I have been looking for efficient ways to mine instructions for instruction-tuning LLMs. While reading the manuscript and examining the open-sourced 6k and 10k datasets you provide, I could not intuitively understand why the SFT (6k) + DPO (10k) training recipe improves performance on multiple-choice question answering tasks such as ARC-Challenge and MMLU.

In the dataset, the instances are conversations between humans and GPT, which contain no obvious signal for solving multiple-choice QA problems.
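(For context on how these benchmarks are scored: the Open LLM Leaderboard typically evaluates ARC-Challenge and MMLU by comparing the log-likelihood the model assigns to each answer option, rather than by free-form generation, so shifts in the model's output distribution can move accuracy even without multiple-choice-formatted training data. A rough, hypothetical sketch of this per-option scoring, ignoring tokenization-boundary subtleties; the model path and prompt format are placeholders:)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/checkpoint"  # hypothetical model path
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def option_loglik(prompt: str, option: str) -> float:
    """Sum of log-probabilities the model assigns to the option tokens given the prompt."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits                        # [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)      # position i predicts token i+1
    cont = full_ids[0, prompt_len:]                            # continuation (option) tokens
    idx = torch.arange(prompt_len - 1, full_ids.shape[1] - 1)  # positions that predict them
    return log_probs[idx, cont].sum().item()

def pick_answer(question: str, options: list[str]) -> int:
    """Return the index of the option with the highest log-likelihood."""
    prompt = question + "\nAnswer: "
    scores = [option_loglik(prompt, o) for o in options]
    return max(range(len(options)), key=scores.__getitem__)
```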

Do you have any ideas why it worked?

minstar · Mar 07 '24 01:03

Hi, thanks for your interest!

This question is indeed interesting. We have a couple of speculations that might shed some light:

  1. Our top-performing model, trained with SFT (6k) followed by DPO (10k), starts from an intermediate SFT checkpoint, which then serves as the basis for DPO training. Our hypothesis is that an over-optimized SFT stage can impair the inherent capabilities of the base LLM. Starting DPO, which is specifically designed for alignment, from a sub-optimal SFT checkpoint therefore appears to improve both academic benchmarks such as the Open LLM Leaderboard and alignment capabilities. A similar observation is reported for Zephyr [1, 2]. (A minimal training sketch follows after this list.)

  2. It's observed that some questions the model initially answers incorrectly can be rectified through multiple sampling attempts, employing strategies like majority voting or re-ranking (see the voting sketch below). This indicates that the model has the potential to answer correctly but struggles to do so consistently. Reinforcement learning techniques such as DPO can adjust the model's output preferences, increasing the likelihood of producing the correct answer in a single attempt [3, Section 5].
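Regarding point 1, here is a minimal sketch of the SFT-then-DPO recipe using the Hugging Face TRL library. The checkpoint path, dataset, and hyperparameters are illustrative, and exact TRL argument names vary by version (older releases take `tokenizer=` instead of `processing_class=`):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Start from an *intermediate* SFT checkpoint rather than a fully converged one.
sft_checkpoint = "path/to/intermediate-sft-checkpoint"  # hypothetical path
model = AutoModelForCausalLM.from_pretrained(sft_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(sft_checkpoint)

# Preference pairs (prompt, chosen, rejected); UltraFeedback is one example source.
pref_data = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

config = DPOConfig(
    output_dir="dpo-from-intermediate-sft",
    beta=0.1,                      # strength of the implicit KL regularization toward the reference
    per_device_train_batch_size=2,
    learning_rate=5e-7,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,                   # policy initialized from the intermediate SFT checkpoint
    ref_model=None,                # TRL clones the policy as the frozen reference model
    args=config,
    train_dataset=pref_data,
    processing_class=tokenizer,
)
trainer.train()
```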
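Regarding point 2, a minimal sketch of the "sample several times, then majority-vote" idea; `generate_answer` is a hypothetical helper that samples one response from the model and maps it to an option letter such as "A"-"D":

```python
from collections import Counter
from typing import Callable, List

def majority_vote(question: str,
                  options: List[str],
                  generate_answer: Callable[[str, List[str]], str],
                  num_samples: int = 16) -> str:
    """Sample `num_samples` answers and return the most frequent choice."""
    votes = Counter(generate_answer(question, options) for _ in range(num_samples))
    return votes.most_common(1)[0][0]
```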

References

VPeterV · Mar 21 '24 03:03

Thanks for sharing your insights and thoughts on my question!

I also agree with the second point, that the model has the potential to answer correctly but does not do so consistently. However, I still have a hard time interpreting exactly what DPO enhances through preference alignment.

minstar · Mar 21 '24 05:03

A potential explanation might be the presence of STEM-related samples in the UltraFeedback dataset.

VPeterV · Mar 21 '24 05:03