stanford_alpaca

Why proceed with this kind of research evaluation process?

Open · timothylimyl opened this issue on Apr 11, 2023 · 1 comment

Please correct me if I got anything wrong. I am trying to learn more about LLM research.

Alpaca Contribution: Your research team instruction-tuned LLaMA via the self-instruct method and evaluated it on a ~250-instruction test set. Quoting your blog: """ This evaluation set was collected by the self-instruct authors and covers a diverse list of user-oriented instructions including email writing, social media, and productivity tools. We performed a blind pairwise comparison between text-davinci-003 and Alpaca 7B, and we found that these two models have very similar performance: Alpaca wins 90 versus 89 comparisons against text-davinci-003.

We were quite surprised by this result given the small model size and the modest amount of instruction following data. Besides leveraging this static evaluation set, we have also been testing the Alpaca model interactively and found that Alpaca often behaves similarly to text-davinci-003 on a diverse set of inputs. We acknowledge that our evaluation may be limited in scale and diversity. """
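For concreteness, here is a minimal sketch of the kind of blind pairwise comparison described in the quote above. The data format and the `judge` callback are my own assumptions for illustration, not the actual Alpaca evaluation code:

```python
import random

def blind_pairwise_eval(prompts, outputs_a, outputs_b, judge):
    """Tally wins for model A vs model B under a blinded judge.

    `judge(prompt, left, right)` must return "left", "right", or "tie";
    presentation order is shuffled so the judge cannot tell which model
    produced which response.
    """
    wins_a = wins_b = ties = 0
    for prompt, a, b in zip(prompts, outputs_a, outputs_b):
        # Randomize left/right so the annotator is blind to model identity.
        a_is_left = random.random() < 0.5
        left, right = (a, b) if a_is_left else (b, a)
        verdict = judge(prompt, left, right)
        if verdict == "tie":
            ties += 1
        elif (verdict == "left") == a_is_left:
            wins_a += 1
        else:
            wins_b += 1
    return wins_a, wins_b, ties
```

Note that comparing two win counts (e.g. 90 vs 89) on a fixed instruction set only tells you the models are hard to distinguish on those prompts; it says nothing about absolute task accuracy, which is what I am asking about below.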

Question: Why didn't you benchmark the instruction-tuned LLaMA against the actual foundation model on standard tasks?

In the LLaMA paper, Section 4 (instruction tuning), the authors show that instruction tuning improves the model's performance on the MMLU benchmark (obviously one would hope for more tasks to be benchmarked, but at least there is something meaningful to take away there). For Alpaca, I do not get why you didn't benchmark accordingly; based on your current evaluation, the results do not seem very meaningful. How would I know whether your process of data collection and fine-tuning is superior, or even helpful?
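To make the comparison I have in mind concrete, here is a minimal sketch of MMLU-style multiple-choice scoring. It assumes a hypothetical `loglikelihood(model, context, continuation)` function that returns the log-probability the model assigns to a continuation given a context (roughly how evaluation harnesses score multiple-choice tasks); the names and data format are illustrative, not from any actual codebase:

```python
def mmlu_style_accuracy(model, questions, loglikelihood):
    """Score a model on MMLU-style multiple choice by picking the
    answer option with the highest log-likelihood given the question.

    `questions` is a list of dicts with "prompt", "choices" (list of
    answer strings), and "answer" (index of the correct choice).
    `loglikelihood` is the hypothetical scoring function described above.
    """
    correct = 0
    for q in questions:
        scores = [loglikelihood(model, q["prompt"], choice)
                  for choice in q["choices"]]
        pred = max(range(len(scores)), key=scores.__getitem__)
        correct += (pred == q["answer"])
    return correct / len(questions)

# Running both checkpoints on the same task would directly answer
# "does this instruction tuning help?":
# acc_llama  = mmlu_style_accuracy(llama_7b,  mmlu_questions, loglikelihood)
# acc_alpaca = mmlu_style_accuracy(alpaca_7b, mmlu_questions, loglikelihood)
```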

If you look at Google's research on FLAN, Section 4 (Scaling Laws) shows that accuracy on held-out tasks does not improve with instruction tuning for smaller language models. From that paper's perspective, the currently held belief is that instruction tuning could damage the generalisability of a language model with smaller capacity (which may well be the case for Alpaca, given its size). The FLAN paper was very methodical (many evaluation tasks + held-out tasks), so we are able to get some meaningful findings (which may be disproved in the future, but they are new findings nevertheless).

In HELM, the Alpaca and LLaMA models are nowhere to be found (as of 11/04/23). If the models were added there, it would have been meaningful! As claimed by your repo: """ In a preliminary human evaluation, we found that the Alpaca 7B model behaves similarly to the text-davinci-003 model on the Self-Instruct instruction-following evaluation suite """

Why not go further and evaluate the way typical research papers do? If Alpaca 7B and LLaMA 7B were added to HELM, we could get pretty meaningful findings on whether instruction tuning helps (LLaMA vs Alpaca) and whether Alpaca really behaves similarly to text-davinci-003.

I guess the main thing I am unclear about is this: given the way you are evaluating your model, what meaningful research result were you aiming for with Alpaca?
