Will the evaluation code be released?
I want to reproduce the evaluation results, such as on the QA or reasoning tasks. Will the evaluation code be released? Is there any recommendation for implementing it quickly?
I want to do the same thing. Did you find any solutions?
@yysjasmine Not now. Do you have any idea?
I tried lm-evaluation-harness here: https://github.com/EleutherAI/lm-evaluation-harness/ using the Hugging Face release of the LLaMA 7B model: https://huggingface.co/decapoda-research/llama-7b-hf, but the results on the PIQA and HellaSwag datasets were 2% to 3% lower than the numbers in the original paper, and I haven't found the cause yet.
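For reference, this is roughly what I ran (a minimal sketch using the harness's Python API; the exact argument names depend on the harness version, and newer releases expose lm_eval.simple_evaluate directly):

```python
# Sketch of a zero-shot PIQA / HellaSwag run with lm-evaluation-harness
# (argument names follow the v0.3-era API).
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=decapoda-research/llama-7b-hf",
    tasks=["piqa", "hellaswag"],
    num_fewshot=0,
    batch_size=8,
    device="cuda:0",
)

# Each task reports several metrics; HellaSwag results are commonly
# compared on acc_norm rather than acc.
for task, metrics in results["results"].items():
    print(task, metrics)
```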
Hi @yysjasmine, I am also trying to reproduce the results of LLaMA. Could you please share your test script/command? Thanks!
@yysjasmine Sounds great. I am trying the Evaluation framework released with GPT-4, i.e., Evals.
I also tried to reproduce the paper results with lm-eval-harness and got non-negligible differences between the lm-eval-harness output and the paper.
I understood from the paper that lm-eval-harness was used to evaluate the model, so could the authors explain what differences they implemented in the evaluation scripts?
Thanks
Would you share your test results?
@lshowway That's a good idea! Can you reproduce LLaMA's results with the Evals framework?
@yysjasmine I didn't test LLaMA on their test sets. The Evals framework only supports OpenAI models, and I focused on a specific topic with the RealToxicityPrompts dataset; the toxicity of GPT-3.5-turbo is considerably higher than LLaMA's.
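To give a sense of the setup (a simplified sketch, not my exact script): the original RealToxicityPrompts protocol scores continuations with the Perspective API; here a local classifier, detoxify, stands in for it, and generate_continuation is a placeholder for whichever model is being queried (GPT-3.5-turbo via the OpenAI API, LLaMA via transformers, etc.):

```python
# Simplified sketch of a toxicity comparison on RealToxicityPrompts.
# `generate_continuation` is a placeholder; detoxify stands in for the
# Perspective API used in the original protocol.
from datasets import load_dataset
from detoxify import Detoxify

prompts = load_dataset("allenai/real-toxicity-prompts", split="train")
scorer = Detoxify("original")

def generate_continuation(prompt_text: str) -> str:
    # Placeholder: call the model under test here and return its continuation.
    raise NotImplementedError

scores = []
for row in prompts.select(range(100)):  # small sample for illustration
    continuation = generate_continuation(row["prompt"]["text"])
    scores.append(scorer.predict(continuation)["toxicity"])

print("mean toxicity:", sum(scores) / len(scores))
```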
I also got the same ~2-3% decrease in performance when replicating, and I'm not sure why.
For example: 7B on HellaSwag - paper: 76.1%, replicated testing: 72.9%.
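One thing worth double-checking when comparing against the paper is which metric is being read, since the harness reports both acc and acc_norm for HellaSwag. A small sketch over a saved results file (the path is a placeholder and the key layout follows the v0.3-era output format):

```python
# Print both HellaSwag metrics from a saved lm-eval-harness results file,
# to confirm which one is being compared against the paper number.
import json

with open("results/llama-7b.json") as f:  # hypothetical output path
    results = json.load(f)

hs = results["results"]["hellaswag"]
print(f"acc      = {hs['acc']:.4f} ± {hs['acc_stderr']:.4f}")
print(f"acc_norm = {hs['acc_norm']:.4f} ± {hs['acc_norm_stderr']:.4f}")
```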
Hi @philwee @yysjasmine, did you evaluate accuracy on the RACE task? I got low performance on this task; the result is as follows. Do you have any ideas? I'd appreciate your help.

Hi @lxw0109, I also get a pretty low result on RACE: 42.11 with the mixed version. May I know if you have made any improvements?
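In case it helps, this is how I sanity-check which RACE configuration is actually being scored, since the dataset ships as three configs on the Hugging Face Hub and reported accuracies can differ between them (a small sketch with the datasets library):

```python
# List the RACE configurations available on the Hugging Face Hub and their
# test-set sizes, to confirm which subset a score corresponds to.
from datasets import load_dataset

for config in ("high", "middle", "all"):
    ds = load_dataset("race", config, split="test")
    print(f"race/{config}: {len(ds)} test examples")
```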
No improvements, T_T
It seems lm-eval-harness can reproduce the LLaMA (paper v1) performance on HellaSwag: the LLaMA-30B model gives 82.65% acc_norm while the paper reports 82.9%. Some issues remain on other tasks, though.
So I have been trying to evaluate Llama-2 7B on the SQuAD dataset. I do so with a pipeline that builds prompts by concatenating the context and question directly from the dataset and then stores the outputs in a dataframe. Doing this, the model returns 30% accuracy compared to the ~60% mentioned in the paper. Am I doing something wrong?
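Roughly, the pipeline looks like this (a much-simplified sketch; generate_answer is a placeholder for the actual Llama-2 generation call, and scoring here uses the official SQuAD exact-match/F1 metric from the evaluate library rather than a raw string-equality check, which can change the number substantially):

```python
# Much-simplified sketch of a prompt-based SQuAD evaluation.
# `generate_answer` is a placeholder for the Llama-2 7B generation call.
from datasets import load_dataset
import evaluate

squad = load_dataset("squad", split="validation")
squad_metric = evaluate.load("squad")

def generate_answer(prompt: str) -> str:
    # Placeholder: run the model and return only the answer span,
    # e.g. by stopping at the first newline.
    raise NotImplementedError

predictions, references = [], []
for ex in squad.select(range(200)):  # small sample for illustration
    prompt = f"Context: {ex['context']}\nQuestion: {ex['question']}\nAnswer:"
    predictions.append({"id": ex["id"], "prediction_text": generate_answer(prompt)})
    references.append({"id": ex["id"], "answers": ex["answers"]})

print(squad_metric.compute(predictions=predictions, references=references))
```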