Will the evaluation code be released?
I want to reproduce the evaluation results, such as on the QA or reasoning tasks. Will the evaluation code be released? Is there any recommendation for implementing it quickly?
I want to do the same thing. Did you find any solutions?
@yysjasmine Not now. Do you have any idea?
I tried lm-evaluation-harness here: https://github.com/EleutherAI/lm-evaluation-harness/ using the Hugging Face release of the LLaMA 7B model: https://huggingface.co/decapoda-research/llama-7b-hf, but the results on the PIQA and HellaSwag datasets were 2% to 3% lower than the numbers in the original paper, and I haven't found the cause yet.
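For reference, this is roughly what I ran (a minimal sketch using the harness's Python API; the exact argument names depend on the harness version, and newer releases expose lm_eval.simple_evaluate directly):

```python
# Sketch of a zero-shot PIQA / HellaSwag run with lm-evaluation-harness
# (argument names follow the v0.3-era API).
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=decapoda-research/llama-7b-hf",
    tasks=["piqa", "hellaswag"],
    num_fewshot=0,
    batch_size=8,
    device="cuda:0",
)

# Each task reports several metrics; HellaSwag results are commonly
# compared on acc_norm rather than acc.
for task, metrics in results["results"].items():
    print(task, metrics)
```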
Hi @yysjasmine, I am also trying to reproduce the results of LLaMA. Could you please share your test script/command? Thanks!
@yysjasmine Sounds great. I am trying the Evaluation framework released with GPT-4, i.e., Evals.
I also tried to reproduce the paper results with lm-eval-harness and got non-negligible differences between the lm-eval-harness output and the paper.
I understood from the paper that lm-eval-harness was used to evaluate the model, so could the authors explain what differences they implemented in the evaluation scripts?
Thanks
Would you share your test results?
@lshowway That's a good idea! Can you reproduce LLaMA's results with the Evals framework?
@yysjasmine I didn't test LLaMA on their test sets. The Evals framework only supports OpenAI models, and I focused on a specific topic with the RealToxicityPrompts dataset; the toxicity of GPT-3.5-turbo is considerably higher than LLaMA's.
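To give a sense of the setup (a simplified sketch, not my exact script): the original RealToxicityPrompts protocol scores continuations with the Perspective API; here a local classifier, detoxify, stands in for it, and generate_continuation is a placeholder for whichever model is being queried (GPT-3.5-turbo via the OpenAI API, LLaMA via transformers, etc.):

```python
# Simplified sketch of a toxicity comparison on RealToxicityPrompts.
# `generate_continuation` is a placeholder; detoxify stands in for the
# Perspective API used in the original protocol.
from datasets import load_dataset
from detoxify import Detoxify

prompts = load_dataset("allenai/real-toxicity-prompts", split="train")
scorer = Detoxify("original")

def generate_continuation(prompt_text: str) -> str:
    # Placeholder: call the model under test here and return its continuation.
    raise NotImplementedError

scores = []
for row in prompts.select(range(100)):  # small sample for illustration
    continuation = generate_continuation(row["prompt"]["text"])
    scores.append(scorer.predict(continuation)["toxicity"])

print("mean toxicity:", sum(scores) / len(scores))
```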
I also got the same ~2-3% decrease in performance when replicating, and I'm not sure why.
For example: 7B on HellaSwag - paper: 76.1%, replicated testing: 72.9%.
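One thing worth double-checking when comparing against the paper is which metric is being read, since the harness reports both acc and acc_norm for HellaSwag. A small sketch over a saved results file (the path is a placeholder and the key layout follows the v0.3-era output format):

```python
# Print both HellaSwag metrics from a saved lm-eval-harness results file,
# to confirm which one is being compared against the paper number.
import json

with open("results/llama-7b.json") as f:  # hypothetical output path
    results = json.load(f)

hs = results["results"]["hellaswag"]
print(f"acc      = {hs['acc']:.4f} ± {hs['acc_stderr']:.4f}")
print(f"acc_norm = {hs['acc_norm']:.4f} ± {hs['acc_norm_stderr']:.4f}")
```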
Hi @philwee @yysjasmine, did you evaluate accuracy on the RACE task? I got low performance on this task; the result is as follows. Do you have any ideas? I'd appreciate your help.

Hi @lxw0109, I also get a pretty low result on RACE: 42.11 with the mixed version. May I know if you have made any improvements?
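In case it helps, this is how I sanity-check which RACE configuration is actually being scored, since the dataset ships as three configs on the Hugging Face Hub and reported accuracies can differ between them (a small sketch with the datasets library):

```python
# List the RACE configurations available on the Hugging Face Hub and their
# test-set sizes, to confirm which subset a score corresponds to.
from datasets import load_dataset

for config in ("high", "middle", "all"):
    ds = load_dataset("race", config, split="test")
    print(f"race/{config}: {len(ds)} test examples")
```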
No improvements, T_T
It seems lm-eval-harness can reproduce the LLaMA (paper v1) performance on HellaSwag: the LLaMA-30B model gives 82.65% acc_norm while the paper reports 82.9%. Some issues remain on other tasks, though.
So I have been trying to evaluate Llama-2 7B on the SQuAD dataset. I do so with a pipeline that builds prompts by concatenating the context and question directly from the dataset and then stores the outputs in a dataframe. Doing this, the model returns 30% accuracy compared to the ~60% mentioned in the paper. Am I doing something wrong?
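Roughly, the pipeline looks like this (a much-simplified sketch; generate_answer is a placeholder for the actual Llama-2 generation call, and scoring here uses the official SQuAD exact-match/F1 metric from the evaluate library rather than a raw string-equality check, which can change the number substantially):

```python
# Much-simplified sketch of a prompt-based SQuAD evaluation.
# `generate_answer` is a placeholder for the Llama-2 7B generation call.
from datasets import load_dataset
import evaluate

squad = load_dataset("squad", split="validation")
squad_metric = evaluate.load("squad")

def generate_answer(prompt: str) -> str:
    # Placeholder: run the model and return only the answer span,
    # e.g. by stopping at the first newline.
    raise NotImplementedError

predictions, references = [], []
for ex in squad.select(range(200)):  # small sample for illustration
    prompt = f"Context: {ex['context']}\nQuestion: {ex['question']}\nAnswer:"
    predictions.append({"id": ex["id"], "prediction_text": generate_answer(prompt)})
    references.append({"id": ex["id"], "answers": ex["answers"]})

print(squad_metric.compute(predictions=predictions, references=references))
```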