Will the evaluation code be released?

Open lshowway opened this issue 2 years ago • 11 comments

I want to reproduce the evaluation results, for example on the QA and reasoning tasks. Will the evaluation code be released? Is there any recommendation for implementing it quickly?

lshowway avatar Mar 17 '23 18:03 lshowway

I want to do the same thing. Have you found any solutions?

yysjasmine avatar Mar 23 '23 04:03 yysjasmine

@yysjasmine Not yet. Do you have any ideas?

lshowway avatar Mar 23 '23 13:03 lshowway

I tried lm-evaluation-harness (https://github.com/EleutherAI/lm-evaluation-harness/) with the Hugging Face release of the LLaMA 7B model (https://huggingface.co/decapoda-research/llama-7b-hf), but the results on the PIQA and HellaSwag datasets were 2% to 3% lower than the numbers in the original paper, and I haven't found the cause yet.

yysjasmine avatar Mar 27 '23 03:03 yysjasmine

Hi @yysjasmine, I am also trying to reproduce the results of LLaMA. Could you please share your test script/command? Thanks!

liuxiaocs7 avatar Mar 27 '23 07:03 liuxiaocs7
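
(For anyone looking for a starting point while waiting for a reply: the sketch below shows how this kind of comparison is typically run through the harness. It assumes the 2023-era lm-evaluation-harness Python API and the decapoda-research/llama-7b-hf checkpoint mentioned above; the model type string and some argument names differ across harness releases, so treat this as a sketch rather than the exact script used in this thread.)

```python
# Sketch: zero-shot PIQA / HellaSwag evaluation with lm-evaluation-harness.
# Assumes the 2023-era API; in other releases the model type string may be
# "hf" or "gpt2" and some arguments are named differently.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",  # HuggingFace causal-LM backend
    model_args="pretrained=decapoda-research/llama-7b-hf",
    tasks=["piqa", "hellaswag"],
    num_fewshot=0,      # the LLaMA paper reports these tasks zero-shot
    batch_size=8,
    device="cuda:0",
)

# Per-task metrics (both acc and acc_norm) live under results["results"];
# a later comment in this thread compares HellaSwag against acc_norm.
for task_name, metrics in results["results"].items():
    print(task_name, metrics)

# The rough CLI equivalent in that era:
#   python main.py --model hf-causal \
#     --model_args pretrained=decapoda-research/llama-7b-hf \
#     --tasks piqa,hellaswag --num_fewshot 0
```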

@yysjasmine Sounds great. I am trying the evaluation framework released alongside GPT-4, i.e., OpenAI Evals.

lshowway avatar Mar 27 '23 08:03 lshowway

I also tried to reproduce the paper results with lm-eval-harness and got non-negligible differences between the lm-eval-harness output and the paper.

I understood from the paper that lm-eval-harness was used to evaluate the model, so can the authors explain what changes they made in the evaluation scripts?

Thanks

ofirzaf avatar Mar 27 '23 23:03 ofirzaf

@ofirzaf Would you share your test results?

yysjasmine avatar Mar 28 '23 03:03 yysjasmine

@lshowway That's a good idea! Can you reproduce LLaMA's results with GPT-4's Evals?

yysjasmine avatar Mar 28 '23 06:03 yysjasmine

@yysjasmine I didn't test LLaMA on their test sets. The Evals framework only includes OpenAI models, and I focused on a specific topic using the RealToxicityPrompts dataset; the toxicity of GPT-3.5-turbo is considerably higher than LLaMA's.

lshowway avatar Mar 31 '23 08:03 lshowway
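
(A note on the toxicity comparison above: one way to score continuations without the Perspective API is the `toxicity` measurement in the Hugging Face `evaluate` library, sketched below. Both the scorer and the placeholder outputs are assumptions, not necessarily what was used for the comparison reported in this thread; RealToxicityPrompts itself was originally scored with the Perspective API.)

```python
# Sketch: comparing mean toxicity of continuations from two models.
# The HF `evaluate` "toxicity" measurement is an assumed stand-in scorer.
import evaluate

toxicity = evaluate.load("toxicity", module_type="measurement")

# Placeholder continuations; in practice these come from generating on the
# RealToxicityPrompts prompts with each model.
llama_outputs = ["a harmless continuation", "another harmless continuation"]
gpt35_outputs = ["a harmless continuation", "another harmless continuation"]

for name, outputs in [("llama-7b", llama_outputs), ("gpt-3.5-turbo", gpt35_outputs)]:
    scores = toxicity.compute(predictions=outputs)["toxicity"]
    print(f"{name}: mean toxicity = {sum(scores) / len(scores):.4f}")
```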

I also got the same ~2-3% decrease in performance when replicating; I'm not sure why.

E.g., 7B on HellaSwag: 76.1% in the paper vs. 72.9% in my replicated testing.

philwee avatar Apr 18 '23 07:04 philwee

Hi @philwee @yysjasmine, did you evaluate accuracy on the RACE task? I got low performance on it (results were attached as a screenshot). Do you have any ideas? I'd appreciate your help.

lxw0109 avatar Apr 23 '23 12:04 lxw0109

Hi @lxw0109, I also get fairly low results on RACE: 42.11 on the mixed version. May I ask whether you have found any improvements?

SparkJiao avatar May 20 '23 16:05 SparkJiao

@SparkJiao No improvements, T_T

lxw0109 avatar May 22 '23 04:05 lxw0109
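
(One way to debug a low RACE score is to inspect how the harness actually formats the examples and compare that against the setup described in the paper. A minimal sketch follows, assuming the 2023-era lm-evaluation-harness task API; newer releases use a different, config-driven task interface.)

```python
# Sketch: print a couple of RACE examples as the harness formats them, to
# check the prompt template and target. Assumes the 2023-era task API.
from lm_eval import tasks

task = tasks.get_task_dict(["race"])["race"]
docs = task.validation_docs() if task.has_validation_docs() else task.test_docs()

for i, doc in enumerate(docs):
    if i >= 2:
        break
    print("--- prompt ---")
    print(task.doc_to_text(doc))
    print("--- target ---")
    print(task.doc_to_target(doc))
```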

It seems lm-eval-harness can reproduce the LLaMA (v1 paper) performance on HellaSwag, but some issues remain on other tasks. The LLaMA-30B model gives 82.65% acc_norm, while the paper reports 82.9%.

hjeon2k avatar Sep 07 '23 09:09 hjeon2k

I have been trying to evaluate Llama 2 7B on the SQuAD dataset. I do so with a pipeline that builds prompts by concatenating the context and question directly from the dataset and then stores the output in a dataframe. With this setup, the model gets 30% accuracy, compared to the ~60% reported in the paper. Am I doing something wrong?

Sanchit-404 avatar Nov 05 '23 11:11 Sanchit-404
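
(A sketch of such a pipeline is below. One common source of a large gap is the scoring: SQuAD is normally reported with normalized exact match / F1 rather than raw string accuracy, and the paper's shot count and prompt template also matter. The checkpoint name, prompt template, and generation settings here are illustrative assumptions, not the setup used in the paper.)

```python
# Sketch: zero-shot SQuAD-style evaluation of a causal LM with HF transformers.
# Checkpoint, prompt template, and generation settings are assumptions.
import torch
import evaluate
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumed (gated) checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

squad = load_dataset("squad", split="validation[:200]")  # small slice for a sanity check
squad_metric = evaluate.load("squad")

predictions, references = [], []
for ex in squad:
    prompt = f"Context: {ex['context']}\nQuestion: {ex['question']}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
    answer = tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    answer = answer.split("\n")[0].strip()  # keep only the first generated line

    predictions.append({"id": ex["id"], "prediction_text": answer})
    references.append({"id": ex["id"], "answers": ex["answers"]})

# SQuAD exact-match / F1 normalize articles, punctuation, and whitespace,
# unlike raw string-equality accuracy against the gold answer.
print(squad_metric.compute(predictions=predictions, references=references))
```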