Can't reproduce Llama 3.1 evaluation results
System Info
[pip3] numpy==1.26.3
[pip3] torch==2.3.1+cu121
[pip3] torchaudio==2.3.1+cu121
[pip3] torchvision==0.18.1+cu121
[pip3] triton==2.3.1
[conda] numpy 1.26.3 pypi_0 pypi
[conda] torch 2.3.1+cu121 pypi_0 pypi
[conda] torchaudio 2.3.1+cu121 pypi_0 pypi
[conda] torchvision 0.18.1+cu121 pypi_0 pypi
[conda] triton 2.3.1 pypi_0 pypi
Information
- [ ] The official example scripts
- [ ] My own modified scripts
🐛 Describe the bug
I followed the README but get much lower scores than expected. I downloaded llama-recipes/tools/benchmarks/llm_eval_harness/, installed lm_evaluation_harness inside that folder, and then ran tools/benchmarks/llm_eval_harness/open_llm_eval_prep.sh followed by eval.py. How can I reproduce the reported results correctly?
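For reference, these are roughly the commands I ran from the llama-recipes checkout. They are reconstructed from memory, so treat them as approximate: the harness install command and the exact prompts/arguments of open_llm_eval_prep.sh and eval.py are my assumptions, not copied output.

```bash
cd tools/benchmarks/llm_eval_harness

# Install the evaluation harness inside this folder (per the README, as I understood it)
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
cd ..

# Prepare the Open LLM Leaderboard task configs
# (I answered the script's prompts with my local Hugging Face model path;
#  exact prompts not reproduced here)
bash open_llm_eval_prep.sh

# Run the evaluation with the generated config
python eval.py
```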
Error logs
2024-07-28:11:41:09,270 INFO [eval.py:85]

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---:|---|---:|---|---|---:|---|---:|
|arc 25 shot|1|none|25|acc_norm|↑|0.3000|±|0.0461|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.0300|±|0.0171|
| | |strict-match|5|exact_match|↑|0.0000|±|0.0000|
|hellaswag 10 shot|1|none|10|acc_norm|↑|0.4200|±|0.0496|
|mmlu|1|none| |acc|↑|0.2354|±|0.0056|
| - humanities|1|none| |acc|↑|0.2462|±|0.0119|
| - formal_logic|0|none|0|acc|↑|0.2300|±|0.0423|
| - high_school_european_history|0|none|0|acc|↑|0.2800|±|0.0451|
| - high_school_us_history|0|none|0|acc|↑|0.2400|±|0.0429|
| - high_school_world_history|0|none|0|acc|↑|0.3000|±|0.0461|
| - international_law|0|none|0|acc|↑|0.2500|±|0.0435|
| - jurisprudence|0|none|0|acc|↑|0.2900|±|0.0456|
| - logical_fallacies|0|none|0|acc|↑|0.1900|±|0.0394|
| - moral_disputes|0|none|0|acc|↑|0.2500|±|0.0435|
| - moral_scenarios|0|none|0|acc|↑|0.2200|±|0.0416|
| - philosophy|0|none|0|acc|↑|0.1600|±|0.0368|
| - prehistory|0|none|0|acc|↑|0.2300|±|0.0423|
| - professional_law|0|none|0|acc|↑|0.2200|±|0.0416|
| - world_religions|0|none|0|acc|↑|0.3400|±|0.0476|
| - other|1|none| |acc|↑|0.2354|±|0.0117|
| - business_ethics|0|none|0|acc|↑|0.3200|±|0.0469|
| - clinical_knowledge|0|none|0|acc|↑|0.1400|±|0.0349|
| - college_medicine|0|none|0|acc|↑|0.2100|±|0.0409|
| - global_facts|0|none|0|acc|↑|0.2100|±|0.0409|
| - human_aging|0|none|0|acc|↑|0.3100|±|0.0465|
| - management|0|none|0|acc|↑|0.2000|±|0.0402|
| - marketing|0|none|0|acc|↑|0.3400|±|0.0476|
| - medical_genetics|0|none|0|acc|↑|0.2800|±|0.0451|
| - miscellaneous|0|none|0|acc|↑|0.2100|±|0.0409|
| - nutrition|0|none|0|acc|↑|0.2300|±|0.0423|
| - professional_accounting|0|none|0|acc|↑|0.2100|±|0.0409|
| - professional_medicine|0|none|0|acc|↑|0.1500|±|0.0359|
| - virology|0|none|0|acc|↑|0.2500|±|0.0435|
| - social sciences|1|none| |acc|↑|0.2258|±|0.0121|
| - econometrics|0|none|0|acc|↑|0.2800|±|0.0451|
| - high_school_geography|0|none|0|acc|↑|0.1600|±|0.0368|
| - high_school_government_and_politics|0|none|0|acc|↑|0.1700|±|0.0378|
| - high_school_macroeconomics|0|none|0|acc|↑|0.1600|±|0.0368|
| - high_school_microeconomics|0|none|0|acc|↑|0.2200|±|0.0416|
| - high_school_psychology|0|none|0|acc|↑|0.2200|±|0.0416|
| - human_sexuality|0|none|0|acc|↑|0.2500|±|0.0435|
| - professional_psychology|0|none|0|acc|↑|0.2300|±|0.0423|
| - public_relations|0|none|0|acc|↑|0.2200|±|0.0416|
| - security_studies|0|none|0|acc|↑|0.2300|±|0.0423|
| - sociology|0|none|0|acc|↑|0.2900|±|0.0456|
| - us_foreign_policy|0|none|0|acc|↑|0.2800|±|0.0451|
| - stem|1|none| |acc|↑|0.2342|±|0.0097|
| - abstract_algebra|0|none|0|acc|↑|0.2300|±|0.0423|
| - anatomy|0|none|0|acc|↑|0.1900|±|0.0394|
| - astronomy|0|none|0|acc|↑|0.2100|±|0.0409|
| - college_biology|0|none|0|acc|↑|0.2500|±|0.0435|
| - college_chemistry|0|none|0|acc|↑|0.2500|±|0.0435|
| - college_computer_science|0|none|0|acc|↑|0.2600|±|0.0441|
| - college_mathematics|0|none|0|acc|↑|0.2000|±|0.0402|
| - college_physics|0|none|0|acc|↑|0.2300|±|0.0423|
| - computer_security|0|none|0|acc|↑|0.3000|±|0.0461|
| - conceptual_physics|0|none|0|acc|↑|0.3300|±|0.0473|
| - electrical_engineering|0|none|0|acc|↑|0.2500|±|0.0435|
| - elementary_mathematics|0|none|0|acc|↑|0.2500|±|0.0435|
| - high_school_biology|0|none|0|acc|↑|0.1300|±|0.0338|
| - high_school_chemistry|0|none|0|acc|↑|0.2000|±|0.0402|
| - high_school_computer_science|0|none|0|acc|↑|0.2900|±|0.0456|
| - high_school_mathematics|0|none|0|acc|↑|0.2200|±|0.0416|
| - high_school_physics|0|none|0|acc|↑|0.2000|±|0.0402|
| - high_school_statistics|0|none|0|acc|↑|0.1600|±|0.0368|
| - machine_learning|0|none|0|acc|↑|0.3000|±|0.0461|
|truthfulqa_mc2|2|none|0|acc|↑|0.5092|±|0.0455|
|winogrande 5 shot|1|none|5|acc|↑|0.5500|±|0.0500|
Expected behavior
Results comparable to those reported for Llama 3.1 in the official evaluation report.