Can't reproduce Llama 3.1 evaluation results
System Info
[pip3] numpy==1.26.3
[pip3] torch==2.3.1+cu121
[pip3] torchaudio==2.3.1+cu121
[pip3] torchvision==0.18.1+cu121
[pip3] triton==2.3.1
[conda] numpy 1.26.3 pypi_0 pypi
[conda] torch 2.3.1+cu121 pypi_0 pypi
[conda] torchaudio 2.3.1+cu121 pypi_0 pypi
[conda] torchvision 0.18.1+cu121 pypi_0 pypi
[conda] triton 2.3.1 pypi_0 pypi
Information
- [ ] The official example scripts
- [ ] My own modified scripts
🐛 Describe the bug
I followed the README but get much lower scores than expected. I downloaded llama-recipes/tools/benchmarks/llm_eval_harness/, installed lm_evaluation_harness inside that folder, and then ran tools/benchmarks/llm_eval_harness/open_llm_eval_prep.sh followed by eval.py. How can I reproduce the reported results correctly?
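For reference, these are roughly the commands I ran from the llama-recipes checkout. They are reconstructed from memory, so treat them as approximate: the harness install command and the exact prompts/arguments of open_llm_eval_prep.sh and eval.py are my assumptions, not copied output.

```bash
cd tools/benchmarks/llm_eval_harness

# Install the evaluation harness inside this folder (per the README, as I understood it)
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
cd ..

# Prepare the Open LLM Leaderboard task configs
# (I answered the script's prompts with my local Hugging Face model path;
#  exact prompts not reproduced here)
bash open_llm_eval_prep.sh

# Run the evaluation with the generated config
python eval.py
```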
Error logs
2024-07-28:11:41:09,270 INFO [eval.py:85]

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---:|---|---:|---|---|---:|---|---:|
|arc 25 shot|1|none|25|acc_norm|↑|0.3000|±|0.0461|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.0300|±|0.0171|
| | |strict-match|5|exact_match|↑|0.0000|±|0.0000|
|hellaswag 10 shot|1|none|10|acc_norm|↑|0.4200|±|0.0496|
|mmlu|1|none| |acc|↑|0.2354|±|0.0056|
| - humanities|1|none| |acc|↑|0.2462|±|0.0119|
| - formal_logic|0|none|0|acc|↑|0.2300|±|0.0423|
| - high_school_european_history|0|none|0|acc|↑|0.2800|±|0.0451|
| - high_school_us_history|0|none|0|acc|↑|0.2400|±|0.0429|
| - high_school_world_history|0|none|0|acc|↑|0.3000|±|0.0461|
| - international_law|0|none|0|acc|↑|0.2500|±|0.0435|
| - jurisprudence|0|none|0|acc|↑|0.2900|±|0.0456|
| - logical_fallacies|0|none|0|acc|↑|0.1900|±|0.0394|
| - moral_disputes|0|none|0|acc|↑|0.2500|±|0.0435|
| - moral_scenarios|0|none|0|acc|↑|0.2200|±|0.0416|
| - philosophy|0|none|0|acc|↑|0.1600|±|0.0368|
| - prehistory|0|none|0|acc|↑|0.2300|±|0.0423|
| - professional_law|0|none|0|acc|↑|0.2200|±|0.0416|
| - world_religions|0|none|0|acc|↑|0.3400|±|0.0476|
| - other|1|none| |acc|↑|0.2354|±|0.0117|
| - business_ethics|0|none|0|acc|↑|0.3200|±|0.0469|
| - clinical_knowledge|0|none|0|acc|↑|0.1400|±|0.0349|
| - college_medicine|0|none|0|acc|↑|0.2100|±|0.0409|
| - global_facts|0|none|0|acc|↑|0.2100|±|0.0409|
| - human_aging|0|none|0|acc|↑|0.3100|±|0.0465|
| - management|0|none|0|acc|↑|0.2000|±|0.0402|
| - marketing|0|none|0|acc|↑|0.3400|±|0.0476|
| - medical_genetics|0|none|0|acc|↑|0.2800|±|0.0451|
| - miscellaneous|0|none|0|acc|↑|0.2100|±|0.0409|
| - nutrition|0|none|0|acc|↑|0.2300|±|0.0423|
| - professional_accounting|0|none|0|acc|↑|0.2100|±|0.0409|
| - professional_medicine|0|none|0|acc|↑|0.1500|±|0.0359|
| - virology|0|none|0|acc|↑|0.2500|±|0.0435|
| - social sciences|1|none| |acc|↑|0.2258|±|0.0121|
| - econometrics|0|none|0|acc|↑|0.2800|±|0.0451|
| - high_school_geography|0|none|0|acc|↑|0.1600|±|0.0368|
| - high_school_government_and_politics|0|none|0|acc|↑|0.1700|±|0.0378|
| - high_school_macroeconomics|0|none|0|acc|↑|0.1600|±|0.0368|
| - high_school_microeconomics|0|none|0|acc|↑|0.2200|±|0.0416|
| - high_school_psychology|0|none|0|acc|↑|0.2200|±|0.0416|
| - human_sexuality|0|none|0|acc|↑|0.2500|±|0.0435|
| - professional_psychology|0|none|0|acc|↑|0.2300|±|0.0423|
| - public_relations|0|none|0|acc|↑|0.2200|±|0.0416|
| - security_studies|0|none|0|acc|↑|0.2300|±|0.0423|
| - sociology|0|none|0|acc|↑|0.2900|±|0.0456|
| - us_foreign_policy|0|none|0|acc|↑|0.2800|±|0.0451|
| - stem|1|none| |acc|↑|0.2342|±|0.0097|
| - abstract_algebra|0|none|0|acc|↑|0.2300|±|0.0423|
| - anatomy|0|none|0|acc|↑|0.1900|±|0.0394|
| - astronomy|0|none|0|acc|↑|0.2100|±|0.0409|
| - college_biology|0|none|0|acc|↑|0.2500|±|0.0435|
| - college_chemistry|0|none|0|acc|↑|0.2500|±|0.0435|
| - college_computer_science|0|none|0|acc|↑|0.2600|±|0.0441|
| - college_mathematics|0|none|0|acc|↑|0.2000|±|0.0402|
| - college_physics|0|none|0|acc|↑|0.2300|±|0.0423|
| - computer_security|0|none|0|acc|↑|0.3000|±|0.0461|
| - conceptual_physics|0|none|0|acc|↑|0.3300|±|0.0473|
| - electrical_engineering|0|none|0|acc|↑|0.2500|±|0.0435|
| - elementary_mathematics|0|none|0|acc|↑|0.2500|±|0.0435|
| - high_school_biology|0|none|0|acc|↑|0.1300|±|0.0338|
| - high_school_chemistry|0|none|0|acc|↑|0.2000|±|0.0402|
| - high_school_computer_science|0|none|0|acc|↑|0.2900|±|0.0456|
| - high_school_mathematics|0|none|0|acc|↑|0.2200|±|0.0416|
| - high_school_physics|0|none|0|acc|↑|0.2000|±|0.0402|
| - high_school_statistics|0|none|0|acc|↑|0.1600|±|0.0368|
| - machine_learning|0|none|0|acc|↑|0.3000|±|0.0461|
|truthfulqa_mc2|2|none|0|acc|↑|0.5092|±|0.0455|
|winogrande 5 shot|1|none|5|acc|↑|0.5500|±|0.0500|
Expected behavior
Results comparable to those reported for Llama 3.1 in the official evaluation report.