llm-foundry
Model gauntlet
Created model gauntlet.
This PR makes a number of significant changes: it checks in 38 datasets, adds a callback that computes model gauntlet scores from a large number of benchmarks, and documents the model gauntlet datasets in `eval/local_data/README.md`.
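As a rough illustration of the scoring scheme, the overall gauntlet score in the table below is the mean of the per-category scores, each of which aggregates its benchmarks. The sketch below is a simplified, hypothetical version of that aggregation (the function name and input shape are assumptions; the actual callback also applies adjustments such as random-baseline rescaling, so raw benchmark means will not exactly reproduce the category scores printed below):

```python
from collections import defaultdict

def gauntlet_scores(results):
    """Average per-benchmark accuracies into per-category scores,
    then average the category scores into one overall gauntlet score.

    `results` maps (category, benchmark) -> accuracy.
    """
    by_category = defaultdict(list)
    for (category, _benchmark), accuracy in results.items():
        by_category[category].append(accuracy)
    category_scores = {
        cat: sum(accs) / len(accs) for cat, accs in by_category.items()
    }
    overall = sum(category_scores.values()) / len(category_scores)
    return category_scores, overall

# Hypothetical call using a small subset of the benchmarks below:
scores, overall = gauntlet_scores({
    ("world_knowledge", "arc_easy"): 0.748737,
    ("world_knowledge", "arc_challenge"): 0.47099,
    ("reading_comprehension", "boolq"): 0.777064,
})
```

Each benchmark contributes equally within its category, and each category contributes equally to the overall score, so adding benchmarks to one category does not change its weight relative to the others.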
Eval runs successfully and produces correct results.
Printing gauntlet results for all models
| model_name | average | world_knowledge | commonsense_reasoning | language_understanding | symbolic_problem_solving | reading_comprehension | programming |
|:-------------------------|----------:|------------------:|------------------------:|-------------------------:|---------------------------:|------------------------:|--------------:|
| mosaicml/mpt-7b-instruct | 0.303923 | 0.400286 | 0.415097 | 0.422248 | 0.171216 | 0.414691 | 0 |
Printing complete results for all models
| Category | Benchmark | Subtask | Accuracy | Number few shot | Model |
|:-------------------------|:---------------------------------|:------------------------------------|-----------:|------------------:|:-------------------------|
| world_knowledge | jeopardy | Average | 0.458112 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | american_history | 0.51816 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | literature | 0.540816 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | science | 0.34874 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | word_origins | 0.287671 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | world_history | 0.595174 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | bigbench_qa_wikidata | | 0.694503 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | arc_easy | | 0.748737 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | arc_challenge | | 0.47099 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | mmlu | Average | 0.312989 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | abstract_algebra | 0.31 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | anatomy | 0.311111 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | astronomy | 0.315789 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | business_ethics | 0.26 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | clinical_knowledge | 0.316981 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | college_biology | 0.256944 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | college_chemistry | 0.33 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | college_computer_science | 0.29 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | college_mathematics | 0.29 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | college_medicine | 0.271676 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | college_physics | 0.264706 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | computer_security | 0.37 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | conceptual_physics | 0.33617 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | econometrics | 0.192982 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | electrical_engineering | 0.324138 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | elementary_mathematics | 0.259259 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | formal_logic | 0.301587 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | global_facts | 0.35 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | high_school_biology | 0.33871 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | high_school_chemistry | 0.270936 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | high_school_computer_science | 0.29 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | high_school_european_history | 0.30303 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | high_school_geography | 0.388889 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | high_school_government_and_politics | 0.362694 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | high_school_macroeconomics | 0.325641 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | high_school_mathematics | 0.288889 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | high_school_microeconomics | 0.331933 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | high_school_physics | 0.311258 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | high_school_psychology | 0.308257 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | high_school_statistics | 0.388889 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | high_school_us_history | 0.27451 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | high_school_world_history | 0.261603 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | human_aging | 0.372197 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | human_sexuality | 0.374046 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | international_law | 0.31405 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | jurisprudence | 0.342593 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | logical_fallacies | 0.226994 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | machine_learning | 0.241071 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | management | 0.339806 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | marketing | 0.320513 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | medical_genetics | 0.34 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | miscellaneous | 0.386973 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | moral_disputes | 0.323699 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | moral_scenarios | 0.251397 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | nutrition | 0.366013 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | philosophy | 0.37299 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | prehistory | 0.33642 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | professional_accounting | 0.27305 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | professional_law | 0.273794 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | professional_medicine | 0.220588 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | professional_psychology | 0.287582 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | public_relations | 0.418182 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | security_studies | 0.334694 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | sociology | 0.308458 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | us_foreign_policy | 0.37 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | virology | 0.385542 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | | world_religions | 0.263158 | 10 | mosaicml/mpt-7b-instruct |
| world_knowledge | bigbench_misconceptions | | 0.60274 | 10 | mosaicml/mpt-7b-instruct |
| commonsense_reasoning | piqa | | 0.806311 | 10 | mosaicml/mpt-7b-instruct |
| commonsense_reasoning | bigbench_novel_concepts | | 0.53125 | 10 | mosaicml/mpt-7b-instruct |
| commonsense_reasoning | bigbench_strange_stories | | 0.701149 | 10 | mosaicml/mpt-7b-instruct |
| commonsense_reasoning | bigbench_strategy_qa | | 0.59633 | 10 | mosaicml/mpt-7b-instruct |
| language_understanding | hellaswag | | 0.769767 | 10 | mosaicml/mpt-7b-instruct |
| language_understanding | bigbench_conlang_translation | | 0.0426829 | 10 | mosaicml/mpt-7b-instruct |
| language_understanding | bigbench_language_identification | | 0.2568 | 10 | mosaicml/mpt-7b-instruct |
| language_understanding | bigbench_conceptual_combinations | | 0.320388 | 10 | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | bigbench_elementary_math_qa | | 0.270466 | 10 | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | bigbench_dyck_languages | | 0.314 | 10 | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | bigbench_cs_algorithms | | 0.496212 | 10 | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | bigbench_logical_deduction | | 0.262667 | 10 | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | bigbench_operators | | 0.352381 | 10 | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | bigbench_repeat_copy_logic | | 0.3125 | 10 | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | simple_arithmetic_nospaces | | 0.078 | 10 | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | simple_arithmetic_withspaces | | 0.086 | 10 | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | math_qa | | 0.257459 | 10 | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | logi_qa | | 0.264209 | 10 | mosaicml/mpt-7b-instruct |
| reading_comprehension | pubmed_qa_labeled | | 0.59 | 10 | mosaicml/mpt-7b-instruct |
| reading_comprehension | squad | | 0.586944 | 10 | mosaicml/mpt-7b-instruct |
| reading_comprehension | bigbench_understanding_fables | | 0.195767 | 10 | mosaicml/mpt-7b-instruct |
| reading_comprehension | boolq | | 0.777064 | 10 | mosaicml/mpt-7b-instruct |
| commonsense_reasoning | copa | | 0.83 | 0 | mosaicml/mpt-7b-instruct |
| commonsense_reasoning | openbook_qa | | 0.436 | 0 | mosaicml/mpt-7b-instruct |
| language_understanding | lambada_openai | | 0.69086 | 0 | mosaicml/mpt-7b-instruct |
| language_understanding | winograd | | 0.846154 | 0 | mosaicml/mpt-7b-instruct |
| language_understanding | winogrande | | 0.67719 | 0 | mosaicml/mpt-7b-instruct |