llm-foundry
OpenAI-compatible gauntlet
OpenAI run: api-eval-Ik2iMA
| Category | Benchmark | Subtask | Accuracy | Few-shot | Model |
|:-----------|:----------------|:------------------------------------|-----------:|:------------------|:------------------------------|
| | gsm8k | | 0.482942 | 0-shot | openai/gpt-3.5-turbo-instruct |
| | lambada_openai | | 0.782651 | 0-shot | openai/gpt-3.5-turbo-instruct |
| | triviaqa_sm_sub | | 0.727667 | 3-shot | openai/gpt-3.5-turbo-instruct |
| | jeopardy | Average | 0.553084 | 3-shot | openai/gpt-3.5-turbo-instruct |
| | | american_history | 0.602906 | 3-shot | openai/gpt-3.5-turbo-instruct |
| | | literature | 0.714286 | 3-shot | openai/gpt-3.5-turbo-instruct |
| | | science | 0.434874 | 3-shot | openai/gpt-3.5-turbo-instruct |
| | | word_origins | 0.372603 | 3-shot | openai/gpt-3.5-turbo-instruct |
| | | world_history | 0.640751 | 3-shot | openai/gpt-3.5-turbo-instruct |
| | arc_challenge | | 0.687713 | 25-shot | openai/gpt-3.5-turbo-instruct |
| | mmlu | Average | 0.713291 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | abstract_algebra | 0.47 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | anatomy | 0.674074 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | astronomy | 0.776316 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | business_ethics | 0.79 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | clinical_knowledge | 0.750943 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | college_biology | 0.763889 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | college_chemistry | 0.53 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | college_computer_science | 0.57 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | college_mathematics | 0.47 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | college_medicine | 0.699422 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | college_physics | 0.54902 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | computer_security | 0.81 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | conceptual_physics | 0.67234 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | econometrics | 0.570175 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | electrical_engineering | 0.662069 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | elementary_mathematics | 0.608466 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | formal_logic | 0.642857 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | global_facts | 0.48 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | high_school_biology | 0.809677 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | high_school_chemistry | 0.571429 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | high_school_computer_science | 0.8 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | high_school_european_history | 0.70303 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | high_school_geography | 0.818182 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | high_school_government_and_politics | 0.906736 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | high_school_macroeconomics | 0.720513 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | high_school_mathematics | 0.507407 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | high_school_microeconomics | 0.785714 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | high_school_physics | 0.509934 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | high_school_psychology | 0.838532 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | high_school_statistics | 0.564815 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | high_school_us_history | 0.823529 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | high_school_world_history | 0.763713 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | human_aging | 0.7713 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | human_sexuality | 0.847328 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | international_law | 0.859504 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | jurisprudence | 0.768519 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | logical_fallacies | 0.809816 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | machine_learning | 0.625 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | management | 0.815534 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | marketing | 0.884615 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | medical_genetics | 0.88 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | miscellaneous | 0.872286 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | moral_disputes | 0.710983 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | moral_scenarios | 0.436871 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | nutrition | 0.761438 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | philosophy | 0.713826 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | prehistory | 0.783951 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | professional_accounting | 0.56383 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | professional_law | 0.557366 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | professional_medicine | 0.768382 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | professional_psychology | 0.73366 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | public_relations | 0.790909 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | security_studies | 0.763265 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | sociology | 0.850746 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | us_foreign_policy | 0.93 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | virology | 0.662651 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | | world_religions | 0.883041 | 5-shot | openai/gpt-3.5-turbo-instruct |
| | hellaswag | | 0.706333 | 10-shot | openai/gpt-3.5-turbo-instruct |
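As a quick consistency check on the table, the jeopardy "Average" row equals the unweighted mean of its five subtask accuracies. A minimal Python sketch (subtask values copied verbatim from the table above):

```python
# Reproduce the jeopardy "Average" row from its per-subtask accuracies.
# Values are copied from the results table for openai/gpt-3.5-turbo-instruct.
jeopardy = {
    "american_history": 0.602906,
    "literature": 0.714286,
    "science": 0.434874,
    "word_origins": 0.372603,
    "world_history": 0.640751,
}

# Unweighted mean over the five subtasks.
average = sum(jeopardy.values()) / len(jeopardy)
print(f"{average:.6f}")  # 0.553084, matching the table's Average row
```

The MMLU "Average" row is presumably computed the same way over its 57 subtasks, though that has not been verified here.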