
OpenAI-compatible Gauntlet

bmosaicml opened this issue.

OpenAI run: api-eval-Ik2iMA

| Category   | Benchmark       | Subtask                             |   Accuracy | Few-shot          | Model                         |
|:-----------|:----------------|:------------------------------------|-----------:|:------------------|:------------------------------|
|            | gsm8k           |                                     |   0.482942 | 0-shot            | openai/gpt-3.5-turbo-instruct |
|            | lambada_openai  |                                     |   0.782651 | 0-shot            | openai/gpt-3.5-turbo-instruct |
|            | triviaqa_sm_sub |                                     |   0.727667 | 3-shot            | openai/gpt-3.5-turbo-instruct |
|            | jeopardy        | Average                             |   0.553084 | 3-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | american_history                    |   0.602906 | 3-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | literature                          |   0.714286 | 3-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | science                             |   0.434874 | 3-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | word_origins                        |   0.372603 | 3-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | world_history                       |   0.640751 | 3-shot            | openai/gpt-3.5-turbo-instruct |
|            | arc_challenge   |                                     |   0.687713 | 25-shot           | openai/gpt-3.5-turbo-instruct |
|            | mmlu            | Average                             |   0.713291 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | abstract_algebra                    |   0.47     | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | anatomy                             |   0.674074 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | astronomy                           |   0.776316 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | business_ethics                     |   0.79     | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | clinical_knowledge                  |   0.750943 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | college_biology                     |   0.763889 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | college_chemistry                   |   0.53     | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | college_computer_science            |   0.57     | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | college_mathematics                 |   0.47     | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | college_medicine                    |   0.699422 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | college_physics                     |   0.54902  | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | computer_security                   |   0.81     | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | conceptual_physics                  |   0.67234  | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | econometrics                        |   0.570175 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | electrical_engineering              |   0.662069 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | elementary_mathematics              |   0.608466 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | formal_logic                        |   0.642857 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | global_facts                        |   0.48     | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | high_school_biology                 |   0.809677 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | high_school_chemistry               |   0.571429 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | high_school_computer_science        |   0.8      | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | high_school_european_history        |   0.70303  | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | high_school_geography               |   0.818182 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | high_school_government_and_politics |   0.906736 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | high_school_macroeconomics          |   0.720513 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | high_school_mathematics             |   0.507407 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | high_school_microeconomics          |   0.785714 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | high_school_physics                 |   0.509934 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | high_school_psychology              |   0.838532 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | high_school_statistics              |   0.564815 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | high_school_us_history              |   0.823529 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | high_school_world_history           |   0.763713 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | human_aging                         |   0.7713   | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | human_sexuality                     |   0.847328 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | international_law                   |   0.859504 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | jurisprudence                       |   0.768519 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | logical_fallacies                   |   0.809816 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | machine_learning                    |   0.625    | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | management                          |   0.815534 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | marketing                           |   0.884615 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | medical_genetics                    |   0.88     | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | miscellaneous                       |   0.872286 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | moral_disputes                      |   0.710983 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | moral_scenarios                     |   0.436871 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | nutrition                           |   0.761438 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | philosophy                          |   0.713826 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | prehistory                          |   0.783951 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | professional_accounting             |   0.56383  | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | professional_law                    |   0.557366 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | professional_medicine               |   0.768382 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | professional_psychology             |   0.73366  | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | public_relations                    |   0.790909 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | security_studies                    |   0.763265 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | sociology                           |   0.850746 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | us_foreign_policy                   |   0.93     | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | virology                            |   0.662651 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | world_religions                     |   0.883041 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            | hellaswag       |                                     |   0.706333 | 10-shot           | openai/gpt-3.5-turbo-instruct |
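The `Average` rows appear to be unweighted (macro) means over each benchmark's subtasks. For jeopardy, the five subtask accuracies above average to exactly the reported 0.553084. This is inferred from the numbers, not from llm-foundry's documentation. The sketch below reproduces that check. `macro_average` is a hypothetical helper, not an llm-foundry function.

```python
from statistics import fmean

# Jeopardy subtask accuracies (3-shot), copied from the results table.
jeopardy = {
    "american_history": 0.602906,
    "literature": 0.714286,
    "science": 0.434874,
    "word_origins": 0.372603,
    "world_history": 0.640751,
}


def macro_average(scores: dict[str, float]) -> float:
    """Hypothetical helper: unweighted mean over subtasks.

    Every subtask counts equally, regardless of how many questions it
    contains. A per-question (micro) average would weight larger
    subtasks more heavily and could give a different number.
    """
    return fmean(scores.values())


avg = macro_average(jeopardy)
print(f"{avg:.6f}")  # 0.553084, matching the jeopardy Average row
```

The same check could be applied to the 57 MMLU subtask rows and compared against the reported MMLU average of 0.713291.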

bmosaicml, Mar 08 '24 00:03