
Refactor qa


This PR is stacked on top of the migration PR https://github.com/mosaicml/llm-foundry/pull/936

It does five things:

  1. Refactor the CodeEval and QA tasks to share a superclass called InContextLearningGenerationTaskDataset
  2. Rename the QA task dataset to InContextLearningGenerationTaskWithAnswersDataset
  3. Introduce post-processing functionality shared between all generation tasks. Users can now write arbitrary post-processing functions and add them to a registry that is then accessible via config (see the sketch after this list).
  4. Implement 3 starter post-processing functions that had previously been hardcoded: early stopping, triviaqa-style normalization, and regex parsing
  5. Modify the QAAccuracy and CodeEval accuracy metrics to apply post-processing functions to the generations at update time.
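A minimal sketch of the registry pattern, assuming illustrative names (`POSTPROCESSING_REGISTRY`, `register_postprocessor`, and the stop strings) rather than the actual llm-foundry API:

```python
import re
import string
from typing import Callable, Dict

# Maps a name referenced in an eval config to a function that turns a
# raw generation string into a cleaned-up answer string.
POSTPROCESSING_REGISTRY: Dict[str, Callable[[str], str]] = {}

def register_postprocessor(name: str):
    """Decorator that registers a post-processing function under `name`."""
    def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
        POSTPROCESSING_REGISTRY[name] = fn
        return fn
    return decorator

@register_postprocessor('early_stopping')
def early_stopping(generation: str) -> str:
    # Truncate at the first stop sequence (stop strings are illustrative).
    for stop in ('\n\n', 'Q:'):
        idx = generation.find(stop)
        if idx != -1:
            generation = generation[:idx]
    return generation

@register_postprocessor('triviaqa_normalization')
def triviaqa_normalize(generation: str) -> str:
    # TriviaQA-style normalization: lowercase, drop punctuation and
    # articles, and collapse whitespace.
    s = generation.lower()
    s = ''.join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r'\b(a|an|the)\b', ' ', s)
    return ' '.join(s.split())
```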

InContextLearningGenerationTaskDataset handles initializing the post-processing functions from the config, and the metrics are then responsible for applying them to the outputs. This split is necessary because CodeEval receives many outputs per input, while QAAccuracy receives one.
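A rough sketch of that division of labor (class and function names here are hypothetical, not the real metric classes):

```python
# Hypothetical sketch: the dataset resolves configured names into
# callables, and each metric applies them at update time.
from typing import Callable, List

def apply_postprocessors(text: str,
                         fns: List[Callable[[str], str]]) -> str:
    # Apply each post-processing function in order.
    for fn in fns:
        text = fn(text)
    return text

class QAAccuracySketch:
    """One generation per input: post-process it, then check membership
    in the gold answer set."""

    def __init__(self, postprocessors: List[Callable[[str], str]]):
        self.postprocessors = postprocessors
        self.correct = 0
        self.total = 0

    def update(self, generations: List[str],
               answers: List[List[str]]) -> None:
        for gen, golds in zip(generations, answers):
            cleaned = apply_postprocessors(gen, self.postprocessors)
            self.correct += int(cleaned in golds)
            self.total += 1

# A CodeEval-style metric would differ only in its update loop: each
# input carries a *list* of sampled generations, so the metric
# post-processes every candidate before scoring.
```

Resolving names from a config is then just a dictionary lookup, e.g. `[POSTPROCESSING_REGISTRY[n] for n in cfg.postprocessing_fns]` (again, a hypothetical config key).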

This refactoring makes us more in line with Eleuther's eval harness, which allows specifying custom post-processing functions for generation tasks. They support arbitrary regex parsing, whereas we support arbitrary modifications, in order to capture the commonality among things like triviaqa normalization, early stopping, and regex parsing.
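Continuing the sketch above, a regex-based parser slots into the same registry as any other modification (the pattern below, which pulls the last number from a GSM8K-style chain of thought, is purely illustrative):

```python
@register_postprocessor('regex_parsing')
def regex_parse(generation: str) -> str:
    # Extract the final number in the generation, e.g. the answer at the
    # end of a GSM8K-style chain of thought; fall back to the raw text.
    matches = re.findall(r'-?\d[\d,]*(?:\.\d+)?', generation)
    return matches[-1].replace(',', '') if matches else generation
```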

Test: `mcli logs mpt-eval-rTlNa9`

Confirmed that all performance is identical to before:

| model_name      |   core_average |   world_knowledge |   commonsense_reasoning |   language_understanding |   symbolic_problem_solving |   reading_comprehension |
|:----------------|---------------:|------------------:|------------------------:|-------------------------:|---------------------------:|------------------------:|
| mosaicml/mpt-7b |       0.343081 |          0.421662 |                0.256372 |                 0.634086 |                   0.155426 |                0.247861 |
| Category                 | Benchmark                    | Subtask                             |   Accuracy | Num fewshot       | Model           |
|:-------------------------|:-----------------------------|:------------------------------------|-----------:|:------------------|:----------------|
| symbolic_problem_solving | gsm8k                        |                                     |  0.0871873 | 0-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | copa                         |                                     |  0.8       | 0-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | commonsense_qa               |                                     |  0.225225  | 0-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | piqa                         |                                     |  0.799238  | 0-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | bigbench_strange_stories     |                                     |  0.568965  | 0-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | bigbench_strategy_qa         |                                     |  0.561817  | 0-shot            | mosaicml/mpt-7b |
| language_understanding   | lambada_openai               |                                     |  0.702892  | 0-shot            | mosaicml/mpt-7b |
| language_understanding   | hellaswag                    |                                     |  0.761601  | 0-shot            | mosaicml/mpt-7b |
| reading_comprehension    | coqa                         |                                     |  0.453213  | 0-shot            | mosaicml/mpt-7b |
| reading_comprehension    | boolq                        |                                     |  0.747401  | 0-shot            | mosaicml/mpt-7b |
| world_knowledge          | triviaqa_sm_sub              |                                     |  0.493667  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          | jeopardy                     | Average                             |  0.459835  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | american_history                    |  0.513317  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | literature                          |  0.557143  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | science                             |  0.386555  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | word_origins                        |  0.265753  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | world_history                       |  0.576407  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          | bigbench_qa_wikidata         |                                     |  0.655824  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          | arc_easy                     |                                     |  0.718855  | 3-shot            | mosaicml/mpt-7b |
| world_knowledge          | arc_challenge                |                                     |  0.440273  | 3-shot            | mosaicml/mpt-7b |
| commonsense_reasoning    | siqa                         |                                     |  0.54913   | 3-shot            | mosaicml/mpt-7b |
| language_understanding   | winograd                     |                                     |  0.85348   | 3-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | bigbench_operators           |                                     |  0.333333  | 3-shot            | mosaicml/mpt-7b |
| reading_comprehension    | squad                        |                                     |  0.553264  | 3-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | svamp                        |                                     |  0.32      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          | mmlu                         | Average                             |  0.281358  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | abstract_algebra                    |  0.26      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | anatomy                             |  0.303704  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | astronomy                           |  0.309211  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | business_ethics                     |  0.38      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | clinical_knowledge                  |  0.286792  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_biology                     |  0.291667  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_chemistry                   |  0.21      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_computer_science            |  0.25      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_mathematics                 |  0.31      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_medicine                    |  0.225434  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | college_physics                     |  0.215686  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | computer_security                   |  0.35      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | conceptual_physics                  |  0.289362  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | econometrics                        |  0.245614  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | electrical_engineering              |  0.324138  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | elementary_mathematics              |  0.272487  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | formal_logic                        |  0.222222  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | global_facts                        |  0.32      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_biology                 |  0.3       | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_chemistry               |  0.187192  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_computer_science        |  0.34      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_european_history        |  0.321212  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_geography               |  0.313131  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_government_and_politics |  0.264249  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_macroeconomics          |  0.266667  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_mathematics             |  0.211111  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_microeconomics          |  0.247899  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_physics                 |  0.291391  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_psychology              |  0.251376  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_statistics              |  0.208333  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_us_history              |  0.181373  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | high_school_world_history           |  0.253165  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | human_aging                         |  0.403587  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | human_sexuality                     |  0.259542  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | international_law                   |  0.347107  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | jurisprudence                       |  0.324074  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | logical_fallacies                   |  0.251534  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | machine_learning                    |  0.321429  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | management                          |  0.242718  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | marketing                           |  0.299145  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | medical_genetics                    |  0.22      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | miscellaneous                       |  0.301405  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | moral_disputes                      |  0.32659   | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | moral_scenarios                     |  0.259218  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | nutrition                           |  0.30719   | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | philosophy                          |  0.315113  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | prehistory                          |  0.302469  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | professional_accounting             |  0.248227  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | professional_law                    |  0.269231  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | professional_medicine               |  0.198529  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | professional_psychology             |  0.271242  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | public_relations                    |  0.381818  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | security_studies                    |  0.236735  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | sociology                           |  0.268657  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | us_foreign_policy                   |  0.36      | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | virology                            |  0.349398  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          |                              | world_religions                     |  0.269006  | 5-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | bigbench_dyck_languages      |                                     |  0.304     | 5-shot            | mosaicml/mpt-7b |
| language_understanding   | winogrande                   |                                     |  0.722178  | 5-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | agi_eval_lsat_ar             |                                     |  0.23913   | 5-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | simple_arithmetic_nospaces   |                                     |  0.082     | 5-shot            | mosaicml/mpt-7b |
| symbolic_problem_solving | simple_arithmetic_withspaces |                                     |  0.089     | 5-shot            | mosaicml/mpt-7b |
| reading_comprehension    | agi_eval_lsat_rc             |                                     |  0.235075  | 5-shot            | mosaicml/mpt-7b |
| reading_comprehension    | agi_eval_lsat_lr             |                                     |  0.247059  | 5-shot            | mosaicml/mpt-7b |
| reading_comprehension    | agi_eval_sat_en              |                                     |  0.257282  | 5-shot            | mosaicml/mpt-7b |
| world_knowledge          | arc_challenge                |                                     |  0.4343    | 25-shot           | mosaicml/mpt-7b |
| commonsense_reasoning    | openbook_qa                  |                                     |  0.452     | 10-shot           | mosaicml/mpt-7b |
| language_understanding   | hellaswag                    |                                     |  0.765385  | 10-shot           | mosaicml/mpt-7b |
| symbolic_problem_solving | bigbench_cs_algorithms       |                                     |  0.480303  | 10-shot           | mosaicml/mpt-7b |
| symbolic_problem_solving | bigbench_elementary_math_qa  |                                     |  0.281787  | 1-shot            | mosaicml/mpt-7b |
