[wip] F1 score
Implement F1 score for reference-based grading of QA tasks.
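For reference, this kind of reference-based QA grading typically uses the standard SQuAD-style token-overlap F1: normalize both strings, count overlapping tokens, and take the max over the reference answers. The sketch below is illustrative only; the function names are not the actual `InContextLearningGenerationF1Score` API.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between one generation and one reference answer."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def reference_based_f1(prediction: str, references: list[str]) -> float:
    """Score a generation against multiple references by taking the best F1."""
    return max(token_f1(prediction, ref) for ref in references)


# Example: exact token match with one of the references gives 1.0.
print(reference_based_f1("Paris, France", ["Paris", "Paris, France"]))  # 1.0
```

The reported benchmark scores are the mean of this per-example F1 over the dataset.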
This PR depends on Max's refactor.
Added QuAC, Natural Questions, and NarrativeQA.
Tested mpt-7b-instruct:
| Category | Benchmark | Subtask | Score | Metric name | Number few shot | Model |
|:-----------|:---------------------------------|:----------|----------:|:-----------------------------------|:------------------|:-------------------------|
| | quac | | 0.190577 | InContextLearningGenerationF1Score | 0-shot | mosaicml/mpt-7b-instruct |
| | natural_questions_closed | | 0.0520243 | InContextLearningGenerationF1Score | 0-shot | mosaicml/mpt-7b-instruct |
| | natural_questions_openbook_short | | 0.23619 | InContextLearningGenerationF1Score | 0-shot | mosaicml/mpt-7b-instruct |
| | narrativeqa | | 0.135528 | InContextLearningGenerationF1Score | 0-shot | mosaicml/mpt-7b-instruct |
Tested MPT-30B (run: mpt-30b-f1-IqFLpz):
| Category | Benchmark | Subtask | Score | Metric name | Number few shot | Model |
|:-----------|:---------------------------------|:----------|---------:|:-----------------------------------|:------------------|:--------------------------|
| | quac | | 0.143547 | InContextLearningGenerationF1Score | 0-shot | mosaicml/mpt-30b-instruct |
| | natural_questions_closed | | 0.051504 | InContextLearningGenerationF1Score | 0-shot | mosaicml/mpt-30b-instruct |
| | natural_questions_openbook_short | | 0.112643 | InContextLearningGenerationF1Score | 0-shot | mosaicml/mpt-30b-instruct |
| | narrativeqa | | 0.110033 | InContextLearningGenerationF1Score | 0-shot | mosaicml/mpt-30b-instruct |
| | quac | | 0.187507 | InContextLearningGenerationF1Score | 0-shot | mosaicml/mpt-30b |
| | natural_questions_closed | | 0.05403 | InContextLearningGenerationF1Score | 0-shot | mosaicml/mpt-30b |
| | natural_questions_openbook_short | | 0.118126 | InContextLearningGenerationF1Score | 0-shot | mosaicml/mpt-30b |
| | narrativeqa | | 0.16298 | InContextLearningGenerationF1Score | 0-shot | mosaicml/mpt-30b |
Can we test a model with published QuAC, Natural Questions, etc. results (e.g., Llama 2) to make sure our implementation produces correct numbers?
Feel free to reopen if you are still working on this.