llm-foundry

[wip] F1 score

Open · bmosaicml opened this pull request 1 year ago · 1 comment

Implement F1 score for reference-based grading of QA tasks.
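For context, reference-based F1 for QA is usually computed as token-overlap F1 between the generated answer and each gold reference, taking the max over references. The snippet below is a minimal sketch of that SQuAD-style metric, assuming whitespace tokenization and simple normalization; the actual `InContextLearningGenerationF1Score` implementation may differ in details such as article stripping and punctuation handling.

```python
# Sketch of SQuAD-style token-overlap F1 (not the exact llm-foundry metric).
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def f1_score(prediction: str, references: list[str]) -> float:
    """Token-overlap F1 of the prediction, maxed over all gold references."""
    pred_tokens = normalize(prediction)
    best = 0.0
    for ref in references:
        ref_tokens = normalize(ref)
        common = Counter(pred_tokens) & Counter(ref_tokens)
        num_same = sum(common.values())
        if num_same == 0:
            continue
        precision = num_same / len(pred_tokens)
        recall = num_same / len(ref_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Example: partial overlap with one of the references -> ~0.67
print(f1_score("the Eiffel Tower in Paris", ["Eiffel Tower", "Paris"]))
```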

This PR depends on Max's refactor.

Added QuAC, Natural Questions, and NarrativeQA.

Tested on mpt-7b-instruct:

| Category   | Benchmark                        | Subtask   |     Score | Metric name                        | Number few shot   | Model                    |
|:-----------|:---------------------------------|:----------|----------:|:-----------------------------------|:------------------|:-------------------------|
|            | quac                             |           | 0.190577  | InContextLearningGenerationF1Score | 0-shot            | mosaicml/mpt-7b-instruct |
|            | natural_questions_closed         |           | 0.0520243 | InContextLearningGenerationF1Score | 0-shot            | mosaicml/mpt-7b-instruct |
|            | natural_questions_openbook_short |           | 0.23619   | InContextLearningGenerationF1Score | 0-shot            | mosaicml/mpt-7b-instruct |
|            | narrativeqa                      |           | 0.135528  | InContextLearningGenerationF1Score | 0-shot            | mosaicml/mpt-7b-instruct |

Tested MPT-30B (run: mpt-30b-f1-IqFLpz):

| Category   | Benchmark                        | Subtask   |    Score | Metric name                        | Number few shot   | Model                     |
|:-----------|:---------------------------------|:----------|---------:|:-----------------------------------|:------------------|:--------------------------|
|            | quac                             |           | 0.143547 | InContextLearningGenerationF1Score | 0-shot            | mosaicml/mpt-30b-instruct |
|            | natural_questions_closed         |           | 0.051504 | InContextLearningGenerationF1Score | 0-shot            | mosaicml/mpt-30b-instruct |
|            | natural_questions_openbook_short |           | 0.112643 | InContextLearningGenerationF1Score | 0-shot            | mosaicml/mpt-30b-instruct |
|            | narrativeqa                      |           | 0.110033 | InContextLearningGenerationF1Score | 0-shot            | mosaicml/mpt-30b-instruct |
|            | quac                             |           | 0.187507 | InContextLearningGenerationF1Score | 0-shot            | mosaicml/mpt-30b          |
|            | natural_questions_closed         |           | 0.05403  | InContextLearningGenerationF1Score | 0-shot            | mosaicml/mpt-30b          |
|            | natural_questions_openbook_short |           | 0.118126 | InContextLearningGenerationF1Score | 0-shot            | mosaicml/mpt-30b          |
|            | narrativeqa                      |           | 0.16298  | InContextLearningGenerationF1Score | 0-shot            | mosaicml/mpt-30b          |

bmosaicml · Dec 13 '23

Can we test a model with published external QuAC, Natural Questions, etc. results (e.g. Llama 2) to make sure our implementation produces correct numbers?

nik-mosaic · Dec 13 '23

Feel free to reopen if you are still working on this.

dakinggg · May 16 '24