Jeremy D issues

Results: 14 issues of Jeremy D

Refactor qa

This PR is stacked on top of the migration PR (https://github.com/mosaicml/llm-foundry/pull/936). It does 5 things: 1. Refactor the CodeEval and QA tasks to share a superclass called InContextLearningGenerationTaskDataset. 2. Rename...

Add big bench hard

Adds the Big Bench Hard subset as a set of combined CoT tasks, formatted according to the specification in [this repo](https://github.com/suzgunmirac/BIG-Bench-Hard/tree/main). These tasks are quite large and quite slow. I don't...

[wip] F1 score

Implements F1 score for reference-based grading of QA tasks. This PR depends on Max's [refactor](https://github.com/mosaicml/composer/pull/2713). Added QuAC, Natural Questions, and NarrativeQA. Tested on mpt-7b-instruct: ``` | Category | Benchmark...
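For readers unfamiliar with reference-based F1, here is a minimal sketch of the token-overlap F1 commonly used for extractive QA. It is not the code from this PR: the function names `normalize_answer`, `token_f1`, and `best_f1` are hypothetical, and the normalization steps (lowercasing, stripping punctuation and English articles) are an assumption borrowed from common QA evaluation practice.

```python
import re
import string
from collections import Counter
from typing import List


def normalize_answer(text: str) -> List[str]:
    """Lowercase, strip punctuation and articles, split on whitespace.

    These normalization steps are an assumption, not taken from the PR.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against one reference."""
    pred_tokens = normalize_answer(prediction)
    ref_tokens = normalize_answer(reference)
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction: str, references: List[str]) -> float:
    """Score against several gold answers and keep the best match."""
    return max(token_f1(prediction, ref) for ref in references)
```

Taking the maximum over references is one common choice when a question has several acceptable answers; averaging over references would be an alternative design.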

[wip] brier score

The Brier score seems of questionable usefulness. COPA results: the first number for each model is the Brier score. Below we find that accuracy AND Brier score both go up with model size...
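As a reminder of what is being measured, the sketch below computes the standard multi-class Brier score: the mean squared difference between the predicted probability vector and the one-hot label. Under this definition lower is better, so a score that rises alongside accuracy is surprising. This is an illustration only, not the PR's implementation. The name `brier_score` is hypothetical, and how the per-choice probabilities are obtained from the model is left open.

```python
from typing import List, Sequence


def brier_score(probs: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean over examples of sum_k (p_k - y_k)^2, with y the one-hot label.

    probs[i] holds the predicted probabilities over the answer choices of
    example i (assumed to sum to 1); labels[i] is the index of the correct
    choice. Ranges from 0 (perfect) to 2 (fully confident and wrong).
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    if not probs:
        raise ValueError("need at least one example")
    total = 0.0
    for p, label in zip(probs, labels):
        total += sum(
            (pk - (1.0 if k == label else 0.0)) ** 2 for k, pk in enumerate(p)
        )
    return total / len(probs)


# Two-choice examples in the style of COPA (values are made up for illustration).
example_probs: List[List[float]] = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
example_labels = [0, 0, 0]
```

A confident correct prediction contributes 0, a uniform guess over two choices contributes 0.5, and a confident wrong prediction contributes 2, so a model can gain accuracy while its Brier score worsens if its wrong answers become more confident.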