Jeremy D
Run with amp_fp16:

| Benchmark | Subcategory | Accuracy | Number few shot | Model           |
|:----------|:------------|---------:|----------------:|:----------------|
| jeopardy  | Average     | 0.279767 |               0 | mosaicml/mpt-7b |
| …         |             |          |                 |                 |
Created model gauntlet. This PR makes a number of significant changes: it checks in 38 datasets and adds a callback that can compute model gauntlet scores from a large number...
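For a sense of what such an aggregation can look like, here is a minimal sketch (the function name, record schema, and single chance baseline of 0.25 are illustrative assumptions, not the PR's actual code — real benchmarks have different chance baselines):

```python
from collections import defaultdict

def compute_gauntlet_score(results, baseline=0.25):
    """Average per-benchmark accuracies into category scores and an overall
    score, rescaling so that chance performance maps to zero.

    results: records like
        {"category": "world_knowledge", "benchmark": "jeopardy", "accuracy": 0.28}
    """
    by_category = defaultdict(list)
    for r in results:
        # Rescale so chance maps to 0 and perfect accuracy maps to 1;
        # clamp below-chance results to 0.
        rescaled = (r["accuracy"] - baseline) / (1 - baseline)
        by_category[r["category"]].append(max(rescaled, 0.0))
    category_scores = {c: sum(v) / len(v) for c, v in by_category.items()}
    overall = sum(category_scores.values()) / len(category_scores)
    return category_scores, overall
```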
Confirmed fp16 is slightly better than bf16. I also edited the eval script to compute averages across benchmarks with sub-scores and to log the table results in markdown format...
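A rough sketch of the averaging-plus-markdown step (the `add_benchmark_averages`/`to_markdown` helpers and column names are hypothetical, chosen to mirror the table above):

```python
from collections import defaultdict

def add_benchmark_averages(rows):
    """Insert a synthetic 'Average' row ahead of each benchmark's sub-scores."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["Benchmark"]].append(row)
    out = []
    for bench, subs in grouped.items():
        avg = sum(r["Accuracy"] for r in subs) / len(subs)
        out.append({"Benchmark": bench, "Subcategory": "Average",
                    "Accuracy": round(avg, 6)})
        out.extend(subs)
    return out

def to_markdown(rows, headers=("Benchmark", "Subcategory", "Accuracy")):
    """Render rows as a GitHub-flavored markdown table."""
    lines = ["| " + " | ".join(headers) + " |",
             "|" + "|".join(":---" for _ in headers) + "|"]
    for row in rows:
        lines.append("| " + " | ".join(str(row.get(h, "")) for h in headers) + " |")
    return "\n".join(lines)
```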
I pulled the test data linked in the README, and I noticed that within each category there is basically never an even 25% split between A, B, C, and D...
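The check itself is a quick label count; something like this (the JSONL format and the `answer` key are assumptions about the dataset schema):

```python
import json
from collections import Counter

def answer_distribution(path):
    """Print the share of each gold answer letter in a JSONL MC dataset."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            counts[json.loads(line)["answer"]] += 1
    total = sum(counts.values())
    for label in sorted(counts):
        print(f"{label}: {counts[label] / total:.1%}")
```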
# What does this PR do?

We are migrating ICL tasks from composer to foundry and need to deprecate the existing composer implementations. The migration PR is here: https://github.com/mosaicml/llm-foundry/pull/936

#...
# What does this PR do?

This PR removes/deprecates the ICL(Dataset|Metric) subclasses and migrates the relevant tests. It is not strictly necessary, but it would help prevent confusion about where...
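A common shape for this kind of deprecation, sketched under the assumption that the composer classes are kept as thin warning shims for one release (not the PR's literal diff):

```python
import warnings

class InContextLearningDataset:
    """Illustrative stand-in for the deprecated composer-side class."""

    def __init__(self, *args, **kwargs):
        # Point users at the llm-foundry implementation before removal.
        warnings.warn(
            "InContextLearningDataset has moved to llm-foundry; the composer "
            "implementation is deprecated and will be removed in a future release.",
            DeprecationWarning,
            stacklevel=2,
        )
```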
# What does this PR do?

This PR adds a callback that logs ICL outputs during eval. It modifies the custom metrics to keep track of incorrect model outputs. Each...
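A skeleton of how such a callback can hang off Composer's eval loop (the `incorrect_outputs` attribute on the metrics is a hypothetical hook here; the real metric API may differ):

```python
from composer.core import Callback, State
from composer.loggers import Logger

class EvalOutputLogging(Callback):
    """Sketch: after each eval batch, flush any incorrect outputs that the
    ICL metrics have buffered."""

    def eval_batch_end(self, state: State, logger: Logger) -> None:
        metrics = state.eval_metrics.get(state.dataloader_label, {})
        for name, metric in metrics.items():
            # Assumed hook: a list of wrong answers accumulated by the metric.
            outputs = getattr(metric, "incorrect_outputs", None)
            if outputs:
                logger.log_metrics({f"icl_outputs/{name}": list(outputs)})
                outputs.clear()
```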
# What does this PR do?

This PR introduces the execution prediction task. It is an auxiliary task, compatible with any code evaluation dataset, that requires the model to inspect...
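Roughly, turning a code-eval record into an execution prediction example could look like this (the field names mirror HumanEval-style schemas and are assumptions, not the PR's exact format):

```python
def make_execution_prediction_example(code: str, entry_point: str,
                                      test_input: str) -> dict:
    """Build a prompt asking the model to predict what the code evaluates to,
    rather than to generate the code itself."""
    prompt = (
        f"{code}\n\n"
        f"What does `{entry_point}({test_input})` evaluate to? "
        "Answer with the literal value only.\nAnswer:"
    )
    return {"prompt": prompt}
```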
The following solution and tests don't seem to match the description. Can you explain why we are bounding `lower` and `upper` to be between 2 and 8? `def generate_integers(a, b):...
OpenAI run: `api-eval-Ik2iMA`

```
| Category | Benchmark | Subtask | Accuracy | Number few shot | Model                          |
|:---------|:----------|:--------|---------:|:----------------|:-------------------------------|
|          | gsm8k     |         | 0.482942 | 0-shot          | openai/gpt-3.5-turbo-instruct  |
| …        |           |         |          |                 |                                |
```