Jeremy D
Run with amp_fp16:

| Benchmark | Subcategory | Accuracy | Number few shot | Model           |
|:----------|:------------|---------:|----------------:|:----------------|
| jeopardy  | Average     | 0.279767 |               0 | mosaicml/mpt-7b |
| …         |             |          |                 |                 |
Created model gauntlet. This PR makes a number of significant changes: it checks in 38 datasets and adds a callback that can compute model gauntlet scores from a large number...
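For a sense of what such an aggregation can look like, here is a minimal sketch (the function name, record schema, and single chance baseline of 0.25 are illustrative assumptions, not the PR's actual code — real benchmarks have different chance baselines):

```python
from collections import defaultdict

def compute_gauntlet_score(results, baseline=0.25):
    """Average per-benchmark accuracies into category scores and an overall
    score, rescaling so that chance performance maps to zero.

    results: records like
        {"category": "world_knowledge", "benchmark": "jeopardy", "accuracy": 0.28}
    """
    by_category = defaultdict(list)
    for r in results:
        # Rescale so chance maps to 0 and perfect accuracy maps to 1;
        # clamp below-chance results to 0.
        rescaled = (r["accuracy"] - baseline) / (1 - baseline)
        by_category[r["category"]].append(max(rescaled, 0.0))
    category_scores = {c: sum(v) / len(v) for c, v in by_category.items()}
    overall = sum(category_scores.values()) / len(category_scores)
    return category_scores, overall
```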
Confirmed fp16 is slightly better than bf16. I also edited the eval script to compute averages across benchmarks with sub-scores and to log the table results in markdown format...
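A rough sketch of the averaging-plus-markdown step (the `add_benchmark_averages`/`to_markdown` helpers and column names are hypothetical, chosen to mirror the table above):

```python
from collections import defaultdict

def add_benchmark_averages(rows):
    """Insert a synthetic 'Average' row ahead of each benchmark's sub-scores."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["Benchmark"]].append(row)
    out = []
    for bench, subs in grouped.items():
        avg = sum(r["Accuracy"] for r in subs) / len(subs)
        out.append({"Benchmark": bench, "Subcategory": "Average",
                    "Accuracy": round(avg, 6)})
        out.extend(subs)
    return out

def to_markdown(rows, headers=("Benchmark", "Subcategory", "Accuracy")):
    """Render rows as a GitHub-flavored markdown table."""
    lines = ["| " + " | ".join(headers) + " |",
             "|" + "|".join(":---" for _ in headers) + "|"]
    for row in rows:
        lines.append("| " + " | ".join(str(row.get(h, "")) for h in headers) + " |")
    return "\n".join(lines)
```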
I pulled the test data linked in the README, and I noticed that within each category there is basically never an even 25% split between A, B, C, and D...
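The check itself is a quick label count; something like this (the JSONL format and the `answer` key are assumptions about the dataset schema):

```python
import json
from collections import Counter

def answer_distribution(path):
    """Print the share of each gold answer letter in a JSONL MC dataset."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            counts[json.loads(line)["answer"]] += 1
    total = sum(counts.values())
    for label in sorted(counts):
        print(f"{label}: {counts[label] / total:.1%}")
```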
# What does this PR do?

We are migrating ICL tasks from composer to foundry and need to deprecate the existing composer implementations. The migration PR is here: https://github.com/mosaicml/llm-foundry/pull/936

#...
# What does this PR do?

This PR removes/deprecates the ICL(Dataset|Metric) subclasses and migrates the relevant tests. It is not strictly necessary, but it would help prevent confusion about where...
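A common shape for this kind of deprecation, sketched under the assumption that the composer classes are kept as thin warning shims for one release (not the PR's literal diff):

```python
import warnings

class InContextLearningDataset:
    """Illustrative stand-in for the deprecated composer-side class."""

    def __init__(self, *args, **kwargs):
        # Point users at the llm-foundry implementation before removal.
        warnings.warn(
            "InContextLearningDataset has moved to llm-foundry; the composer "
            "implementation is deprecated and will be removed in a future release.",
            DeprecationWarning,
            stacklevel=2,
        )
```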
# What does this PR do?

This PR adds a callback that logs ICL outputs during eval. It modifies the custom metrics to keep track of incorrect model outputs. Each...
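A skeleton of how such a callback can hang off Composer's eval loop (the `incorrect_outputs` attribute on the metrics is a hypothetical hook here; the real metric API may differ):

```python
from composer.core import Callback, State
from composer.loggers import Logger

class EvalOutputLogging(Callback):
    """Sketch: after each eval batch, flush any incorrect outputs that the
    ICL metrics have buffered."""

    def eval_batch_end(self, state: State, logger: Logger) -> None:
        metrics = state.eval_metrics.get(state.dataloader_label, {})
        for name, metric in metrics.items():
            # Assumed hook: a list of wrong answers accumulated by the metric.
            outputs = getattr(metric, "incorrect_outputs", None)
            if outputs:
                logger.log_metrics({f"icl_outputs/{name}": list(outputs)})
                outputs.clear()
```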
# What does this PR do?

This PR introduces the execution prediction task. It is an auxiliary task, compatible with any code evaluation dataset, that requires the model to inspect...
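Roughly, turning a code-eval record into an execution prediction example could look like this (the field names mirror HumanEval-style schemas and are assumptions, not the PR's exact format):

```python
def make_execution_prediction_example(code: str, entry_point: str,
                                      test_input: str) -> dict:
    """Build a prompt asking the model to predict what the code evaluates to,
    rather than to generate the code itself."""
    prompt = (
        f"{code}\n\n"
        f"What does `{entry_point}({test_input})` evaluate to? "
        "Answer with the literal value only.\nAnswer:"
    )
    return {"prompt": prompt}
```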
The following solution and tests don't seem to match the description. Can you explain why we are bounding `lower` and `upper` to be between 2 and 8? `def generate_integers(a, b):...
OpenAI run: `api-eval-Ik2iMA`

```
| Category | Benchmark | Subtask | Accuracy | Number few shot | Model                          |
|:---------|:----------|:--------|---------:|:----------------|:-------------------------------|
|          | gsm8k     |         | 0.482942 | 0-shot          | openai/gpt-3.5-turbo-instruct  |
| …        |           |         |          |                 |                                |
```