Support in-loop evals from oe-eval-internal request dump
This introduces a new downstream task class which reads a request dump directly from an oe-eval evaluation run. This allows any task configuration in oe-eval (for now, only "ranked classification" tasks) to be replicated as an in-loop eval. See some tentative instructions for how to set up tasks here.
The basic idea is to grab the request file from running oe-eval the normal way (alternatively, one can run without a model to just save the requests) and save it under `olmo_data/oe_eval_tasks/<task_name>/requests.jsonl`. Then add a pointer to it in `label_to_task_map` in `olmo/eval/downstream.py`, and reference the new label in the training `.yaml` file.
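To make the flow concrete, here is a minimal sketch of how such a request dump could be read. The loader and the field names (`context`, `continuation`, `label`) are illustrative assumptions, not the actual oe-eval schema or the new task class in this PR:

```python
import json

def load_oe_eval_requests(path):
    """Load an oe-eval-style request dump (one JSON object per line).

    NOTE: the field names used by callers ("context", "continuation",
    "label") are illustrative assumptions, not the real oe-eval schema.
    """
    requests = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines in the dump
                requests.append(json.loads(line))
    return requests
```

A downstream task class could then iterate over these records, scoring each answer choice by its continuation log-likelihood in the usual ranked-classification way.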
Possible future features:
- Optionally skip the entry in `label_to_task_map`, and specify the path and metric directly in the training `.yaml` file
- Allow paths in S3 as well as in `olmo_data`

Together these would allow adding new tasks without changing any OLMo code, just by saving requests in S3 and updating the `.yaml` files.
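Sketched as a hypothetical config fragment (the `requests_path` and `metric` keys are invented here for illustration and are not implemented by this PR):

```yaml
# Hypothetical future syntax: point directly at a request dump in S3,
# with no label_to_task_map entry needed. Keys are illustrative only.
- label: my_new_task
  type: downstream
  requests_path: s3://my-bucket/oe_eval_tasks/my_new_task/requests.jsonl
  metric: acc
```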
The example task added in this PR is referenced in training `.yaml` files using:

```yaml
- label: copycolors_10way
  type: downstream
```
This is a task from @sarahwie with one hundred 10-way MC questions of the type `Question: A frog is green. What color is a frog?\n A. green\n B. black`, ..., to test basic MC capabilities.
There is also a version with 1000 questions called `copycolors_xl_10way`, which cycles the answer choices (A->B, B->C, ...). This should have somewhat less noise than the 100-question version.
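The choice cycling can be sketched roughly as follows; this is an illustrative helper, not the actual generation code for the task:

```python
def cycle_choices(choices, gold_index, shift):
    """Rotate the answer choices by `shift` positions, so the gold
    answer moves from letter A to B, B to C, and so on.

    Returns the rotated choice list and the new gold index.
    (Illustrative sketch of the cycling described above.)
    """
    n = len(choices)
    # Element originally at position i ends up at position (i + shift) % n.
    rotated = [choices[(i - shift) % n] for i in range(n)]
    return rotated, (gold_index + shift) % n
```

Cycling each question through all choice positions multiplies the question count and averages out position bias, which is why the xl variant should be less noisy.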
On the `copycolors_xl_10way` task, OLMo-7B-0424 scores 96%, while the 350B/400B/450B/500B/600B/700B/1T checkpoints score 10/10/55/22/28/45/78 respectively, showing the (somewhat bumpy) arrival of the capability.
I also added an `arc_challenge_rc_0shot` task to match the current `arc_challenge` task, to verify that they give identical numbers (wandb link):