
Support in-loop evals from oe-eval-internal request dump

OyvindTafjord opened this issue 6 months ago • 4 comments

This introduces a new downstream task class which reads a request dump directly from an oe-eval evaluation run. This allows any task configuration in oe-eval (for now, only "ranked classification" tasks) to be replicated as an in-loop eval. See some tentative instructions for how to set up tasks here.

The basic idea is to grab the request file from running oe-eval the normal way (alternatively, oe-eval can be run without a model to just save the requests) and save it under olmo_data/oe_eval_tasks/<task_name>/requests.jsonl. Then add a pointer to it in label_to_task_map in olmo/eval/downstream.py and reference the label in the training .yaml file.
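For illustration, the registration in label_to_task_map might look roughly like this (the class name OEEvalTask and its keyword arguments are my assumptions for this sketch, not necessarily the exact names in the PR):

    # Hypothetical sketch in olmo/eval/downstream.py; class name and kwargs are assumed.
    label_to_task_map.update({
        # Points the in-loop eval at olmo_data/oe_eval_tasks/copycolors_10way/requests.jsonl
        "copycolors_10way": (OEEvalTask, {"dataset_path": "copycolors_10way", "metric_type": "acc"}),
    })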

Possible future features could be:

  • Optionally skip the reference in label_to_task_map, and specify the path and metric directly in the training .yaml file (see the sketch after this list)
  • Allow paths in S3 as well as in olmo_data
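If the first of these lands, a training .yaml might specify a task inline along these lines (the field names here are speculative, invented for illustration):

    # Speculative sketch of a future inline task spec; field names are invented.
    - label: my_new_task
      type: downstream
      requests_path: s3://my-bucket/oe_eval_tasks/my_new_task/requests.jsonl
      metric_type: acc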

This would allow adding new tasks without changing any OLMo code: just save the requests to S3 and update the .yaml files.

An example task added in this PR is referenced in training .yaml files using:

  - label: copycolors_10way
    type: downstream

This is a task from @sarahwie with one hundred 10-way multiple-choice questions of the form:

    Question: A frog is green. What color is a frog?
    A. green
    B. black

(and so on through the 10 choices), to test basic MC capabilities.

There is also a 1000-question version called copycolors_xl_10way which cycles the answer choices (A->B, B->C, ...). This should have somewhat less noise than the 100-question one; a sketch of the cycling idea follows.
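A minimal illustration of the cycling (my own sketch, not the PR's code): each of the 100 base questions is repeated with the correct answer rotated through the 10 choice slots, yielding 1000 items.

    # Illustration of cycling answer choices (A->B, B->C, ...); not the PR's code.
    def cycle_choices(choices, shift):
        """Rotate the options so the correct answer's letter moves by `shift` slots."""
        n = len(choices)
        return [choices[(i - shift) % n] for i in range(n)]

    # A 10-way question cycled 10 times yields 10 variants; 100 questions -> 1000 items.
    choices = ["green", "black", "blue", "red", "white",
               "brown", "yellow", "purple", "orange", "gray"]
    variants = [cycle_choices(choices, s) for s in range(len(choices))]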

On the copycolors_xl_10way task, OLMo-7B-0424 scores 96%, while the 350B/400B/450B/500B/600B/700B/1T-token checkpoints score 10/10/55/22/28/45/78% respectively, showing the (somewhat bumpy) emergence of the capability.

I also added an arc_challenge_rc_0shot task to match the current arc_challenge task, to verify that they give identical numbers (wandb link): [image: wandb comparison of the two tasks]

OyvindTafjord · Aug 01 '24