Generating config.json in olmo_data/oe_eval_tasks for new evaluation tasks.
❓ The question
Hi, I'm trying to evaluate the model on new tasks beyond those specified in olmo/eval/downstream.py. To do this, I understand I need to add the corresponding task configuration files in olmo_data/oe_eval_tasks.
Could you please share the script or code you used to generate these evaluation task configurations?
Thanks in advance!
Hi, thanks for the question!
To generate the request files, you can use our eval repo: https://github.com/allenai/oe-eval.
You'll need to add zipped data in the same jsonl format (following the data that already exists there as an example). Then you can add a reference to it here: https://github.com/allenai/OLMo/blob/main/olmo/eval/downstream.py#L1611 so it can be used.
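For illustration, here's a rough sketch of packaging request data in that layout. The field names are placeholders, not the actual schema — mirror an existing task's config.json and requests.jsonl.gz for the real format:

```python
# Rough sketch of packaging request data for olmo_data/oe_eval_tasks.
# NOTE: the field names below are illustrative placeholders -- mirror an
# existing task's config.json and requests.jsonl.gz for the real schema.
import gzip
import json
from pathlib import Path

task_dir = Path("olmo_data/oe_eval_tasks/my_new_task/rc_0shot")  # hypothetical task dir
task_dir.mkdir(parents=True, exist_ok=True)

# One JSON object per line, gzipped, same as the existing requests.jsonl.gz files.
requests = [
    {
        "request_type": "loglikelihood",  # illustrative
        "doc_id": 0,
        "request": {"context": "Question: 2 + 2 = ?\nAnswer:", "continuation": " 4"},
        "label": 0,
    },
]
with gzip.open(task_dir / "requests.jsonl.gz", "wt", encoding="utf-8") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Minimal metadata file; copy the keys from an existing task's config.json.
(task_dir / "config.json").write_text(json.dumps({"task_name": "my_new_task"}, indent=2))
```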
Thanks for pointing out how to generate the request files.
Regarding the second step, I don't see references for evaluating generative tasks such as GSM8K at https://github.com/allenai/OLMo/blob/main/olmo/eval/downstream.py#L1611. It seems to be heavily focused on multiple-choice tasks.
Could you please point me to the implementation for generative task evaluation?
Thanks in advance!
There are a few things I can recommend, depending on what you're trying to do. If you simply want to evaluate some models, I'd recommend the oe-eval repository; the GSM8K eval is here: https://github.com/allenai/olmes/blob/main/oe_eval/tasks/oe_eval_tasks/gsm8k.py
Generally, the code you're looking at in the OLMo repo is used for evaluating models during training. We have generative tasks like GSM8K in our OLMo-core repo; for reference, see https://github.com/allenai/OLMo-core/blob/main/src/olmo_core/train/config.py#L107 and https://github.com/allenai/OLMo-in-loop-evals/tree/main/src/olmo_eval/oe_eval_tasks/gsm8k/gold_bpb_5shot
Hello!
I am trying to evaluate the models I have trained or fine-tuned using the code in this repository, rather than models from Hugging Face.
It seems that OLMES works well for evaluating models from Hugging Face, but I am unsure how to evaluate intermediate checkpoints of the OLMo-2 models downloaded from the URLs provided in the .csv files in the configs.
I noticed that the current repository includes a method to evaluate models on GSM8K (https://github.com/allenai/OLMo/tree/main/olmo_data/oe_eval_tasks/gsm8k/gold_bpb_5shot), but this is set up as a multiple-choice task rather than a generative task. This evaluation approach also differs from the setup described in the paper.
In summary, I want to evaluate the models I download from the checkpoint URLs and subsequently train or fine-tune on all the tasks presented in your paper. Could you please advise on how to proceed with this evaluation?
Thank you!
Hi, thanks for sharing that additional information!
The OLMES suite actually supports models on the Hugging Face Hub, as well as local (or remote) paths, so this should work for the evaluations you described.
To run the gsm8k generative task on your downloaded and/or fine-tuned version of the downloaded model (so long as they're in the right format), you can simply provide the local path to your model using this command:
```
olmes --model /path/to/local/model --task gsm8k::olmes --output-dir my-eval-dir1
```
Hi! Thanks for your reply.
However, I am trying to evaluate the model on generative tasks (e.g. GSM8K) during training, i.e. at every eval_steps interval. I assume you did the same thing during the development of the OLMo-2 models. Can you please share scripts or point me to how to do it?
No, we did not; we do not run generative GSM8K evaluations in-loop. The reason it appears as a multiple-choice task is that we evaluate perplexity over the human-written continuation, rather than having the model generate an answer and then extracting it. This allows us to evaluate quickly during training, but it gives us a measure of perplexity rather than the canonical task setup.
We pulled checkpoints during the training run and evaluated them offline (not during training) with OLMES.
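To make the distinction concrete, here is a rough sketch (not our in-loop implementation) of scoring the human-written GSM8K answer under a Hugging Face causal LM instead of generating one. The model name, question, and answer strings are placeholders, and the context/continuation token boundary is approximate:

```python
# Rough sketch (not the OLMo in-loop code): score the human-written answer
# under the model instead of generating one.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "allenai/OLMo-2-0425-1B"  # any HF-format causal LM path works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

context = "Question: If you have 3 boxes with 4 apples each, how many apples do you have?\nAnswer:"
gold = " There are 3 * 4 = 12 apples. The answer is 12."

ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
full_ids = tokenizer(context + gold, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(full_ids).logits

# Log-probability of each token given the tokens before it.
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
targets = full_ids[:, 1:]
token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

# Sum over the continuation tokens only (boundary is approximate, for illustration).
gold_loglik = token_lp[:, ctx_len - 1:].sum().item()
bits_per_byte = -gold_loglik / (len(gold.encode("utf-8")) * math.log(2))
print(f"gold loglikelihood: {gold_loglik:.2f}  bits per byte: {bits_per_byte:.3f}")
```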
Thanks for the clarification!
In my case, I modified the OLMo() model architecture and trained it using the OLMo pipeline (data, code, and hyperparameters).
As I understand it, olmes currently supports evaluation only for models listed in MODEL_CONFIGS, including their fine-tuned variants. Could you advise on the most straightforward way to evaluate a custom model that isn't included in MODEL_CONFIGS?
I really appreciate your help!
Hi, for using custom models, my message from last week still applies:
The OLMES suite actually supports models on the Hugging Face Hub, as well as local (or remote) paths, so this should work for the evaluations you described.
To run the gsm8k generative task on your downloaded and/or fine-tuned version of the downloaded model (so long as they're in the right format), you can simply provide the local path to your model using this command:
```
olmes --model /path/to/local/model --task gsm8k::olmes --output-dir my-eval-dir1
```
You do not need to register your custom model in the MODEL_CONFIGS dictionary. To evaluate your custom model, you can simply pass your own local or remote model path in to the command above. This will work so long as your model directory contains the expected files / is in the right format. If you haven't already, you'll likely need to convert your model to be in the expected format. You can find documentation for this process here: https://github.com/allenai/OLMo/blob/main/docs/Checkpoints.md
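As a quick, illustrative sanity check (the path is a placeholder), you can confirm the directory looks like a Hugging Face model folder before handing it to olmes:

```python
# Illustrative sanity check (path is a placeholder): confirm the converted
# checkpoint directory looks like a Hugging Face model folder.
import json
from pathlib import Path

ckpt = Path("/path/to/converted/checkpoint")

config = json.loads((ckpt / "config.json").read_text())
assert "model_type" in config, "config.json has no model_type -- conversion incomplete?"

has_weights = any(ckpt.glob("*.safetensors")) or any(ckpt.glob("pytorch_model*.bin"))
has_tokenizer = (ckpt / "tokenizer.json").exists() or (ckpt / "tokenizer_config.json").exists()
print(f"model_type={config['model_type']}  weights={has_weights}  tokenizer={has_tokenizer}")
```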
Hi!
I tried running the command you provided with the following arguments:
```
olmes --task arc_challenge::olmes \
    --output-dir eval_results/arc_challenge/olmo-1b-local-test \
    --model "/home/local/folder/checkpoints/OLMo-2-0425-1B/step1907359"
```
I downloaded the checkpoint from the URL provided in the OLMo-2-0425-1B.csv file.
However, I encountered the following error:
Model should have a model_type key in its config.json, or contain one of the following strings in its name: …
From the code, it looks like the supported model_type values are "hf", "vllm", or "litellm". So unless I'm missing something, it seems that custom models are currently not supported directly.
Could you please confirm this, or let me know the most straightforward way to evaluate a custom OLMo-based model (e.g. a modified OLMo() class)?
Thanks in advance!
Hi there, you're getting this error because you need to first convert your model into one of those formats, as I mentioned in my message above:
This will work so long as your model directory contains the expected files / is in the right format. If you haven't already, you'll likely need to convert your model to be in the expected format. You can find documentation for this process here: https://github.com/allenai/OLMo/blob/main/docs/Checkpoints.md
Likely, the best way to do this would be to convert them to Hugging Face ("hf") format. We have a script to convert OLMo 2 models to Hugging Face format, and you should then be able to pass the converted checkpoint directory to OLMES for evaluation.
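Once converted, a quick load with transformers (the path below is a placeholder) is an easy way to confirm the checkpoint is usable before pointing olmes at it:

```python
# Illustrative check (path is a placeholder): make sure the converted
# Hugging Face checkpoint loads and generates before running olmes on it.
from transformers import AutoModelForCausalLM, AutoTokenizer

converted_dir = "/path/to/converted/OLMo-2-0425-1B-hf"

tokenizer = AutoTokenizer.from_pretrained(converted_dir)
model = AutoModelForCausalLM.from_pretrained(converted_dir)

inputs = tokenizer("The capital of France is", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

If this loads and generates, the olmes command from earlier should work when pointed at the converted directory.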