[Clean up] Move evaluation configs under model directories
Currently, `evaluation.yaml` lives under the top-level `configs/` directory. Initially, we just wanted to showcase this recipe as an example, but evaluation is a core part of the finetuning process and should therefore mirror the pattern we've established for our other configs, which reside under model-specific directories.
The change for each model directory consists of four steps:
- Copy `evaluation.yaml` into whichever model directory you are focused on.
- Update the defaults from `llama2` to the current model's defaults.
- Update `_recipe_registry.py` so the new YAML file can be found with the following command: `tune run eleuther_eval --config MODEL/evaluation`
- Put up a PR with output from running the evaluation script. Here's an example for Qwen2: #1809
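The registry update in step 3 can be sketched as follows. Note this is a simplified stand-in, not the actual file: the `Config` dataclass below mimics the shape of the entries in torchtune's `recipes/_recipe_registry.py`, and the exact field names are assumptions.

```python
# Hypothetical sketch of step 3: registering a model-specific evaluation
# config so `tune run eleuther_eval --config gemma/evaluation` resolves.
# The Config dataclass here is a simplified stand-in for the registry's.
from dataclasses import dataclass

@dataclass
class Config:
    name: str        # what users pass to `tune run ... --config <name>`
    file_path: str   # path relative to the configs/ directory

# Under the existing eleuther_eval recipe entry, add one Config per model:
eleuther_eval_configs = [
    Config(name="eleuther_evaluation", file_path="eleuther_evaluation.yaml"),
    # New model-specific entry, e.g. for Gemma:
    Config(name="gemma/evaluation", file_path="gemma/evaluation.yaml"),
]

# The CLI resolves --config values by name, roughly like this:
lookup = {c.name: c.file_path for c in eleuther_eval_configs}
print(lookup["gemma/evaluation"])  # gemma/evaluation.yaml
```

If the entry is missing, `tune run` cannot find the new YAML file, which is why step 3 must accompany the config copy in the same PR.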
If multiple sizes of a model exist in the directory, select the most commonly used one. This is up for interpretation, but ~7B params is typically standard. We want to give a good example without proliferating configs for every model SIZE.
Checklist:
- [ ] Llama2
- [x] Code-Llama2 (#2209, thanks @ReemaAlzaid)
- [ ] Llama3
- [ ] Llama3.1
- [x] Llama3.2 (#2186, thanks @ReemaAlzaid)
- [x] Llama3.2V
- [x] Mistral (#1829, thanks @Yousof-kayal)
- [x] Phi3 (#1822, thanks @Harthi7)
- [x] Gemma (#1819, thanks @malinjawi)
- [ ] Gemma2
- [x] Qwen2
- [x] Qwen2.5 (#2230, thanks @Ankur-singh)
After all of these are completed, we will deprecate the `evaluation.yaml` configs in the base `configs/` directory.
Thanks, everyone, for your help! 🎉
Hey @joecummings, I am new to torchtune, but I am interested in contributing to this issue.
edit: I will be working on Gemma for now
Hello @joecummings, I'm a new dev starting out with open-source contribution. I wanted to let you know that I've reviewed the problem you've reported and I'll do my best to address it.
Model: Mistral
Thanks so much @malinjawi and @Yousof-kayal - Can you update your comments with which models you're planning on addressing?
Happy to review any PRs you put up :)
Hello @joecummings, I am also working on this issue and plan to work on Phi3.
@joecummings To run the last command (`tune run eleuther_eval --config MODEL/evaluation`), I had to modify the values for `checkpoint_dir`, `output_dir`, and `path` in the YAML file to match my file locations. Should I revert them back to `/tmp/Phi-3-mini-4k-instruct`?
> @joecummings To run the last command (`tune run eleuther_eval --config MODEL/evaluation`), I had to modify the values for `checkpoint_dir`, `output_dir`, and `path` in the YAML file to match my file locations. Should I revert them back to `/tmp/Phi-3-mini-4k-instruct`?
Yes, please use the default model locations for now!
Hey @joecummings, I also noticed that I needed to change the model location when running the command. I was able to run it, though. Is this good for a PR? Do you have any suggestions if we're on the wrong track?
(torchtune_rosetta) linjaboy@Mohammads-MacBook-Pro 9cf48e52b224239de00d483ec8eb84fb8d0f3a3a % tune run eleuther_eval --config gemma/evaluation
W1012 01:59:27.011000 8088854208 torch/distributed/elastic/multiprocessing/redirects.py:28] NOTE: Redirects are currently not supported in Windows or MacOs.
INFO:torchtune.utils._logging:Running EleutherEvalRecipe with resolved config:
batch_size: 8
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/gemma-2b/models--google--gemma-2b/snapshots/9cf48e52b224239de00d483ec8eb84fb8d0f3a3a
checkpoint_files:
- model-00001-of-00002.safetensors
- model-00002-of-00002.safetensors
model_type: GEMMA
output_dir: ./
device: cpu
dtype: bf16
enable_kv_cache: true
limit: null
max_seq_length: 4096
model:
_component_: torchtune.models.gemma.gemma_2b
quantizer: null
seed: 1234
tasks:
- truthfulqa_mc2
tokenizer:
_component_: torchtune.models.gemma.gemma_tokenizer
path: /tmp/gemma-2b/models--google--gemma-2b/snapshots/9cf48e52b224239de00d483ec8eb84fb8d0f3a3a/tokenizer.model
INFO:torchtune.utils._logging:Model is initialized with precision torch.bfloat16.
config.json: 100%|███████████████████████████████████████████████████████████████████████████| 665/665 [00:00<00:00, 1.06MB/s]
tokenizer_config.json: 100%|████████████████████████████████████████████████████████████████| 26.0/26.0 [00:00<00:00, 119kB/s]
vocab.json: 100%|████████████████████████████████████████████████████████████████████████| 1.04M/1.04M [00:00<00:00, 1.14MB/s]
merges.txt: 100%|██████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 1.57MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████| 1.36M/1.36M [00:00<00:00, 2.14MB/s]
model.safetensors: 100%|███████████████████████████████████████████████████████████████████| 548M/548M [01:32<00:00, 5.95MB/s]
generation_config.json: 100%|█████████████████████████████████████████████████████████████████| 124/124 [00:00<00:00, 712kB/s]
README.md: 100%|█████████████████████████████████████████████████████████████████████████| 9.59k/9.59k [00:00<00:00, 5.65MB/s]
validation-00000-of-00001.parquet: 100%|███████████████████████████████████████████████████| 271k/271k [00:00<00:00, 2.79MB/s]
Generating validation split: 100%|████████████████████████████████████████████████| 817/817 [00:00<00:00, 26416.89 examples/s]
INFO:torchtune.utils._logging:Running evaluation on the following tasks: ['truthfulqa_mc2']
INFO:lm-eval:Building contexts for truthfulqa_mc2 on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████| 817/817 [00:00<00:00, 2121.83it/s]
INFO:lm-eval:Running loglikelihood requests
Running loglikelihood requests: 100%|███████████████████████████████████████████████████| 5882/5882 [8:26:40<00:00, 5.17s/it]
INFO:torchtune.utils._logging:Eval completed in 30404.48 seconds.
INFO:torchtune.utils._logging:Max memory allocated: 0.00 GB
INFO:torchtune.utils._logging:
| Tasks |Version|Filter|n-shot|Metric| |Value | |Stderr|
|--------------|------:|------|-----:|------|---|-----:|---|-----:|
|truthfulqa_mc2| 2|none | 0|acc |↑ |0.3995|± |0.0152|
This is the log after I ran the command `tune run eleuther_eval --config MODEL/evaluation`:
(torchtune) Abdullahs-MacBook-Pro:Phi-3-mini-4k-instruct abdullah$ tune run eleuther_eval --config phi3/evaluation
W1012 00:32:22.842000 8517586752 torch/distributed/elastic/multiprocessing/redirects.py:28] NOTE: Redirects are currently not supported in Windows or MacOs.
INFO:torchtune.utils._logging:Running EleutherEvalRecipe with resolved config:
batch_size: 8
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/Phi-3-mini-4k-instruct/models--microsoft--Phi-3-mini-4k-instruct/snapshots/0a67737cc96d2554230f90338b163bc6380a2a85
checkpoint_files:
- model-00001-of-00002.safetensors
- model-00002-of-00002.safetensors
model_type: PHI3_MINI
output_dir: /tmp/Phi-3-mini-4k-instruct/models--microsoft--Phi-3-mini-4k-instruct/snapshots/0a67737cc96d2554230f90338b163bc6380a2a85
recipe_checkpoint: null
device: cpu
dtype: bf16
enable_kv_cache: true
limit: null
max_seq_length: 4096
model:
_component_: torchtune.models.phi3.phi3_mini
quantizer: null
resume_from_checkpoint: false
seed: 1234
tasks:
- truthfulqa_mc2
tokenizer:
_component_: torchtune.models.phi3.phi3_mini_tokenizer
max_seq_len: null
path: /tmp/Phi-3-mini-4k-instruct/tokenizer.model
INFO:torchtune.utils._logging:Converting Phi-3 Mini weights from HF format.Note that conversion of adapter weights into PEFT format is not supported.
INFO:torchtune.utils._logging:Model is initialized with precision torch.bfloat16.
config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 665/665 [00:00<00:00, 769kB/s]
tokenizer_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 26.0/26.0 [00:00<00:00, 106kB/s]
vocab.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.04M/1.04M [00:00<00:00, 1.16MB/s]
merges.txt: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 1.26MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.36M/1.36M [00:00<00:00, 1.90MB/s]
model.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 548M/548M [03:43<00:00, 2.45MB/s]
generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 124/124 [00:00<00:00, 2.14MB/s]
README.md: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.59k/9.59k [00:00<00:00, 16.1MB/s]
validation-00000-of-00001.parquet: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 271k/271k [00:00<00:00, 1.36MB/s]
Generating validation split: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 817/817 [00:00<00:00, 45492.21 examples/s]
INFO:torchtune.utils._logging:Running evaluation on the following tasks: ['truthfulqa_mc2']
INFO:lm-eval:Building contexts for truthfulqa_mc2 on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 817/817 [00:00<00:00, 2446.11it/s]
INFO:lm-eval:Running loglikelihood requests
Running loglikelihood requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 5882/5882 [22:42:29<00:00, 13.90s/it]
INFO:torchtune.utils._logging:Eval completed in 81751.63 seconds.
INFO:torchtune.utils._logging:Max memory allocated: 0.00 GB
INFO:torchtune.utils._logging:
| Tasks |Version|Filter|n-shot|Metric| |Value | |Stderr|
|--------------|------:|------|-----:|------|---|-----:|---|-----:|
|truthfulqa_mc2| 2|none | 0|acc |↑ |0.5456|± |0.0151|
I have made the changes and ran tune run eleuther_eval --config MODEL/evaluation. The command runs, but since my machine does not meet the performance requirements, it kills the evaluation early.
Is it okay to PR or should I test with another machine? @joecummings
> I have made the changes and ran `tune run eleuther_eval --config MODEL/evaluation`. The command runs, but since my machine does not meet the performance requirements, it kills the evaluation early. Is it okay to PR or should I test with another machine? @joecummings
Go ahead and put up the PR and I'll make sure to double check on a beefier machine.
Hey @joecummings, I'd like to work on Llama3.2 and possibly Code-Llama2. For each model, should I evaluate both parameter variants (e.g, Llama3.2 has 1B and 3B parameters)? Or is there a specific focus on one configuration for evaluation?
> Hey @joecummings, I'd like to work on Llama3.2 and possibly Code-Llama2. For each model, should I evaluate both parameter variants (e.g., Llama3.2 has 1B and 3B parameters)? Or is there a specific focus on one configuration for evaluation?
Hey @ReemaAlzaid - sorry for the late reply!! Would love for you to work on this. For Llama3.2, please just create an evaluation config for the 3B model.
Hey @joecummings Kindly review this PR https://github.com/pytorch/torchtune/pull/2186 for the Llama3.2 3B eval config. I'm more than happy to take on the remaining models and submit a PR for them
> Hey @joecummings Kindly review this PR #2186 for the Llama3.2 3B eval config. I'm more than happy to take on the remaining models and submit a PR for them
Approved! Would love for you to take on the remainder - tag me on any reviews you need :)
Hello @joecummings, I'm planning to work on Qwen2.5 model.
Model: Qwen2.5
Hey @joecummings, here is the output for `tune run eleuther_eval --config qwen2_5/evaluation`:
Running EleutherEvalRecipe with resolved config:
batch_size: 8
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/Qwen2_5-0_5B-Instruct
checkpoint_files:
- model.safetensors
model_type: QWEN2
output_dir: ./
device: cuda
dtype: bf16
enable_kv_cache: true
limit: null
max_seq_length: 4096
model:
_component_: torchtune.models.qwen2_5.qwen2_5_0_5b
output_dir: ./
quantizer: null
seed: 1234
tasks:
- truthfulqa_mc2
tokenizer:
_component_: torchtune.models.qwen2_5.qwen2_5_tokenizer
max_seq_len: null
merges_file: /tmp/Qwen2_5-0_5B-Instruct/merges.txt
path: /tmp/Qwen2_5-0_5B-Instruct/vocab.json
2025-01-04:12:21:56,964 INFO [_utils.py:28] Running EleutherEvalRecipe with resolved config:
batch_size: 8
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/Qwen2_5-0_5B-Instruct
checkpoint_files:
- model.safetensors
model_type: QWEN2
output_dir: ./
device: cuda
dtype: bf16
enable_kv_cache: true
limit: null
max_seq_length: 4096
model:
_component_: torchtune.models.qwen2_5.qwen2_5_0_5b
output_dir: ./
quantizer: null
seed: 1234
tasks:
- truthfulqa_mc2
tokenizer:
_component_: torchtune.models.qwen2_5.qwen2_5_tokenizer
max_seq_len: null
merges_file: /tmp/Qwen2_5-0_5B-Instruct/merges.txt
path: /tmp/Qwen2_5-0_5B-Instruct/vocab.json
Model is initialized with precision torch.bfloat16.
2025-01-04:12:21:57,949 INFO [eleuther_eval.py:503] Model is initialized with precision torch.bfloat16.
2025-01-04:12:21:58,119 INFO [huggingface.py:132] Using device 'cuda:0'
config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 665/665 [00:00<00:00, 4.51MB/s]
tokenizer_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 26.0/26.0 [00:00<00:00, 188kB/s]
vocab.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.04M/1.04M [00:00<00:00, 4.60MB/s]
merges.txt: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 6.27MB/s]
tokenizer.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.36M/1.36M [00:00<00:00, 6.35MB/s]
2025-01-04:12:22:00,097 INFO [huggingface.py:369] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'}
model.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉| 548M/548M [01:09<00:00, 7.93MB/s]
generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 124/124 [00:00<00:00, 914kB/s]
README.md: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.59k/9.59k [00:00<00:00, 73.7MB/s]
validation-00000-of-00001.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 271k/271k [00:00<00:00, 9.24MB/s]
Generating validation split: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 817/817 [00:00<00:00, 43743.65 examples/s]
Running evaluation on the following tasks: ['truthfulqa_mc2']
2025-01-04:12:23:20,152 INFO [eleuther_eval.py:540] Running evaluation on the following tasks: ['truthfulqa_mc2']
2025-01-04:12:23:20,154 INFO [task.py:415] Building contexts for truthfulqa_mc2 on rank 0...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 817/817 [00:00<00:00, 1699.19it/s]
2025-01-04:12:23:20,661 INFO [evaluator.py:496] Running loglikelihood requests
Running loglikelihood requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5882/5882 [08:03<00:00, 12.16it/s]
Eval completed in 486.15 seconds.
2025-01-04:12:31:26,298 INFO [eleuther_eval.py:549] Eval completed in 486.15 seconds.
Max memory allocated: 10.00 GB
2025-01-04:12:31:26,298 INFO [eleuther_eval.py:550] Max memory allocated: 10.00 GB
| Tasks |Version|Filter|n-shot|Metric| |Value | |Stderr|
|--------------|------:|------|-----:|------|---|-----:|---|-----:|
|truthfulqa_mc2| 2|none | 0|acc |↑ |0.4178|± |0.0146|
2025-01-04:12:31:26,387 INFO [eleuther_eval.py:554]
| Tasks |Version|Filter|n-shot|Metric| |Value | |Stderr|
|--------------|------:|------|-----:|------|---|-----:|---|-----:|
|truthfulqa_mc2| 2|none | 0|acc |↑ |0.4178|± |0.0146|
Hey @joecummings, I would like to pick up Llama3.1. Thanks for the opportunity to contribute.
Hey @joecummings, I raised a PR for Llama3.1: https://github.com/pytorch/torchtune/pull/2763. Please take a look at your convenience.