[Clean up] Move evaluation configs under model directories
Currently, `evaluation.yaml` lives under the top-level `configs/` directory. Initially, we just wanted to showcase this recipe as an example, but evaluation is a core part of the finetuning process and should therefore mirror the pattern we've established for our other configs, which reside under model-specific directories.
The change for each model directory consists of four steps:
- Copy `evaluation.yaml` into whichever model directory you are focused on.
- Update the defaults from `llama2` to the current model's defaults.
- Update `_recipe_registry.py` so the new YAML file can be found with the following command: `tune run eleuther_eval --config MODEL/evaluation`
- Put up a PR with output from running the evaluation script. Here's an example for Qwen2: #1809
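The registry update in step 3 can be sketched as follows. Note this is a simplified stand-in, not the actual file: the `Config` dataclass below mimics the shape of the entries in torchtune's `recipes/_recipe_registry.py`, and the exact field names are assumptions.

```python
# Hypothetical sketch of step 3: registering a model-specific evaluation
# config so `tune run eleuther_eval --config gemma/evaluation` resolves.
# The Config dataclass here is a simplified stand-in for the registry's.
from dataclasses import dataclass

@dataclass
class Config:
    name: str        # what users pass to `tune run ... --config <name>`
    file_path: str   # path relative to the configs/ directory

# Under the existing eleuther_eval recipe entry, add one Config per model:
eleuther_eval_configs = [
    Config(name="eleuther_evaluation", file_path="eleuther_evaluation.yaml"),
    # New model-specific entry, e.g. for Gemma:
    Config(name="gemma/evaluation", file_path="gemma/evaluation.yaml"),
]

# The CLI resolves --config values by name, roughly like this:
lookup = {c.name: c.file_path for c in eleuther_eval_configs}
print(lookup["gemma/evaluation"])  # gemma/evaluation.yaml
```

If the entry is missing, `tune run` cannot find the new YAML file, which is why step 3 must accompany the config copy in the same PR.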
If multiple sizes of a model exist in the directory, select the most commonly used one. This is up for interpretation, but ~7B params is typically standard. We want to give a good example without proliferating configs for every model SIZE.
Checklist:
- [ ] Llama2
- [x] Code-Llama2 (#2209, thanks @ReemaAlzaid)
- [ ] Llama3
- [ ] Llama3.1
- [x] Llama3.2 (#2186, thanks @ReemaAlzaid)
- [x] Llama3.2V
- [x] Mistral (#1829, thanks @Yousof-kayal)
- [x] Phi3 (#1822, thanks @Harthi7)
- [x] Gemma (#1819, thanks @malinjawi)
- [ ] Gemma2
- [x] Qwen2
- [x] Qwen2.5 (#2230, thanks @Ankur-singh)
After all of these are completed, we will deprecate the `evaluation.yaml` configs in the base `configs/` directory.
Thanks, everyone, for your help! 🎉
Hey @joecummings, I am new to torchtune, but I am interested in contributing to this issue.
edit: I will be working on Gemma for now
Hello @joecummings, I'm a new dev starting out with open-source contribution. I wanted to let you know that I've reviewed the problem you've reported and I'll do my best to address it.
Model: Mistral
Thanks so much @malinjawi and @Yousof-kayal - Can you update your comments with which models you're planning on addressing?
Happy to review any PRs you put up :)
Hello @joecummings, I am also working on this issue and plan to work on Phi3.
@joecummings To run the last command (`tune run eleuther_eval --config MODEL/evaluation`), I had to modify the values for `checkpoint_dir`, `output_dir`, and `path` in the YAML file to match my file locations. Should I revert them back to `/tmp/Phi-3-mini-4k-instruct`?
> @joecummings To run the last command (`tune run eleuther_eval --config MODEL/evaluation`), I had to modify the values for `checkpoint_dir`, `output_dir`, and `path` in the YAML file to match my file locations. Should I revert them back to `/tmp/Phi-3-mini-4k-instruct`?
Yes, please use the default model locations for now!
Hey @joecummings, I also noticed that I needed to change the model location when running the command. I was able to run it, though. Is this good for a PR? Do you have any suggestions if we're on the wrong track?
(torchtune_rosetta) linjaboy@Mohammads-MacBook-Pro 9cf48e52b224239de00d483ec8eb84fb8d0f3a3a % tune run eleuther_eval --config gemma/evaluation
W1012 01:59:27.011000 8088854208 torch/distributed/elastic/multiprocessing/redirects.py:28] NOTE: Redirects are currently not supported in Windows or MacOs.
INFO:torchtune.utils._logging:Running EleutherEvalRecipe with resolved config:
batch_size: 8
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/gemma-2b/models--google--gemma-2b/snapshots/9cf48e52b224239de00d483ec8eb84fb8d0f3a3a
checkpoint_files:
- model-00001-of-00002.safetensors
- model-00002-of-00002.safetensors
model_type: GEMMA
output_dir: ./
device: cpu
dtype: bf16
enable_kv_cache: true
limit: null
max_seq_length: 4096
model:
_component_: torchtune.models.gemma.gemma_2b
quantizer: null
seed: 1234
tasks:
- truthfulqa_mc2
tokenizer:
_component_: torchtune.models.gemma.gemma_tokenizer
path: /tmp/gemma-2b/models--google--gemma-2b/snapshots/9cf48e52b224239de00d483ec8eb84fb8d0f3a3a/tokenizer.model
INFO:torchtune.utils._logging:Model is initialized with precision torch.bfloat16.
config.json: 100%|███████████████████████████████████████████████████████████████████████████| 665/665 [00:00<00:00, 1.06MB/s]
tokenizer_config.json: 100%|████████████████████████████████████████████████████████████████| 26.0/26.0 [00:00<00:00, 119kB/s]
vocab.json: 100%|████████████████████████████████████████████████████████████████████████| 1.04M/1.04M [00:00<00:00, 1.14MB/s]
merges.txt: 100%|██████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 1.57MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████| 1.36M/1.36M [00:00<00:00, 2.14MB/s]
model.safetensors: 100%|███████████████████████████████████████████████████████████████████| 548M/548M [01:32<00:00, 5.95MB/s]
generation_config.json: 100%|█████████████████████████████████████████████████████████████████| 124/124 [00:00<00:00, 712kB/s]
README.md: 100%|█████████████████████████████████████████████████████████████████████████| 9.59k/9.59k [00:00<00:00, 5.65MB/s]
validation-00000-of-00001.parquet: 100%|███████████████████████████████████████████████████| 271k/271k [00:00<00:00, 2.79MB/s]
Generating validation split: 100%|████████████████████████████████████████████████| 817/817 [00:00<00:00, 26416.89 examples/s]
INFO:torchtune.utils._logging:Running evaluation on the following tasks: ['truthfulqa_mc2']
INFO:lm-eval:Building contexts for truthfulqa_mc2 on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████| 817/817 [00:00<00:00, 2121.83it/s]
INFO:lm-eval:Running loglikelihood requests
Running loglikelihood requests: 100%|███████████████████████████████████████████████████| 5882/5882 [8:26:40<00:00, 5.17s/it]
INFO:torchtune.utils._logging:Eval completed in 30404.48 seconds.
INFO:torchtune.utils._logging:Max memory allocated: 0.00 GB
INFO:torchtune.utils._logging:
| Tasks |Version|Filter|n-shot|Metric| |Value | |Stderr|
|--------------|------:|------|-----:|------|---|-----:|---|-----:|
|truthfulqa_mc2| 2|none | 0|acc |↑ |0.3995|± |0.0152|
This is the log after I ran the command `tune run eleuther_eval --config MODEL/evaluation`:
(torchtune) Abdullahs-MacBook-Pro:Phi-3-mini-4k-instruct abdullah$ tune run eleuther_eval --config phi3/evaluation
W1012 00:32:22.842000 8517586752 torch/distributed/elastic/multiprocessing/redirects.py:28] NOTE: Redirects are currently not supported in Windows or MacOs.
INFO:torchtune.utils._logging:Running EleutherEvalRecipe with resolved config:
batch_size: 8
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/Phi-3-mini-4k-instruct/models--microsoft--Phi-3-mini-4k-instruct/snapshots/0a67737cc96d2554230f90338b163bc6380a2a85
checkpoint_files:
- model-00001-of-00002.safetensors
- model-00002-of-00002.safetensors
model_type: PHI3_MINI
output_dir: /tmp/Phi-3-mini-4k-instruct/models--microsoft--Phi-3-mini-4k-instruct/snapshots/0a67737cc96d2554230f90338b163bc6380a2a85
recipe_checkpoint: null
device: cpu
dtype: bf16
enable_kv_cache: true
limit: null
max_seq_length: 4096
model:
_component_: torchtune.models.phi3.phi3_mini
quantizer: null
resume_from_checkpoint: false
seed: 1234
tasks:
- truthfulqa_mc2
tokenizer:
_component_: torchtune.models.phi3.phi3_mini_tokenizer
max_seq_len: null
path: /tmp/Phi-3-mini-4k-instruct/tokenizer.model
INFO:torchtune.utils._logging:Converting Phi-3 Mini weights from HF format.Note that conversion of adapter weights into PEFT format is not supported.
INFO:torchtune.utils._logging:Model is initialized with precision torch.bfloat16.
config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 665/665 [00:00<00:00, 769kB/s]
tokenizer_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 26.0/26.0 [00:00<00:00, 106kB/s]
vocab.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.04M/1.04M [00:00<00:00, 1.16MB/s]
merges.txt: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 1.26MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.36M/1.36M [00:00<00:00, 1.90MB/s]
model.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 548M/548M [03:43<00:00, 2.45MB/s]
generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 124/124 [00:00<00:00, 2.14MB/s]
README.md: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.59k/9.59k [00:00<00:00, 16.1MB/s]
validation-00000-of-00001.parquet: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 271k/271k [00:00<00:00, 1.36MB/s]
Generating validation split: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 817/817 [00:00<00:00, 45492.21 examples/s]
INFO:torchtune.utils._logging:Running evaluation on the following tasks: ['truthfulqa_mc2']
INFO:lm-eval:Building contexts for truthfulqa_mc2 on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 817/817 [00:00<00:00, 2446.11it/s]
INFO:lm-eval:Running loglikelihood requests
Running loglikelihood requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 5882/5882 [22:42:29<00:00, 13.90s/it]
INFO:torchtune.utils._logging:Eval completed in 81751.63 seconds.
INFO:torchtune.utils._logging:Max memory allocated: 0.00 GB
INFO:torchtune.utils._logging:
| Tasks |Version|Filter|n-shot|Metric| |Value | |Stderr|
|--------------|------:|------|-----:|------|---|-----:|---|-----:|
|truthfulqa_mc2| 2|none | 0|acc |↑ |0.5456|± |0.0151|
I have made the changes and ran tune run eleuther_eval --config MODEL/evaluation. The command runs, but since my machine does not meet the performance requirements, it kills the evaluation early.
Is it okay to PR or should I test with another machine? @joecummings
> I have made the changes and ran `tune run eleuther_eval --config MODEL/evaluation`. The command runs, but since my machine does not meet the performance requirements, it kills the evaluation early. Is it okay to PR or should I test with another machine? @joecummings
Go ahead and put up the PR and I'll make sure to double check on a beefier machine.
Hey @joecummings, I'd like to work on Llama3.2 and possibly Code-Llama2. For each model, should I evaluate both parameter variants (e.g, Llama3.2 has 1B and 3B parameters)? Or is there a specific focus on one configuration for evaluation?
> Hey @joecummings, I'd like to work on Llama3.2 and possibly Code-Llama2. For each model, should I evaluate both parameter variants (e.g., Llama3.2 has 1B and 3B parameters)? Or is there a specific focus on one configuration for evaluation?
Hey @ReemaAlzaid - sorry for the late reply!! Would love for you to work on this. For Llama3.2, please just create an evaluation config for the 3B model.
Hey @joecummings Kindly review this PR https://github.com/pytorch/torchtune/pull/2186 for the Llama3.2 3B eval config. I'm more than happy to take on the remaining models and submit a PR for them
> Hey @joecummings Kindly review this PR #2186 for the Llama3.2 3B eval config. I'm more than happy to take on the remaining models and submit a PR for them
Approved! Would love for you to take on the remainder - tag me on any reviews you need :)
Hello @joecummings, I'm planning to work on Qwen2.5 model.
Model: Qwen2.5
Hey @joecummings, here is the output for `tune run eleuther_eval --config qwen2_5/evaluation`:
Running EleutherEvalRecipe with resolved config:
batch_size: 8
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/Qwen2_5-0_5B-Instruct
checkpoint_files:
- model.safetensors
model_type: QWEN2
output_dir: ./
device: cuda
dtype: bf16
enable_kv_cache: true
limit: null
max_seq_length: 4096
model:
_component_: torchtune.models.qwen2_5.qwen2_5_0_5b
output_dir: ./
quantizer: null
seed: 1234
tasks:
- truthfulqa_mc2
tokenizer:
_component_: torchtune.models.qwen2_5.qwen2_5_tokenizer
max_seq_len: null
merges_file: /tmp/Qwen2_5-0_5B-Instruct/merges.txt
path: /tmp/Qwen2_5-0_5B-Instruct/vocab.json
2025-01-04:12:21:56,964 INFO [_utils.py:28] Running EleutherEvalRecipe with resolved config:
batch_size: 8
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/Qwen2_5-0_5B-Instruct
checkpoint_files:
- model.safetensors
model_type: QWEN2
output_dir: ./
device: cuda
dtype: bf16
enable_kv_cache: true
limit: null
max_seq_length: 4096
model:
_component_: torchtune.models.qwen2_5.qwen2_5_0_5b
output_dir: ./
quantizer: null
seed: 1234
tasks:
- truthfulqa_mc2
tokenizer:
_component_: torchtune.models.qwen2_5.qwen2_5_tokenizer
max_seq_len: null
merges_file: /tmp/Qwen2_5-0_5B-Instruct/merges.txt
path: /tmp/Qwen2_5-0_5B-Instruct/vocab.json
Model is initialized with precision torch.bfloat16.
2025-01-04:12:21:57,949 INFO [eleuther_eval.py:503] Model is initialized with precision torch.bfloat16.
2025-01-04:12:21:58,119 INFO [huggingface.py:132] Using device 'cuda:0'
config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 665/665 [00:00<00:00, 4.51MB/s]
tokenizer_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 26.0/26.0 [00:00<00:00, 188kB/s]
vocab.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.04M/1.04M [00:00<00:00, 4.60MB/s]
merges.txt: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 6.27MB/s]
tokenizer.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.36M/1.36M [00:00<00:00, 6.35MB/s]
2025-01-04:12:22:00,097 INFO [huggingface.py:369] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'}
model.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉| 548M/548M [01:09<00:00, 7.93MB/s]
generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 124/124 [00:00<00:00, 914kB/s]
README.md: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.59k/9.59k [00:00<00:00, 73.7MB/s]
validation-00000-of-00001.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 271k/271k [00:00<00:00, 9.24MB/s]
Generating validation split: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 817/817 [00:00<00:00, 43743.65 examples/s]
Running evaluation on the following tasks: ['truthfulqa_mc2']
2025-01-04:12:23:20,152 INFO [eleuther_eval.py:540] Running evaluation on the following tasks: ['truthfulqa_mc2']
2025-01-04:12:23:20,154 INFO [task.py:415] Building contexts for truthfulqa_mc2 on rank 0...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 817/817 [00:00<00:00, 1699.19it/s]
2025-01-04:12:23:20,661 INFO [evaluator.py:496] Running loglikelihood requests
Running loglikelihood requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5882/5882 [08:03<00:00, 12.16it/s]
Eval completed in 486.15 seconds.
2025-01-04:12:31:26,298 INFO [eleuther_eval.py:549] Eval completed in 486.15 seconds.
Max memory allocated: 10.00 GB
2025-01-04:12:31:26,298 INFO [eleuther_eval.py:550] Max memory allocated: 10.00 GB
| Tasks |Version|Filter|n-shot|Metric| |Value | |Stderr|
|--------------|------:|------|-----:|------|---|-----:|---|-----:|
|truthfulqa_mc2| 2|none | 0|acc |↑ |0.4178|± |0.0146|
2025-01-04:12:31:26,387 INFO [eleuther_eval.py:554]
| Tasks |Version|Filter|n-shot|Metric| |Value | |Stderr|
|--------------|------:|------|-----:|------|---|-----:|---|-----:|
|truthfulqa_mc2| 2|none | 0|acc |↑ |0.4178|± |0.0146|
Hey @joecummings, I would like to pick up Llama3.1. Thanks for the opportunity to contribute.
Hey @joecummings, I raised a PR for Llama3.1: https://github.com/pytorch/torchtune/pull/2763. Please take a look at your convenience.