
[Clean up] Move evaluation configs under model directories

Open joecummings opened this issue 1 year ago • 18 comments

Currently, evaluation.yaml lives under the top-level configs/ directory. Initially, we only wanted to showcase this recipe as an example, but evaluation is a core part of the finetuning process, so it should mirror the pattern we've established for other configs, which reside under model-specific directories.

The change for each model directory will consist of four steps:

  1. Copy evaluation.yaml into the directory of whichever model you are working on.
  2. Update the defaults from Llama2 to the current model's defaults.
  3. Update _recipe_registry.py so that the new YAML file can be found with the following command: tune run eleuther_eval --config MODEL/evaluation
  4. Put up a PR with output from running the evaluation script. Here's an example for Qwen2: #1809
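To make step 2 concrete, here is a sketch of what a model-specific config might look like. This hypothetical gemma/evaluation.yaml is assembled from the resolved config printed later in this thread; the exact paths, checkpoint filenames, and component names are assumptions and should be taken from the model's existing configs:

```yaml
# configs/gemma/evaluation.yaml -- illustrative sketch only; paths and
# defaults are assumptions based on the resolved config in this thread.
model:
  _component_: torchtune.models.gemma.gemma_2b

checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /tmp/gemma-2b/
  checkpoint_files:
    - model-00001-of-00002.safetensors
    - model-00002-of-00002.safetensors
  model_type: GEMMA
  output_dir: ./

tokenizer:
  _component_: torchtune.models.gemma.gemma_tokenizer
  path: /tmp/gemma-2b/tokenizer.model

device: cuda
dtype: bf16
seed: 1234

tasks: ["truthfulqa_mc2"]
limit: null
max_seq_length: 4096
batch_size: 8
enable_kv_cache: true
quantizer: null
```

The key edits relative to the llama2 version are the `model`, `tokenizer`, `checkpointer`, and `model_type` fields; the evaluation settings at the bottom can usually stay as-is.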

If multiple sizes of a model exist in the directory, select the most commonly used one. This is certainly up for interpretation, but ~7B params is typically the standard. We want to give a good example without proliferating configs for every model SIZE.
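For step 3, the registry change amounts to adding a Config entry for the new file under the eleuther_eval recipe. The actual dataclass and field names in torchtune's _recipe_registry.py may differ; the sketch below models the pattern with minimal stand-in dataclasses to show how `--config gemma/evaluation` resolves by name:

```python
from dataclasses import dataclass, field
from typing import List

# Minimal stand-ins for the Recipe/Config entries in torchtune's
# _recipe_registry.py (names and fields are assumptions, not the real API).
@dataclass
class Config:
    name: str       # what users pass to --config
    file_path: str  # path relative to the configs/ directory

@dataclass
class Recipe:
    name: str
    file_path: str
    configs: List[Config] = field(default_factory=list)

# The eleuther_eval recipe entry, with the new model-specific config
# registered alongside the legacy top-level one.
eleuther_eval = Recipe(
    name="eleuther_eval",
    file_path="eleuther_eval.py",
    configs=[
        Config(name="eleuther_evaluation", file_path="eleuther_evaluation.yaml"),
        Config(name="gemma/evaluation", file_path="gemma/evaluation.yaml"),
    ],
)

# `tune run eleuther_eval --config gemma/evaluation` looks the config up by name:
resolved = next(c for c in eleuther_eval.configs if c.name == "gemma/evaluation")
print(resolved.file_path)
```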

Checklist:

  • [ ] Llama2
  • [x] Code-Llama2 (#2209, thanks @ReemaAlzaid)
  • [ ] Llama3
  • [ ] Llama3.1
  • [x] Llama3.2 (#2186, thanks @ReemaAlzaid)
  • [x] Llama3.2V
  • [x] Mistral (#1829, thanks @Yousof-kayal)
  • [x] Phi3 (#1822, thanks @Harthi7)
  • [x] Gemma (#1819, thanks @malinjawi)
  • [ ] Gemma2
  • [x] Qwen2
  • [x] Qwen2.5 (#2230, thanks @Ankur-singh)

After all of these are completed, we will deprecate the evaluation.yaml configs in the base configs directory.

Thanks, everyone, for your help! 🎉

joecummings avatar Oct 11 '24 13:10 joecummings

Hey @joecummings, I am new to torchtune, but I am interested in contributing to this issue.

edit: I will be working on Gemma for now

malinjawi avatar Oct 11 '24 15:10 malinjawi

Hello @joecummings, I'm a new dev and am starting out with open source contribution. I wanted to let you know that I've reviewed the problem you've reported and I'll do my best to address it.

Model: Mistral

Yousof-kayal avatar Oct 11 '24 15:10 Yousof-kayal

Thanks so much @malinjawi and @Yousof-kayal - Can you update your comments with which models you're planning on addressing?

Happy to review any PRs you put up :)

joecummings avatar Oct 11 '24 17:10 joecummings

Hello @joecummings, I am also working on this issue and I plan to work on Phi3.

Harthi7 avatar Oct 11 '24 17:10 Harthi7

@joecummings To run the last command (tune run eleuther_eval --config MODEL/evaluation), I had to modify the values for checkpoint_dir, output_dir, and path in the YAML file to match my file locations. Should I revert them back to /tmp/Phi-3-mini-4k-instruct?

Harthi7 avatar Oct 11 '24 22:10 Harthi7

> @joecummings To run the last command (tune run eleuther_eval --config MODEL/evaluation), I had to modify the values for checkpoint_dir, output_dir, and path in the YAML file to match my file locations. Should I revert them back to /tmp/Phi-3-mini-4k-instruct?

yes please use the default model locations for now!
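For local testing without touching the committed YAML, torchtune's CLI accepts key=value overrides after --config, so the defaults can stay in the file. A sketch, with placeholder paths (the override keys assume the field names shown in the resolved configs in this thread):

```shell
# Keep the default /tmp/... paths in the YAML and override them locally:
tune run eleuther_eval --config phi3/evaluation \
  checkpointer.checkpoint_dir=/my/models/Phi-3-mini-4k-instruct \
  checkpointer.output_dir=/my/output \
  tokenizer.path=/my/models/Phi-3-mini-4k-instruct/tokenizer.model
```

This way the PR can ship the default locations while each contributor still runs the eval against wherever their checkpoints actually live.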

joecummings avatar Oct 12 '24 10:10 joecummings

> @joecummings To run the last command (tune run eleuther_eval --config MODEL/evaluation), I had to modify the values for checkpoint_dir, output_dir, and path in the YAML file to match my file locations. Should I revert them back to /tmp/Phi-3-mini-4k-instruct?

> yes please use the default model locations for now!

Hey @joecummings, I also noticed that I needed to change the model location when running the command. I was able to run it, though. Is this good for a PR? Do you have any suggestions if we're on the wrong track?

(torchtune_rosetta) linjaboy@Mohammads-MacBook-Pro 9cf48e52b224239de00d483ec8eb84fb8d0f3a3a % tune run eleuther_eval --config gemma/evaluation
W1012 01:59:27.011000 8088854208 torch/distributed/elastic/multiprocessing/redirects.py:28] NOTE: Redirects are currently not supported in Windows or MacOs.
INFO:torchtune.utils._logging:Running EleutherEvalRecipe with resolved config:

batch_size: 8
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /tmp/gemma-2b/models--google--gemma-2b/snapshots/9cf48e52b224239de00d483ec8eb84fb8d0f3a3a
  checkpoint_files:
  - model-00001-of-00002.safetensors
  - model-00002-of-00002.safetensors
  model_type: GEMMA
  output_dir: ./
device: cpu
dtype: bf16
enable_kv_cache: true
limit: null
max_seq_length: 4096
model:
  _component_: torchtune.models.gemma.gemma_2b
quantizer: null
seed: 1234
tasks:
- truthfulqa_mc2
tokenizer:
  _component_: torchtune.models.gemma.gemma_tokenizer
  path: /tmp/gemma-2b/models--google--gemma-2b/snapshots/9cf48e52b224239de00d483ec8eb84fb8d0f3a3a/tokenizer.model

INFO:torchtune.utils._logging:Model is initialized with precision torch.bfloat16.
config.json: 100%|███████████████████████████████████████████████████████████████████████████| 665/665 [00:00<00:00, 1.06MB/s]
tokenizer_config.json: 100%|████████████████████████████████████████████████████████████████| 26.0/26.0 [00:00<00:00, 119kB/s]
vocab.json: 100%|████████████████████████████████████████████████████████████████████████| 1.04M/1.04M [00:00<00:00, 1.14MB/s]
merges.txt: 100%|██████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 1.57MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████| 1.36M/1.36M [00:00<00:00, 2.14MB/s]
model.safetensors: 100%|███████████████████████████████████████████████████████████████████| 548M/548M [01:32<00:00, 5.95MB/s]
generation_config.json: 100%|█████████████████████████████████████████████████████████████████| 124/124 [00:00<00:00, 712kB/s]
README.md: 100%|█████████████████████████████████████████████████████████████████████████| 9.59k/9.59k [00:00<00:00, 5.65MB/s]
validation-00000-of-00001.parquet: 100%|███████████████████████████████████████████████████| 271k/271k [00:00<00:00, 2.79MB/s]
Generating validation split: 100%|████████████████████████████████████████████████| 817/817 [00:00<00:00, 26416.89 examples/s]
INFO:torchtune.utils._logging:Running evaluation on the following tasks: ['truthfulqa_mc2']
INFO:lm-eval:Building contexts for truthfulqa_mc2 on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████| 817/817 [00:00<00:00, 2121.83it/s]
INFO:lm-eval:Running loglikelihood requests
Running loglikelihood requests: 100%|███████████████████████████████████████████████████| 5882/5882 [8:26:40<00:00,  5.17s/it]
INFO:torchtune.utils._logging:Eval completed in 30404.48 seconds.
INFO:torchtune.utils._logging:Max memory allocated: 0.00 GB
INFO:torchtune.utils._logging:

|    Tasks     |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|--------------|------:|------|-----:|------|---|-----:|---|-----:|
|truthfulqa_mc2|      2|none  |     0|acc   |↑  |0.3995|±  |0.0152|



malinjawi avatar Oct 12 '24 11:10 malinjawi

This is the log after I ran the command tune run eleuther_eval --config MODEL/evaluation:

(torchtune) Abdullahs-MacBook-Pro:Phi-3-mini-4k-instruct abdullah$ tune run eleuther_eval --config phi3/evaluation
W1012 00:32:22.842000 8517586752 torch/distributed/elastic/multiprocessing/redirects.py:28] NOTE: Redirects are currently not supported in Windows or MacOs.
INFO:torchtune.utils._logging:Running EleutherEvalRecipe with resolved config:

batch_size: 8
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Phi-3-mini-4k-instruct/models--microsoft--Phi-3-mini-4k-instruct/snapshots/0a67737cc96d2554230f90338b163bc6380a2a85
  checkpoint_files:
  - model-00001-of-00002.safetensors
  - model-00002-of-00002.safetensors
  model_type: PHI3_MINI
  output_dir: /tmp/Phi-3-mini-4k-instruct/models--microsoft--Phi-3-mini-4k-instruct/snapshots/0a67737cc96d2554230f90338b163bc6380a2a85
  recipe_checkpoint: null
device: cpu
dtype: bf16
enable_kv_cache: true
limit: null
max_seq_length: 4096
model:
  _component_: torchtune.models.phi3.phi3_mini
quantizer: null
resume_from_checkpoint: false
seed: 1234
tasks:
- truthfulqa_mc2
tokenizer:
  _component_: torchtune.models.phi3.phi3_mini_tokenizer
  max_seq_len: null
  path: /tmp/Phi-3-mini-4k-instruct/tokenizer.model

INFO:torchtune.utils._logging:Converting Phi-3 Mini weights from HF format.Note that conversion of adapter weights into PEFT format is not supported.
INFO:torchtune.utils._logging:Model is initialized with precision torch.bfloat16.
config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 665/665 [00:00<00:00, 769kB/s]
tokenizer_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 26.0/26.0 [00:00<00:00, 106kB/s]
vocab.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.04M/1.04M [00:00<00:00, 1.16MB/s]
merges.txt: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 1.26MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.36M/1.36M [00:00<00:00, 1.90MB/s]
model.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 548M/548M [03:43<00:00, 2.45MB/s]
generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 124/124 [00:00<00:00, 2.14MB/s]
README.md: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.59k/9.59k [00:00<00:00, 16.1MB/s]
validation-00000-of-00001.parquet: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 271k/271k [00:00<00:00, 1.36MB/s]
Generating validation split: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 817/817 [00:00<00:00, 45492.21 examples/s]
INFO:torchtune.utils._logging:Running evaluation on the following tasks: ['truthfulqa_mc2']
INFO:lm-eval:Building contexts for truthfulqa_mc2 on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 817/817 [00:00<00:00, 2446.11it/s]
INFO:lm-eval:Running loglikelihood requests
Running loglikelihood requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 5882/5882 [22:42:29<00:00, 13.90s/it]
INFO:torchtune.utils._logging:Eval completed in 81751.63 seconds.
INFO:torchtune.utils._logging:Max memory allocated: 0.00 GB
INFO:torchtune.utils._logging:

|    Tasks     |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|--------------|------:|------|-----:|------|---|-----:|---|-----:|
|truthfulqa_mc2|      2|none  |     0|acc   |↑  |0.5456|±  |0.0151|


Harthi7 avatar Oct 12 '24 20:10 Harthi7

I have made the changes and ran tune run eleuther_eval --config MODEL/evaluation. The command runs, but since my machine does not meet the performance requirements, the evaluation gets killed early.

Is it okay to PR or should I test with another machine? @joecummings

Yousof-kayal avatar Oct 12 '24 21:10 Yousof-kayal

> I have made the changes and ran tune run eleuther_eval --config MODEL/evaluation. The command runs, but since my machine does not meet the performance requirements, it kills the evaluation early.
>
> Is it okay to PR or should I test with another machine? @joecummings

Go ahead and put up the PR and I'll make sure to double check on a beefier machine.

joecummings avatar Oct 14 '24 13:10 joecummings

Hey @joecummings, I'd like to work on Llama3.2 and possibly Code-Llama2. For each model, should I evaluate both parameter variants (e.g., Llama3.2 has 1B and 3B variants)? Or is there a specific focus on one configuration for evaluation?

ReemaAlzaid avatar Nov 16 '24 22:11 ReemaAlzaid

> Hey @joecummings, I'd like to work on Llama3.2 and possibly Code-Llama2. For each model, should I evaluate both parameter variants (e.g., Llama3.2 has 1B and 3B variants)? Or is there a specific focus on one configuration for evaluation?

Hey @ReemaAlzaid - sorry for the late reply!! Would love for you to work on this. For Llama3.2, please just create an evaluation config for the 3B model.

joecummings avatar Nov 18 '24 16:11 joecummings

Hey @joecummings, kindly review this PR https://github.com/pytorch/torchtune/pull/2186 for the Llama3.2 3B eval config. I'm more than happy to take on the remaining models and submit PRs for them.

ReemaAlzaid avatar Dec 20 '24 17:12 ReemaAlzaid

> Hey @joecummings Kindly review this PR #2186 for the Llama3.2 3B eval config. I'm more than happy to take on the remaining models and submit a PR for them

Approved! Would love for you to take on the remainder - tag me on any reviews you need :)

joecummings avatar Dec 20 '24 17:12 joecummings

Hello @joecummings, I'm planning to work on Qwen2.5 model.

Model: Qwen2.5

Ankur-singh avatar Jan 03 '25 15:01 Ankur-singh

Hey @joecummings, here is the output for tune run eleuther_eval --config qwen2_5/evaluation

tune run eleuther_eval --config qwen2_5/evaluation
Running EleutherEvalRecipe with resolved config:

batch_size: 8
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Qwen2_5-0_5B-Instruct
  checkpoint_files:
  - model.safetensors
  model_type: QWEN2
  output_dir: ./
device: cuda
dtype: bf16
enable_kv_cache: true
limit: null
max_seq_length: 4096
model:
  _component_: torchtune.models.qwen2_5.qwen2_5_0_5b
output_dir: ./
quantizer: null
seed: 1234
tasks:
- truthfulqa_mc2
tokenizer:
  _component_: torchtune.models.qwen2_5.qwen2_5_tokenizer
  max_seq_len: null
  merges_file: /tmp/Qwen2_5-0_5B-Instruct/merges.txt
  path: /tmp/Qwen2_5-0_5B-Instruct/vocab.json

2025-01-04:12:21:56,964 INFO     [_utils.py:28] Running EleutherEvalRecipe with resolved config:

batch_size: 8
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Qwen2_5-0_5B-Instruct
  checkpoint_files:
  - model.safetensors
  model_type: QWEN2
  output_dir: ./
device: cuda
dtype: bf16
enable_kv_cache: true
limit: null
max_seq_length: 4096
model:
  _component_: torchtune.models.qwen2_5.qwen2_5_0_5b
output_dir: ./
quantizer: null
seed: 1234
tasks:
- truthfulqa_mc2
tokenizer:
  _component_: torchtune.models.qwen2_5.qwen2_5_tokenizer
  max_seq_len: null
  merges_file: /tmp/Qwen2_5-0_5B-Instruct/merges.txt
  path: /tmp/Qwen2_5-0_5B-Instruct/vocab.json

Model is initialized with precision torch.bfloat16.
2025-01-04:12:21:57,949 INFO     [eleuther_eval.py:503] Model is initialized with precision torch.bfloat16.
2025-01-04:12:21:58,119 INFO     [huggingface.py:132] Using device 'cuda:0'
config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 665/665 [00:00<00:00, 4.51MB/s]
tokenizer_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 26.0/26.0 [00:00<00:00, 188kB/s]
vocab.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.04M/1.04M [00:00<00:00, 4.60MB/s]
merges.txt: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 6.27MB/s]
tokenizer.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.36M/1.36M [00:00<00:00, 6.35MB/s]
2025-01-04:12:22:00,097 INFO     [huggingface.py:369] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'}
model.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉| 548M/548M [01:09<00:00, 7.93MB/s]
generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 124/124 [00:00<00:00, 914kB/s]
README.md: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.59k/9.59k [00:00<00:00, 73.7MB/s]
validation-00000-of-00001.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 271k/271k [00:00<00:00, 9.24MB/s]
Generating validation split: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 817/817 [00:00<00:00, 43743.65 examples/s]
Running evaluation on the following tasks: ['truthfulqa_mc2']
2025-01-04:12:23:20,152 INFO     [eleuther_eval.py:540] Running evaluation on the following tasks: ['truthfulqa_mc2']
2025-01-04:12:23:20,154 INFO     [task.py:415] Building contexts for truthfulqa_mc2 on rank 0...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 817/817 [00:00<00:00, 1699.19it/s]
2025-01-04:12:23:20,661 INFO     [evaluator.py:496] Running loglikelihood requests
Running loglikelihood requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5882/5882 [08:03<00:00, 12.16it/s]
Eval completed in 486.15 seconds.
2025-01-04:12:31:26,298 INFO     [eleuther_eval.py:549] Eval completed in 486.15 seconds.
Max memory allocated: 10.00 GB
2025-01-04:12:31:26,298 INFO     [eleuther_eval.py:550] Max memory allocated: 10.00 GB


|    Tasks     |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|--------------|------:|------|-----:|------|---|-----:|---|-----:|
|truthfulqa_mc2|      2|none  |     0|acc   |↑  |0.4178|±  |0.0146|


2025-01-04:12:31:26,387 INFO     [eleuther_eval.py:554] 

|    Tasks     |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|--------------|------:|------|-----:|------|---|-----:|---|-----:|
|truthfulqa_mc2|      2|none  |     0|acc   |↑  |0.4178|±  |0.0146|

Ankur-singh avatar Jan 04 '25 20:01 Ankur-singh

Hey @joecummings, I would like to pick up Llama3.1. Thanks for the opportunity to contribute.

ysurs avatar May 05 '25 17:05 ysurs

Hey @joecummings, I raised a PR, https://github.com/pytorch/torchtune/pull/2763, for Llama3.1. Please take a look at your convenience.

ysurs avatar May 23 '25 18:05 ysurs