Toggling KV-caches
Context
What is the purpose of this PR? Is it to
- [x] add a new feature
- [x] fix a bug
- [ ] update tests and/or documentation
- [ ] other (please add here)
Please link to any issues this PR addresses.
Closes #1621. RFC: #1675.
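To sketch the idea behind this PR: toggling KV-caches means being able to temporarily disable (and later restore) the caches on a model's attention layers, e.g. so the same model instance can serve both cached generation and uncached forward passes. The snippet below is a minimal, self-contained illustration using toy classes; the names `AttentionLayer`, `cache_enabled`, and `disable_kv_cache` are illustrative assumptions, not the actual torchtune API.

```python
# Hypothetical sketch of KV-cache toggling via a context manager.
# `AttentionLayer`, `cache_enabled`, and `disable_kv_cache` are stand-ins
# for illustration only; the real torchtune interfaces may differ.
from contextlib import contextmanager


class AttentionLayer:
    """Toy stand-in for an attention module with a KV-cache flag."""

    def __init__(self):
        self.cache_enabled = True


class Model:
    """Toy model holding a stack of attention layers."""

    def __init__(self, num_layers=2):
        self.layers = [AttentionLayer() for _ in range(num_layers)]


@contextmanager
def disable_kv_cache(model):
    """Temporarily disable KV-caching, restoring the prior state on exit."""
    prev = [layer.cache_enabled for layer in model.layers]
    for layer in model.layers:
        layer.cache_enabled = False
    try:
        yield model
    finally:
        # Restore each layer's original cache setting, even on error.
        for layer, state in zip(model.layers, prev):
            layer.cache_enabled = state


model = Model()
with disable_kv_cache(model) as m:
    states_inside = [layer.cache_enabled for layer in m.layers]
states_after = [layer.cache_enabled for layer in model.layers]
print(states_inside, states_after)  # [False, False] [True, True]
```

The context-manager shape matters here: restoring state in `finally` guarantees caches come back on even if evaluation raises mid-block.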
Multimodal eval results
On main
root@736fb59b1bb9:~/torchtune# tune run eleuther_eval --config llama3_2_vision/evaluation limit=5 max_seq_length=2048
Running EleutherEvalRecipe with resolved config:
batch_size: 1
checkpointer:
_component_: torchtune.training.FullModelMetaCheckpointer
checkpoint_dir: /tmp/Llama-3.2-11B-Vision-Instruct/original
checkpoint_files:
- consolidated.pth
model_type: LLAMA3_VISION
output_dir: ./
device: cuda
dtype: bf16
enable_kv_cache: true
limit: 5
log_level: INFO
max_seq_length: 2048
model:
_component_: torchtune.models.llama3_2_vision.llama3_2_vision_11b
quantizer: null
seed: 1234
tasks:
- mmmu_val_science
tokenizer:
_component_: torchtune.models.llama3_2_vision.llama3_2_vision_transform
max_seq_len: 8192
path: /tmp/Llama-3.2-11B-Vision-Instruct/original/tokenizer.model
Model is initialized with precision torch.bfloat16.
Running evaluation on the following tasks: ['mmmu_val_science']
2024-10-09:15:51:51,302 INFO [task.py:415] Building contexts for mmmu_val_biology on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 13001.56it/s]
2024-10-09:15:51:51,338 INFO [task.py:415] Building contexts for mmmu_val_chemistry on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 20846.44it/s]
2024-10-09:15:51:51,342 INFO [task.py:415] Building contexts for mmmu_val_geography on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 20846.44it/s]
2024-10-09:15:51:51,359 INFO [task.py:415] Building contexts for mmmu_val_math on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 14246.96it/s]
2024-10-09:15:51:51,377 INFO [task.py:415] Building contexts for mmmu_val_physics on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 21822.60it/s]
2024-10-09:15:51:51,392 INFO [evaluator.py:489] Running generate_until requests
Running generate_until requests with text+image input: 100%|███████████████████████████████████████████| 25/25 [04:34<00:00, 10.98s/it]
Eval completed in 274.73 seconds.
Max memory allocated: 32.86 GB
| Tasks |Version|Filter|n-shot|Metric| |Value| |Stderr|
|------------|------:|------|------|------|---|----:|---|-----:|
|Science | 0|none | |acc |↑ | 0.32|± |0.0938|
| - Biology | 0|none |None |acc |↑ | 0.20|± |0.2000|
| - Chemistry| 0|none |None |acc |↑ | 0.00|± |0.0000|
| - Geography| 0|none |None |acc |↑ | 0.40|± |0.2449|
| - Math | 0|none |None |acc |↑ | 0.40|± |0.2449|
| - Physics | 0|none |None |acc |↑ | 0.60|± |0.2449|
On this branch
root@736fb59b1bb9:~/torchtune# tune run eleuther_eval --config llama3_2_vision/evaluation limit=5 max_seq_length=2048
Running EleutherEvalRecipe with resolved config:
batch_size: 1
checkpointer:
_component_: torchtune.training.FullModelMetaCheckpointer
checkpoint_dir: /tmp/Llama-3.2-11B-Vision-Instruct/original
checkpoint_files:
- consolidated.pth
model_type: LLAMA3_VISION
output_dir: ./
device: cuda
dtype: bf16
enable_kv_cache: true
limit: 5
log_level: INFO
max_seq_length: 2048
model:
_component_: torchtune.models.llama3_2_vision.llama3_2_vision_11b
quantizer: null
seed: 1234
tasks:
- mmmu_val_science
tokenizer:
_component_: torchtune.models.llama3_2_vision.llama3_2_vision_transform
max_seq_len: 8192
path: /tmp/Llama-3.2-11B-Vision-Instruct/original/tokenizer.model
Model is initialized with precision torch.bfloat16.
Running evaluation on the following tasks: ['mmmu_val_science']
2024-10-09:15:46:10,594 INFO [task.py:415] Building contexts for mmmu_val_biology on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 12336.19it/s]
2024-10-09:15:46:10,631 INFO [task.py:415] Building contexts for mmmu_val_chemistry on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 20301.57it/s]
2024-10-09:15:46:10,635 INFO [task.py:415] Building contexts for mmmu_val_geography on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 15911.62it/s]
2024-10-09:15:46:10,653 INFO [task.py:415] Building contexts for mmmu_val_math on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 13626.72it/s]
2024-10-09:15:46:10,671 INFO [task.py:415] Building contexts for mmmu_val_physics on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 20223.26it/s]
2024-10-09:15:46:10,687 INFO [evaluator.py:489] Running generate_until requests
Running generate_until requests with text+image input: 100%|███████████████████████████████████████████| 25/25 [04:33<00:00, 10.92s/it]
Eval completed in 273.26 seconds.
Max memory allocated: 32.86 GB
| Tasks |Version|Filter|n-shot|Metric| |Value| |Stderr|
|------------|------:|------|------|------|---|----:|---|-----:|
|Science | 0|none | |acc |↑ | 0.32|± |0.0938|
| - Biology | 0|none |None |acc |↑ | 0.20|± |0.2000|
| - Chemistry| 0|none |None |acc |↑ | 0.00|± |0.0000|
| - Geography| 0|none |None |acc |↑ | 0.40|± |0.2449|
| - Math | 0|none |None |acc |↑ | 0.40|± |0.2449|
| - Physics | 0|none |None |acc |↑ | 0.60|± |0.2449|
Text eval results
(tune) salman@combuter:~/torchtune$ tune run eleuther_eval --config target/eleuther_evaluation.yaml
2024-10-08:21:18:07,202 INFO [_logging.py:101] Running EleutherEvalRecipe with resolved config:
batch_size: 1
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: ./target/1b_normal
checkpoint_files:
- pytorch_model.bin
model_type: LLAMA2
output_dir: ./target/tmp
device: cuda
dtype: bf16
enable_kv_cache: true
limit: 20
max_seq_length: 1024
model:
_component_: torchtune.models.llama2.llama2
embed_dim: 2048
max_seq_len: 4096
norm_eps: 1.0e-05
num_heads: 32
num_kv_heads: 4
num_layers: 22
vocab_size: 32000
quantizer: null
seed: 1234
tasks:
- truthfulqa_gen
- truthfulqa_mc2
tokenizer:
_component_: torchtune.models.llama2.llama2_tokenizer
path: ./target/1b_normal/tokenizer.model
2024-10-08:21:18:08,854 INFO [eleuther_eval.py:495] Model is initialized with precision torch.bfloat16.
2024-10-08:21:18:08,879 INFO [huggingface.py:132] Using device 'cuda:0'
/home/salman/.pyenv/versions/3.11.9/envs/tune/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
2024-10-08:21:18:09,228 INFO [huggingface.py:368] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'}
2024-10-08:21:18:10,532 INFO [__init__.py:491] `group` and `group_alias` keys in TaskConfigs are deprecated and will be removed in v0.4.5 of lm_eval. The new `tag` field will be used to allow for a shortcut to a group of tasks one does not wish to aggregate metrics across. `group`s which aggregate across subtasks must be only defined in a separate group config file, which will be the official way to create groups that support cross-task aggregation as in `mmlu`. Please see the v0.4.4 patch notes and our documentation: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#advanced-group-configs for more information.
2024-10-08:21:18:23,103 INFO [eleuther_eval.py:537] Running evaluation on the following tasks: ['truthfulqa_gen', 'truthfulqa_mc2']
2024-10-08:21:18:23,106 INFO [task.py:428] Building contexts for truthfulqa_mc2 on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 903.93it/s]
2024-10-08:21:18:23,130 INFO [task.py:428] Building contexts for truthfulqa_gen on rank 0...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 1604.19it/s]
2024-10-08:21:18:23,147 INFO [evaluator.py:485] Running loglikelihood requests
Running loglikelihood requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 153/153 [00:18<00:00, 8.37it/s]
2024-10-08:21:18:41,510 INFO [evaluator.py:485] Running generate_until requests
Running generate_until requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [02:48<00:00, 8.43s/it]
2024-10-08:21:21:30,179 INFO [rouge_scorer.py:83] Using default tokenizer.
2024-10-08:21:21:49,314 INFO [eleuther_eval.py:546] Eval completed in 206.21 seconds.
2024-10-08:21:21:49,314 INFO [eleuther_eval.py:547] Max memory allocated: 3.41 GB
2024-10-08:21:21:49,432 INFO [eleuther_eval.py:551]
| Tasks |Version|Filter|n-shot| Metric | | Value | |Stderr|
|--------------|------:|------|-----:|-----------|---|-------:|---|-----:|
|truthfulqa_gen| 3|none | 0|bleu_acc |↑ | 0.3000|± |0.1051|
| | |none | 0|bleu_diff |↑ |-11.1260|± |3.5818|
| | |none | 0|bleu_max |↑ | 23.0317|± |4.5452|
| | |none | 0|rouge1_acc |↑ | 0.4500|± |0.1141|
| | |none | 0|rouge1_diff|↑ | -6.7262|± |3.7765|
| | |none | 0|rouge1_max |↑ | 49.9555|± |4.9504|
| | |none | 0|rouge2_acc |↑ | 0.3000|± |0.1051|
| | |none | 0|rouge2_diff|↑ |-12.0537|± |4.2492|
| | |none | 0|rouge2_max |↑ | 31.7591|± |6.2387|
| | |none | 0|rougeL_acc |↑ | 0.3500|± |0.1094|
| | |none | 0|rougeL_diff|↑ | -8.0059|± |3.4676|
| | |none | 0|rougeL_max |↑ | 47.6440|± |5.3441|
|truthfulqa_mc2| 2|none | 0|acc |↑ | 0.4769|± |0.0947|
Test plan
Please make sure to do each of the following if applicable to your PR. If you're unsure about any of these, just ask and we will happily help. We also have a contributing page for guidance on contributing.
- [ ] run pre-commit hooks and linters (make sure you've first installed via `pre-commit install`)
- [ ] add unit tests for any new functionality
- [ ] update docstrings for any new or updated methods or classes
- [ ] run unit tests via `pytest tests`
- [ ] run recipe tests via `pytest tests -m integration_test`
- [ ] manually run any new or modified recipes with sufficient proof of correctness
- [ ] include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)
UX
If your function changed a public API, please add a dummy example of what the user experience will look like when calling it. Here is a docstring example and a tutorial example.
- [ ] I did not change any public API
- [ ] I have added an example to docs or docstrings