Setting `expandable_segments:True` in our recipes.
Context
What is the purpose of this PR? Is it to
- [x] add a new feature
- [ ] fix a bug
- [ ] update tests and/or documentation
- [ ] other (please add here)
#1185
Changelog
Adding `os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"` to all our recipes.
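For illustration, a minimal sketch of where such a line could sit in a recipe entry point (the exact placement inside torchtune's recipes is an assumption here). The variable has to be set before the first CUDA allocation, otherwise the caching allocator is already configured and the flag is ignored:

```python
import os

# Must run before torch allocates any CUDA memory; the allocator reads
# PYTORCH_CUDA_ALLOC_CONF once, when it is first initialized.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # noqa: E402  (imported after setting the env var)


def main() -> None:
    # All subsequent CUDA allocations use expandable segments,
    # which can reduce fragmentation-driven peak reserved memory.
    device = torch.device("cuda")
    model = torch.nn.Linear(1024, 1024).to(device)
    _ = model(torch.randn(8, 1024, device=device))


if __name__ == "__main__":
    main()
```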
Test plan
Run on 2080 Super 8GB.
```
(tune) salman@combuter:~/torchtune$ echo $PYTORCH_CUDA_ALLOC_CONF
(tune) salman@combuter:~/torchtune$ tune run full_finetune_single_device --config qwen2/0.5B_full_single_device log_peak_memory_stats=True metric_logger=torchtune.utils.metric_logging.WandBLogger metric_logging.project=torchtune_mem checkpointer.checkpoint_dir=/home/salman/models/Qwen2-0.5B-Instruct tokenizer.path=/home/salman/models/Qwen2-0.5B-Instruct/vocab.json tokenizer.merges_file=/home/salman/models/Qwen2-0.5B-Instruct/merges.txt max_steps_per_epoch=250
...
1|250|Loss: 1.0745091438293457: 100%|██████████| 250/250 [22:37<00:00, 5.43s/it]
```
WandB logs of the successful run show all peak memory stats <= 8GB.
See #1273 for evidence of the other small models and single-device recipes.
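As a rough way to sanity-check the peak-memory numbers outside of WandB (a sketch, not the recipe's actual logging code; torchtune's `log_peak_memory_stats` presumably derives its figures from the same allocator counters, but that is an assumption here), the high-water marks can be read directly from torch after a run:

```python
import torch

# Allocator high-water marks since the start of the process (or since the
# last call to torch.cuda.reset_peak_memory_stats()).
peak_alloc_gib = torch.cuda.max_memory_allocated() / 1024**3
peak_reserved_gib = torch.cuda.max_memory_reserved() / 1024**3
print(f"peak allocated: {peak_alloc_gib:.2f} GiB")
print(f"peak reserved:  {peak_reserved_gib:.2f} GiB")

# The 8GB bound matches the 2080 Super used in the test plan above.
assert peak_reserved_gib <= 8.0
```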
- [x] run pre-commit hooks and linters (make sure you've first installed via `pre-commit install`)
- [x] add unit tests for any new functionality
- [ ] update docstrings for any new or updated methods or classes
- [x] run unit tests via `pytest tests`
- [ ] run recipe tests via `pytest tests -m integration_test`
- [ ] manually run any new or modified recipes with sufficient proof of correctness
- [x] include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)