Setting `expandable_segments:True` in our recipes.

Open SalmanMohammadi opened this issue 6 months ago • 7 comments

Context

What is the purpose of this PR? Is it to

  • [x] add a new feature
  • [ ] fix a bug
  • [ ] update tests and/or documentation
  • [ ] other (please add here)

#1185

Changelog

Adding `os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"` to all our recipes.
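For illustration, here is a minimal sketch of how that setting could be applied without clobbering allocator options a user has already exported. The helper name `enable_expandable_segments` is hypothetical and is not part of torchtune; the change described above is a plain assignment. The sketch assumes the variable holds comma-separated `option:value` pairs, and that it is set before the CUDA caching allocator is first used (most safely, before `import torch`).

```python
import os


def enable_expandable_segments(env=os.environ):
    """Hypothetical helper (not a torchtune API).

    Adds expandable_segments:True to PYTORCH_CUDA_ALLOC_CONF unless the
    user has already chosen a value for that option.
    """
    key = "PYTORCH_CUDA_ALLOC_CONF"
    current = env.get(key, "")
    # The variable is a comma-separated list of option:value pairs.
    options = [opt for opt in current.split(",") if opt]
    already_set = any(
        opt.split(":")[0].strip() == "expandable_segments" for opt in options
    )
    if not already_set:
        options.append("expandable_segments:True")
    env[key] = ",".join(options)
    return env[key]


if __name__ == "__main__":
    # Call this at the very top of a recipe, before torch touches CUDA.
    print(enable_expandable_segments())
```

Respecting an existing value is a design choice in this sketch only: it lets someone opt out with `expandable_segments:False` from the shell, whereas a direct assignment always overrides the environment.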

Test plan

Ran on a 2080 Super (8GB), with `PYTORCH_CUDA_ALLOC_CONF` unset in the shell beforehand:

(tune) salman@combuter:~/torchtune$ echo $PYTORCH_CUDA_ALLOC_CONF

(tune) salman@combuter:~/torchtune$ tune run full_finetune_single_device --config qwen2/0.5B_full_single_device log_peak_memory_stats=True metric_logger=torchtune.utils.metric_logging.WandBLogger metric_logging.project=torchtune_mem checkpointer.checkpoint_dir=/home/salman/models/Qwen2-0.5B-Instruct tokenizer.path=/home/salman/models/Qwen2-0.5B-Instruct/vocab.json tokenizer.merges_file=/home/salman/models/Qwen2-0.5B-Instruct/merges.txt max_steps_per_epoch=250
...
1|250|Loss: 1.0745091438293457: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 250/250 [22:37<00:00,  5.43s/it]

WandB logs from the successful run show all peak memory stats <= 8GB.

See #1273 for evidence of the other small models and single-device recipes.

  • [x] run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • [x] add unit tests for any new functionality
  • [ ] update docstrings for any new or updated methods or classes
  • [x] run unit tests via pytest tests
  • [ ] run recipe tests via pytest tests -m integration_test
  • [ ] manually run any new or modified recipes with sufficient proof of correctness
  • [x] include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

SalmanMohammadi • Aug 12 '24 14:08