[WIP] max-autotune
Context
What is the purpose of this PR? Is it to
- [X] add a new feature
- [ ] fix a bug
- [ ] update tests and/or documentation
- [ ] other (please add here)
Please link to any issues this PR addresses.
Changelog
What are the changes made in this PR?
- max-autotune, #2373
Test plan
Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.
- [ ] run pre-commit hooks and linters (make sure you've first installed via `pre-commit install`)
- [ ] add unit tests for any new functionality
- [ ] update docstrings for any new or updated methods or classes
- [ ] run unit tests via `pytest tests`
- [ ] run recipe tests via `pytest tests -m integration_test`
- [ ] manually run any new or modified recipes with sufficient proof of correctness
- [ ] include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)
UX
If your function changed a public API, please add a dummy example of what the user experience will look like when calling it. Here is a docstring example and a tutorial example
- [ ] I did not change any public API
- [ ] I have added an example to docs or docstrings
Not for review right now.
:link: Helpful Links
:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2393
- :page_facing_up: Preview Python docs built from this PR
Note: Links to docs will display an error until the docs builds have been completed.
qwen2.5 3B full max-autotune: false, compile: true
qwen2.5 3B lora max-autotune: false, compile: true
Had to add torch.compiler.cudagraph_mark_step_begin(), as the run failed with a strange error without it.
With max-autotune: True, compilation took ~16 minutes, the loss became NaN, and I assume there is no real speedup.
All of these runs also eventually failed with:

```
RuntimeError: These live storage data ptrs are in the cudagraph pool but not accounted for as an output of cudagraph trees:
Data Pointer: 140125209566720, history:
```
Repro:
First, manually fork torchtune: https://github.com/pytorch/torchtune
Then:
```bash
git clone https://github.com/<YOUR_GITHUB_USER>/torchtune.git
cd torchtune
git remote add krammnic https://github.com/krammnic/torchtune.git
git remote add upstream https://github.com/pytorch/torchtune.git
git fetch krammnic
git checkout -b max-autotune krammnic/max-autotune
conda create --name max-autotune python=3.11
conda activate max-autotune
pip3 install --pre --upgrade torch torchvision torchao --index-url https://download.pytorch.org/whl/nightly/cu126
pip3 install -e .
tune download meta-llama/Llama-3.2-1B-Instruct --output-dir /tmp/Llama-3.2-1B-Instruct --ignore-patterns "original/consolidated.00.pth"
tune cp llama3_2/1B_lora_single_device .
CUDA_VISIBLE_DEVICES=0 tune run lora_finetune_single_device --config 1B_lora_single_device.yaml max_autotune=True compile=True
```
Findings:
1. Works only with `max-autotune` for compiling flex attention.
2. `max-autotune` for model compiling without `torch.compiler.cudagraph_mark_step_begin()`:
```
RuntimeError: Error: accessing tensor output of CUDAGraphs that has been overwritten by a subsequent run. Stack trace: [Could not find stack trace]. To prevent overwriting, clone the tensor outside of torch.compile() or call torch.compiler.cudagraph_mark_step_begin() before each model invocation.
```
3. `max-autotune` for model compiling with `torch.compiler.cudagraph_mark_step_begin()`: loss is `nan`.
4. `max-autotune` for loss + flex compiling (no model) produces a warning:
```
packages/torch/_inductor/cudagraph_trees.py:2345: UserWarning: Unable to hit fast path of CUDAGraphs because of pending, uninvoked backwards. Consider running with torch.no_grad() or using torch.compiler.cudagraph_mark_step_begin() before each model invocation
```
Then after 3 steps:
```
RuntimeError: These live storage data ptrs are in the cudagraph pool but not accounted for as an output of cudagraph trees:
Data Pointer: 140442444234752, history:
```
5. Loss + model + flex compiling: same as 4.
Can we just expose the compile mode and pass it into the compile method?
@joecummings Yes, let's do this