[ROCm] TunableOp improvements
- use less memory; smaller default hipblaslt workspace size
- options to avoid cache effects
- icache flush option
- rotating buffers during tuning
- python APIs
- unit tests
cc @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang
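As a rough illustration of the options above, the sketch below shows one way a tuning run might be driven through the environment-variable interface. PYTORCH_TUNABLEOP_ENABLED and PYTORCH_TUNABLEOP_NUMERICAL_CHECK appear elsewhere in this thread; the icache-flush and rotating-buffer variable names and units are assumptions, not confirmed names.
```python
# Minimal sketch of driving TunableOp from the environment (assumed usage).
import os

os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"                 # turn TunableOp on (seen in this thread)
os.environ["PYTORCH_TUNABLEOP_NUMERICAL_CHECK"] = "0"         # skip numerical checking (seen in this thread)
os.environ["PYTORCH_TUNABLEOP_ICACHE_FLUSH_ENABLED"] = "1"    # assumed name: flush the icache between timed runs
os.environ["PYTORCH_TUNABLEOP_ROTATING_BUFFER_SIZE"] = "256"  # assumed name/units: rotate buffers to avoid cache effects

import torch  # set the env vars before the first tuned GEMM is dispatched

a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
c = a @ b  # the first call for a given GEMM shape triggers tuning; later calls reuse the selected kernel
torch.cuda.synchronize()
```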
:link: Helpful Links
:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/124362
- :page_facing_up: Preview Python docs built from this PR
- :page_facing_up: Preview C++ docs built from this PR
- :question: Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours
Note: Links to docs will display an error until the docs builds have been completed.
:x: 1 New Failure
As of commit b3f232b8593e127ef1d78aed8d588aa489d47cfb with merge base 090a031d6f146da3691e99f5f86220d766af4b10:
NEW FAILURE - The following job has failed:
- trunk / macos-13-py3-arm64 / build (gh)
/Users/ec2-user/runner/_work/pytorch/pytorch/c10/util/StringUtil.cpp:45:8: error: 'wstring_convert<std::codecvt_utf8_utf16<wchar_t>>' is deprecated [-Werror,-Wdeprecated-declarations]
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@xw285cornell has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Thanks @jeffdaily ! QQ: are you going to expose that icache stuff via an env var or just the API?
I've got a few more commits to land, not just for fixing the failing CUDA UTs. All important options will be available via env vars and Python APIs, with the env vars taking precedence.
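For illustration only, here is a sketch of the precedence just described, assuming the Python bindings land under torch.cuda.tunable with enable()/is_enabled() helpers (assumed names).
```python
# Hedged sketch: env var vs. Python API precedence (module path and helper
# names are assumptions about how the Python API lands).
import os
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "0"  # env var explicitly disables TunableOp

import torch
import torch.cuda.tunable as tunable           # assumed module path

tunable.enable(True)                           # programmatic request to enable
# If env vars take precedence as described above, the env setting should win here:
print(tunable.is_enabled())
```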
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here
Successfully rebased tunableop-improvements onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout tunableop-improvements && git pull --rebase)
@jianyuh please re-import so the Meta Internal-Only Changes Check can pass.
@albanD documentation is updated.
@xw285cornell has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@xw285cornell or @albanD can we get this landed? It needs the Meta Internal-Only Changes check performed. Huggingface is depending on these changes, among others. Thanks.
@jeffdaily do you think that new error reported is related? (it looks like it does)
I've looked at the failure log, and though the stack trace includes at::cuda::blas::gemm_internal_cublas, which was introduced in the first TunableOp PR, I don't see how the failure relates to this current PR. Am I missing something from the error logs?
hmm, do you think this test is already broken on trunk because of the first tunableop PR? Regardless, I think it may be a good idea to add "gemm_internal_cublas" to the expected error stack in the unit test to make it happy.
Also what do you think about changing the default of icache flush and rotating buffer to true?
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here
Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/124362/head returned non-zero exit code 1
Rebasing (1/25) ... Rebasing (16/25)
Auto-merging aten/src/ATen/cuda/tunable/TunableGemm.h
CONFLICT (content): Merge conflict in aten/src/ATen/cuda/tunable/TunableGemm.h
error: could not apply a7d9ae9c528... add PYTORCH_TUNABLEOP_ROCBLAS_ENABLED
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Could not apply a7d9ae9c528... add PYTORCH_TUNABLEOP_ROCBLAS_ENABLED
Raised by https://github.com/pytorch/pytorch/actions/runs/9282691053
@xw285cornell the failing cuda tests have been seen before. Not due to this PR or the prior tunableop PR. https://github.com/pytorch/pytorch/issues/103369
Discussed offline; it looks like a pre-existing flaky test. @jeffdaily will also have one more commit switching the rotating buffer and icache flush defaults to true.
@xw285cornell has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours).
Learn more about merging in the wiki.
Questions? Feedback? Please reach out to the PyTorch DevX Team
Advanced Debugging: check the merge workflow status here
Merge failed
Reason: 1 jobs have failed, first few of them are: trunk / macos-13-py3-arm64 / build
Details for Dev Infra team
Raised by workflow job
@pytorchbot merge -f "unrelated build failure on mac"
Merge started
Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.
Learn more about merging in the wiki.
Questions? Feedback? Please reach out to the PyTorch DevX Team
Advanced Debugging: check the merge workflow status here
@jeffdaily I have what appears to be a memory leak using tunableop. Notably, I am not using PYTORCH_TUNABLEOP_NUMERICAL_CHECK=1 https://github.com/pytorch/pytorch/pull/129281.
Essentially, after running a few forward passes, the available memory with PYTORCH_TUNABLEOP_ENABLED=1 is significantly lower than with PYTORCH_TUNABLEOP_ENABLED=0. Is this expected?
Edit: explicitly setting PYTORCH_TUNABLEOP_NUMERICAL_CHECK=0 greatly reduces the leak (a very small one remains; note I am not using the fix https://github.com/pytorch/pytorch/pull/129281). This is surprising, as the doc says it defaults to 0 anyway: https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/cuda/tunable#environment-variable-interface
#129281 fixes two things. First, the numeric check feature was changed to be off by default, but the env var parsing wasn't updated to allow you to turn it back on. The hotfix PR also fixes the mem leak when using numeric check. Please use the hotfix. I was not able to detect any further mem leak after it.
Even with the hotfix, you might be seeing normal behavior that looks like a mem leak. Any time a new kernel is run, its code object has to be copied into device memory. So if you have a diverse workload, such as trying every possible gemm kernel and picking the fastest, you'll see more memory pressure simply because those kernels take up device memory.
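To tell the two apart, a quick check like the sketch below can help: the caching allocator's view should stay flat across repeated passes of the same shapes, while the driver-reported free memory may drop once per newly loaded kernel and then stabilize. Run it with PYTORCH_TUNABLEOP_ENABLED=1 and =0 and compare the trends. (The script is an illustration, not part of this PR.)
```python
# Distinguish allocator growth (a real leak) from kernel code objects
# consuming device memory during tuning.
import torch

def report(tag):
    free, total = torch.cuda.mem_get_info()    # driver view: includes loaded code objects
    allocated = torch.cuda.memory_allocated()  # allocator view: live tensors only
    print(f"{tag}: free={free / 2**20:.1f} MiB, allocated={allocated / 2**20:.1f} MiB")

a = torch.randn(2048, 2048, device="cuda", dtype=torch.float16)
b = torch.randn(2048, 2048, device="cuda", dtype=torch.float16)

report("before")
for step in range(5):
    c = a @ b                                  # same GEMM shape; tuning only happens on the first call
    torch.cuda.synchronize()
    report(f"step {step}")
```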
Thanks a lot @jeffdaily