
[ROCm] TunableOp improvements

Open: jeffdaily opened this pull request 1 year ago • 9 comments

  • use less memory; smaller default hipblaslt workspace size
  • options to avoid cache effects
    • icache flush option
    • rotating buffers during tuning
  • python APIs (see the sketch after this list)
  • unit tests
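
A minimal usage sketch of the Python API item above, assuming the torch.cuda.tunable module landed with this PR and exposes enable, tuning_enable, and set_filename helpers (names are illustrative where not confirmed in this thread):

```python
import torch
import torch.cuda.tunable as tunable

# Hedged sketch: assumes these helpers exist as part of the Python API added here.
tunable.enable(True)                            # turn TunableOp on for this process
tunable.tuning_enable(True)                     # tune GEMMs that have no stored result yet
tunable.set_filename("tunableop_results.csv")   # where tuning results are read/written

a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
c = a @ b   # first call for this GEMM shape triggers tuning; later calls reuse the stored result
```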

cc @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang

jeffdaily avatar Apr 18 '24 05:04 jeffdaily

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/124362

Note: Links to docs will display an error until the docs builds have been completed.

:x: 1 New Failure

As of commit b3f232b8593e127ef1d78aed8d588aa489d47cfb with merge base 090a031d6f146da3691e99f5f86220d766af4b10 (image):

NEW FAILURE - The following job has failed:

  • trunk / macos-13-py3-arm64 / build (gh) /Users/ec2-user/runner/_work/pytorch/pytorch/c10/util/StringUtil.cpp:45:8: error: 'wstring_convert<std::codecvt_utf8_utf16<wchar_t>>' is deprecated [-Werror,-Wdeprecated-declarations]

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar Apr 18 '24 05:04 pytorch-bot[bot]

@xw285cornell has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Apr 18 '24 20:04 facebook-github-bot

Thanks @jeffdaily! Quick question: are you going to expose the icache options via an env var or just the Python API?

xw285cornell avatar Apr 18 '24 20:04 xw285cornell

> Thanks @jeffdaily! Quick question: are you going to expose the icache options via an env var or just the Python API?

I've got a few more commits to land, not just the fix for the failing CUDA UTs. All important options will be available via env vars and Python APIs, with the env vars taking precedence.
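
For illustration, a hedged sketch of that env-var precedence; only PYTORCH_TUNABLEOP_ENABLED and PYTORCH_TUNABLEOP_NUMERICAL_CHECK are confirmed elsewhere in this thread, and the icache flush and rotating buffer variable names and values below are assumptions:

```python
import os

# Set env vars before torch initializes TunableOp; per the comment above they
# take precedence over whatever the Python API later requests.
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"
os.environ["PYTORCH_TUNABLEOP_NUMERICAL_CHECK"] = "0"
# Assumed names for the new cache-mitigation options (not confirmed in this thread):
os.environ["PYTORCH_TUNABLEOP_ICACHE_FLUSH_ENABLED"] = "1"
os.environ["PYTORCH_TUNABLEOP_ROTATING_BUFFER_SIZE"] = "256"   # assumed units/meaning

import torch
import torch.cuda.tunable as tunable   # assumed Python API from this PR

print(tunable.is_enabled())   # expected True: the env var wins over any later API default
```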

jeffdaily avatar Apr 18 '24 22:04 jeffdaily

@pytorchbot rebase

jeffdaily avatar Apr 23 '24 15:04 jeffdaily

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot avatar Apr 23 '24 15:04 pytorchmergebot

Successfully rebased tunableop-improvements onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout tunableop-improvements && git pull --rebase)

pytorchmergebot avatar Apr 23 '24 15:04 pytorchmergebot

@jianyuh please re-import so the Meta Internal-Only Changes Check can pass.

@albanD documentation is updated.

jeffdaily avatar Apr 24 '24 16:04 jeffdaily

@xw285cornell has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Apr 30 '24 01:04 facebook-github-bot

@xw285cornell or @albanD, can we get this landed? The Meta Internal-Only Changes Check still needs to be performed. Hugging Face, among others, is depending on these changes. Thanks.

jeffdaily avatar May 20 '24 16:05 jeffdaily

@jeffdaily do you think the newly reported error is related? (It looks like it is.)

xw285cornell avatar May 23 '24 03:05 xw285cornell

> @jeffdaily do you think the newly reported error is related? (It looks like it is.)

I've looked at the failure log, and though the stack trace includes at::cuda::blas::gemm_internal_cublas, which was introduced in the first TunableOp PR, I don't see how the failure relates to this PR. Am I missing something in the error logs?

jeffdaily avatar May 28 '24 22:05 jeffdaily

Hmm, do you think this test is already broken on trunk because of the first TunableOp PR? Regardless, it may be a good idea to add "gemm_internal_cublas" to the expected error stack in the unit test so it passes.

Also, what do you think about changing the defaults for icache flush and rotating buffers to true?

xw285cornell avatar May 29 '24 07:05 xw285cornell

@pytorchbot rebase

xw285cornell avatar May 29 '24 08:05 xw285cornell

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot avatar May 29 '24 08:05 pytorchmergebot

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/124362/head returned non-zero exit code 1

Rebasing (1/25)
Rebasing (2/25)
Rebasing (3/25)
Rebasing (4/25)
Rebasing (5/25)
Rebasing (6/25)
Rebasing (7/25)
Rebasing (8/25)
Rebasing (9/25)
Rebasing (10/25)
Rebasing (11/25)
Rebasing (12/25)
Rebasing (13/25)
Rebasing (14/25)
Rebasing (15/25)
Rebasing (16/25)
Auto-merging aten/src/ATen/cuda/tunable/TunableGemm.h
CONFLICT (content): Merge conflict in aten/src/ATen/cuda/tunable/TunableGemm.h
error: could not apply a7d9ae9c528... add PYTORCH_TUNABLEOP_ROCBLAS_ENABLED
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Could not apply a7d9ae9c528... add PYTORCH_TUNABLEOP_ROCBLAS_ENABLED

Raised by https://github.com/pytorch/pytorch/actions/runs/9282691053

pytorchmergebot avatar May 29 '24 08:05 pytorchmergebot

@xw285cornell the failing CUDA tests have been seen before; they're not due to this PR or the prior TunableOp PR. https://github.com/pytorch/pytorch/issues/103369

jeffdaily avatar May 29 '24 21:05 jeffdaily

Discussed offline; it looks like a pre-existing flaky test. @jeffdaily will also add one more commit switching the defaults for rotating buffers and icache flush to true.

xw285cornell avatar May 30 '24 01:05 xw285cornell

@xw285cornell has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Jun 03 '24 20:06 facebook-github-bot

@pytorchbot merge

jeffdaily avatar Jun 03 '24 21:06 jeffdaily

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here

pytorchmergebot avatar Jun 03 '24 21:06 pytorchmergebot

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / macos-13-py3-arm64 / build

Details for Dev Infra team · Raised by workflow job

pytorchmergebot avatar Jun 03 '24 21:06 pytorchmergebot

@pytorchbot merge -f "unrelated build failure on mac"

jeffdaily avatar Jun 03 '24 22:06 jeffdaily

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here

pytorchmergebot avatar Jun 03 '24 22:06 pytorchmergebot

@jeffdaily I have what appears to be a memory leak when using TunableOp. Notably, I am not using PYTORCH_TUNABLEOP_NUMERICAL_CHECK=1 (see https://github.com/pytorch/pytorch/pull/129281).

Essentially, after running a few forward passes, the available memory with PYTORCH_TUNABLEOP_ENABLED=1 is significantly lower than with PYTORCH_TUNABLEOP_ENABLED=0. Is this expected?

Edit: explicitly setting PYTORCH_TUNABLEOP_NUMERICAL_CHECK=0 greatly reduces the leak (a very small one remains; note I am not using the fix from https://github.com/pytorch/pytorch/pull/129281). This is surprising, as the docs say it already defaults to 0: https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/cuda/tunable#environment-variable-interface
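
As a rough way to check, one might compare free device memory over repeated forward passes (a minimal sketch, assuming PYTORCH_TUNABLEOP_ENABLED is toggled via the environment for two separate runs):

```python
import torch

def free_mib() -> float:
    free_bytes, _total = torch.cuda.mem_get_info()
    return free_bytes / 2**20

model = torch.nn.Linear(4096, 4096, device="cuda", dtype=torch.float16)
x = torch.randn(64, 4096, device="cuda", dtype=torch.float16)

# Run once with PYTORCH_TUNABLEOP_ENABLED=0 and once with =1 (separate processes)
# and compare the traces: a true leak keeps dropping every step, while one-time
# drops after new kernels are selected are expected (see the reply below).
for step in range(20):
    with torch.no_grad():
        _ = model(x)
    torch.cuda.synchronize()
    print(f"step {step:2d}: {free_mib():.1f} MiB free")
```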

fxmarty avatar Jun 28 '24 12:06 fxmarty

#129281 fixes two things. First, the numeric check feature was changed to off by default, but the env var parsing wasn't updated to allow you to turn it back on. Second, it fixes the memory leak when numeric check is in use. Please use the hotfix; I was not able to detect any further memory leak after applying it.

Even with the hotfix, you might be seeing normal behavior that looks like a memory leak. Any time a new kernel is run, its code object has to be copied into device memory. So if you have a diverse workload, such as trying every possible GEMM kernel and picking the fastest, you'll see more memory pressure simply because kernels take up device memory.
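
To illustrate that point, a minimal sketch showing the one-time drop in free device memory around the first launch of a new kernel (code object load plus any BLAS workspace), as opposed to a recurring leak:

```python
import torch

def free_mib() -> float:
    return torch.cuda.mem_get_info()[0] / 2**20

a = torch.randn(2048, 2048, device="cuda")
b = torch.randn(2048, 2048, device="cuda")
out = torch.empty(2048, 2048, device="cuda")   # preallocate output so only kernel loading varies
torch.cuda.synchronize()

before = free_mib()
torch.matmul(a, b, out=out)   # first launch: GEMM code object (and workspace) may be loaded
torch.cuda.synchronize()
after_first = free_mib()

torch.matmul(a, b, out=out)   # reuses the already-loaded code object
torch.cuda.synchronize()
after_second = free_mib()

print(f"one-time drop on first launch : {before - after_first:.1f} MiB")
print(f"drop on second launch         : {after_first - after_second:.1f} MiB (expected ~0)")
```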

jeffdaily avatar Jun 28 '24 16:06 jeffdaily

Thanks a lot @jeffdaily

fxmarty avatar Jul 01 '24 08:07 fxmarty