
[ROCm] TunableOp improvements

Open: jeffdaily opened this pull request 1 year ago • 9 comments

  • use less memory; smaller default hipblaslt workspace size
  • options to avoid cache effects
    • icache flush option
    • rotating buffers during tuning
  • python APIs (see the sketch after this list)
  • unit tests
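
A minimal usage sketch of the Python API item above, assuming the torch.cuda.tunable module landed with this PR and exposes enable, tuning_enable, and set_filename helpers (names are illustrative where not confirmed in this thread):

```python
import torch
import torch.cuda.tunable as tunable

# Hedged sketch: assumes these helpers exist as part of the Python API added here.
tunable.enable(True)                            # turn TunableOp on for this process
tunable.tuning_enable(True)                     # tune GEMMs that have no stored result yet
tunable.set_filename("tunableop_results.csv")   # where tuning results are read/written

a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
c = a @ b   # first call for this GEMM shape triggers tuning; later calls reuse the stored result
```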

cc @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang

jeffdaily avatar Apr 18 '24 05:04 jeffdaily

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/124362

Note: Links to docs will display an error until the docs builds have been completed.

:x: 1 New Failure

As of commit b3f232b8593e127ef1d78aed8d588aa489d47cfb with merge base 090a031d6f146da3691e99f5f86220d766af4b10 (image):

NEW FAILURE - The following job has failed:

  • trunk / macos-13-py3-arm64 / build (gh) /Users/ec2-user/runner/_work/pytorch/pytorch/c10/util/StringUtil.cpp:45:8: error: 'wstring_convert<std::codecvt_utf8_utf16<wchar_t>>' is deprecated [-Werror,-Wdeprecated-declarations]

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar Apr 18 '24 05:04 pytorch-bot[bot]

@xw285cornell has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Apr 18 '24 20:04 facebook-github-bot

Thanks @jeffdaily! Quick question: are you going to expose the icache options via an env var or just the Python API?

xw285cornell avatar Apr 18 '24 20:04 xw285cornell

> Thanks @jeffdaily! Quick question: are you going to expose the icache options via an env var or just the Python API?

I've got a few more commits to land, not just the fix for the failing CUDA UTs. All important options will be available via env vars and Python APIs, with the env vars taking precedence.
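
For illustration, a hedged sketch of that env-var precedence; only PYTORCH_TUNABLEOP_ENABLED and PYTORCH_TUNABLEOP_NUMERICAL_CHECK are confirmed elsewhere in this thread, and the icache flush and rotating buffer variable names and values below are assumptions:

```python
import os

# Set env vars before torch initializes TunableOp; per the comment above they
# take precedence over whatever the Python API later requests.
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"
os.environ["PYTORCH_TUNABLEOP_NUMERICAL_CHECK"] = "0"
# Assumed names for the new cache-mitigation options (not confirmed in this thread):
os.environ["PYTORCH_TUNABLEOP_ICACHE_FLUSH_ENABLED"] = "1"
os.environ["PYTORCH_TUNABLEOP_ROTATING_BUFFER_SIZE"] = "256"   # assumed units/meaning

import torch
import torch.cuda.tunable as tunable   # assumed Python API from this PR

print(tunable.is_enabled())   # expected True: the env var wins over any later API default
```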

jeffdaily avatar Apr 18 '24 22:04 jeffdaily

@pytorchbot rebase

jeffdaily avatar Apr 23 '24 15:04 jeffdaily

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot avatar Apr 23 '24 15:04 pytorchmergebot

Successfully rebased tunableop-improvements onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout tunableop-improvements && git pull --rebase)

pytorchmergebot avatar Apr 23 '24 15:04 pytorchmergebot

@jianyuh please re-import so the Meta Internal-Only Changes Check can pass.

@albanD documentation is updated.

jeffdaily avatar Apr 24 '24 16:04 jeffdaily

@xw285cornell has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Apr 30 '24 01:04 facebook-github-bot

@xw285cornell or @albanD, can we get this landed? The Meta Internal-Only Changes Check still needs to be performed. Hugging Face, among others, is depending on these changes. Thanks.

jeffdaily avatar May 20 '24 16:05 jeffdaily

@jeffdaily do you think the newly reported error is related? (It looks like it is.)

xw285cornell avatar May 23 '24 03:05 xw285cornell

> @jeffdaily do you think the newly reported error is related? (It looks like it is.)

I've looked at the failure log, and though the stack trace includes at::cuda::blas::gemm_internal_cublas, which was introduced in the first TunableOp PR, I don't see how the failure relates to this PR. Am I missing something in the error logs?

jeffdaily avatar May 28 '24 22:05 jeffdaily

Hmm, do you think this test is already broken on trunk because of the first TunableOp PR? Regardless, it may be a good idea to add "gemm_internal_cublas" to the expected error stack in the unit test so it passes.

Also, what do you think about changing the defaults for icache flush and rotating buffers to true?

xw285cornell avatar May 29 '24 07:05 xw285cornell

@pytorchbot rebase

xw285cornell avatar May 29 '24 08:05 xw285cornell

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot avatar May 29 '24 08:05 pytorchmergebot

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/124362/head returned non-zero exit code 1

Rebasing (1/25)
Rebasing (2/25)
Rebasing (3/25)
Rebasing (4/25)
Rebasing (5/25)
Rebasing (6/25)
Rebasing (7/25)
Rebasing (8/25)
Rebasing (9/25)
Rebasing (10/25)
Rebasing (11/25)
Rebasing (12/25)
Rebasing (13/25)
Rebasing (14/25)
Rebasing (15/25)
Rebasing (16/25)
Auto-merging aten/src/ATen/cuda/tunable/TunableGemm.h
CONFLICT (content): Merge conflict in aten/src/ATen/cuda/tunable/TunableGemm.h
error: could not apply a7d9ae9c528... add PYTORCH_TUNABLEOP_ROCBLAS_ENABLED
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Could not apply a7d9ae9c528... add PYTORCH_TUNABLEOP_ROCBLAS_ENABLED

Raised by https://github.com/pytorch/pytorch/actions/runs/9282691053

pytorchmergebot avatar May 29 '24 08:05 pytorchmergebot

@xw285cornell the failing CUDA tests have been seen before; they're not due to this PR or the prior TunableOp PR. https://github.com/pytorch/pytorch/issues/103369

jeffdaily avatar May 29 '24 21:05 jeffdaily

Discussed offline; it looks like a pre-existing flaky test. @jeffdaily will also add one more commit switching the defaults for rotating buffers and icache flush to true.

xw285cornell avatar May 30 '24 01:05 xw285cornell

@xw285cornell has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Jun 03 '24 20:06 facebook-github-bot

@pytorchbot merge

jeffdaily avatar Jun 03 '24 21:06 jeffdaily

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here

pytorchmergebot avatar Jun 03 '24 21:06 pytorchmergebot

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / macos-13-py3-arm64 / build

Details for Dev Infra team · Raised by workflow job

pytorchmergebot avatar Jun 03 '24 21:06 pytorchmergebot

@pytorchbot merge -f "unrelated build failure on mac"

jeffdaily avatar Jun 03 '24 22:06 jeffdaily

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here

pytorchmergebot avatar Jun 03 '24 22:06 pytorchmergebot

@jeffdaily I have what appears to be a memory leak when using TunableOp. Notably, I am not using PYTORCH_TUNABLEOP_NUMERICAL_CHECK=1 (see https://github.com/pytorch/pytorch/pull/129281).

Essentially, after running a few forward passes, the available memory with PYTORCH_TUNABLEOP_ENABLED=1 is significantly lower than with PYTORCH_TUNABLEOP_ENABLED=0. Is this expected?

Edit: explicitly setting PYTORCH_TUNABLEOP_NUMERICAL_CHECK=0 greatly reduces the leak (a very small one remains; note I am not using the fix from https://github.com/pytorch/pytorch/pull/129281). This is surprising, as the docs say it already defaults to 0: https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/cuda/tunable#environment-variable-interface
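
As a rough way to check, one might compare free device memory over repeated forward passes (a minimal sketch, assuming PYTORCH_TUNABLEOP_ENABLED is toggled via the environment for two separate runs):

```python
import torch

def free_mib() -> float:
    free_bytes, _total = torch.cuda.mem_get_info()
    return free_bytes / 2**20

model = torch.nn.Linear(4096, 4096, device="cuda", dtype=torch.float16)
x = torch.randn(64, 4096, device="cuda", dtype=torch.float16)

# Run once with PYTORCH_TUNABLEOP_ENABLED=0 and once with =1 (separate processes)
# and compare the traces: a true leak keeps dropping every step, while one-time
# drops after new kernels are selected are expected (see the reply below).
for step in range(20):
    with torch.no_grad():
        _ = model(x)
    torch.cuda.synchronize()
    print(f"step {step:2d}: {free_mib():.1f} MiB free")
```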

fxmarty avatar Jun 28 '24 12:06 fxmarty

#129281 fixes two things. First, the numeric check feature was changed to off by default, but the env var parsing wasn't updated to allow you to turn it back on. Second, it fixes the memory leak when numeric check is in use. Please use the hotfix; I was not able to detect any further memory leak after applying it.

Even with the hotfix, you might be seeing normal behavior that looks like a memory leak. Any time a new kernel is run, its code object has to be copied into device memory. So if you have a diverse workload, such as trying every possible GEMM kernel and picking the fastest, you'll see more memory pressure simply because kernels take up device memory.
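
To illustrate that point, a minimal sketch showing the one-time drop in free device memory around the first launch of a new kernel (code object load plus any BLAS workspace), as opposed to a recurring leak:

```python
import torch

def free_mib() -> float:
    return torch.cuda.mem_get_info()[0] / 2**20

a = torch.randn(2048, 2048, device="cuda")
b = torch.randn(2048, 2048, device="cuda")
out = torch.empty(2048, 2048, device="cuda")   # preallocate output so only kernel loading varies
torch.cuda.synchronize()

before = free_mib()
torch.matmul(a, b, out=out)   # first launch: GEMM code object (and workspace) may be loaded
torch.cuda.synchronize()
after_first = free_mib()

torch.matmul(a, b, out=out)   # reuses the already-loaded code object
torch.cuda.synchronize()
after_second = free_mib()

print(f"one-time drop on first launch : {before - after_first:.1f} MiB")
print(f"drop on second launch         : {after_first - after_second:.1f} MiB (expected ~0)")
```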

jeffdaily avatar Jun 28 '24 16:06 jeffdaily

Thanks a lot @jeffdaily

fxmarty avatar Jul 01 '24 08:07 fxmarty