AOTInductor cpp_wrapper: fix output code interception
Stack from ghstack (oldest at bottom):
- #140620
- #141176
- #141175
- -> #141174
Ensure that only the second (final) run of output code generation on GPU actually gets returned. This fixes cases where callers assume a single forward and backward pass's worth of code.
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov
:heavy_exclamation_mark: 1 Active SEV
There is 1 currently active SEV. If your PR is affected, please view it below:
:x: 1 New Failure
As of commit 645e38bed04dc65cfcabcf407a30af67d2264ebd with merge base 740d1eb0306f1f9d0ce81ea81f287a6b52738fab:
NEW FAILURE - The following job has failed:
- inductor / unit-test / cuda12.1-py3.10-gcc9-sm86 / test (inductor_cpp_wrapper, 1, 1, linux.g5.4xlarge.nvidia.gpu) (gh)
inductor/test_torchinductor.py::GPUTests::test_conv_inference_heuristics_cuda
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@desertfire I'm not entirely sure whether this passes CI yet (new failures keep popping up), but I wanted your input on the approach of this PR. It seems to be six of one, half a dozen of the other. Either:
a) we only log the output code from the final run of the GPU cpp_wrapper codegen, and then have to update all the tests checking for triton-specific code in the output, or
b) we log the output code from both runs, and then have to update all places that assume only a single kernel's worth of forward and backward pass code will be returned.
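To make option (a) concrete, here is a minimal sketch of what intercepting only the final codegen pass could look like. This is illustrative only: the names (`run_codegen_pass`, `NUM_CPP_WRAPPER_PASSES`, `generate_output_code`) are hypothetical stand-ins, not the actual `torch._inductor` API, and the real two-pass cpp_wrapper flow is considerably more involved.

```python
# Hypothetical sketch of option (a): GPU cpp_wrapper codegen runs twice
# (first pass emitting the Triton/Python wrapper, second pass emitting the
# C++ wrapper), and only the final pass's output code is recorded, so
# callers that expect a single FW + BW pass's worth of code see one entry.
# None of these names come from the actual torch._inductor implementation.

NUM_CPP_WRAPPER_PASSES = 2  # GPU cpp_wrapper compiles in two passes


def run_codegen_pass(graph_name: str, pass_idx: int) -> str:
    # Stand-in for one compilation pass; the real codegen lives in
    # torch._inductor and produces far more than a one-line string.
    kind = "triton" if pass_idx == 0 else "cpp"
    return f"// {kind} wrapper code for {graph_name}"


def generate_output_code(graph_name: str) -> str:
    output = None
    for i in range(NUM_CPP_WRAPPER_PASSES):
        code = run_codegen_pass(graph_name, i)
        # Only intercept the code from the final pass; the first pass's
        # Triton-flavored output is discarded rather than logged.
        if i == NUM_CPP_WRAPPER_PASSES - 1:
            output = code
    return output
```

The trade-off described above follows directly: with this approach, tests that grep the returned output for Triton-specific code would need updating, since only the C++ wrapper from the final pass survives.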
I am recycling my old PR for the one-pass implementation. I think you can work on other issues while waiting for my PR to land. I will link my PR here when it's ready.
@desertfire Sounds good! I'll rebase this out of the stack, and hopefully everything else will pass.