FP8 rowwise scaling
Summary
Still need to figure out this symbol.
The current workaround is to set `LD_PRELOAD=/usr/lib64/libcuda.so`. The lazyNVRTC approach should be the correct one, but I'm still hitting the same issue.
Not sure why the base symbol is still being added:
❯ nm /home/drisspg/meta/pytorch/torch/lib/libtorch_cuda.so | grep cuT
U cuTensorMapEncodeTiled
0000000000f6d670 t _ZN2at4cuda6detail6_stubs22cuTensorMapEncodeTiledEP14CUtensorMap_st24CUtensorMapDataType_enumjPvPKmS8_PKjSA_26CUtensorMapInterleave_enum23CUtensorMapSwizzle_enum27CUtensorMapL2promotion_enum28CUtensorMapFloatOOBfill_enum
0000000000d25e47 t _ZN2at4cuda6detail6_stubs22cuTensorMapEncodeTiledEP14CUtensorMap_st24CUtensorMapDataType_enumjPvPKmS8_PKjSA_26CUtensorMapInterleave_enum23CUtensorMapSwizzle_enum27CUtensorMapL2promotion_enum28CUtensorMapFloatOOBfill_enum.cold
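For anyone skimming: the `LD_PRELOAD` workaround above presumably just forces `libcuda.so` to be loaded up front so the undefined `cuTensorMapEncodeTiled` reference resolves, while the lazyNVRTC approach avoids any link-time dependency by resolving the driver entry point at first use. Below is a minimal sketch of that lazy-binding pattern, illustrative only and not PyTorch's actual stub code; `resolve_driver_symbol` and `lazy_cuDriverGetVersion` are made-up names, and a simpler driver function is used so the example compiles without CUDA headers.

```cpp
// Illustrative sketch of a lazily bound driver-API wrapper -- not PyTorch's
// actual lazyNVRTC code. Key properties:
//   * no link-time dependency on libcuda: the symbol is looked up at first use;
//   * the wrapper has its own name and internal linkage, so this library never
//     defines or imports a symbol literally named like the driver entry point.
#include <dlfcn.h>
#include <cstdio>

namespace {

// Look up a driver symbol, preferring an already-loaded libcuda.
void* resolve_driver_symbol(const char* name) {
  void* handle = dlopen("libcuda.so.1", RTLD_LAZY | RTLD_NOLOAD);
  if (handle == nullptr) {
    handle = dlopen("libcuda.so.1", RTLD_LAZY);
  }
  return handle != nullptr ? dlsym(handle, name) : nullptr;
}

// cuDriverGetVersion is used here only so the sketch compiles without cuda.h;
// CUresult is an int-sized enum, so a plain int stands in for it.
using cuDriverGetVersion_t = int (*)(int*);

int lazy_cuDriverGetVersion(int* version) {
  static auto fn = reinterpret_cast<cuDriverGetVersion_t>(
      resolve_driver_symbol("cuDriverGetVersion"));
  return fn != nullptr ? fn(version) : -1;  // a real stub would raise a proper error
}

}  // namespace

int main() {
  int version = 0;
  if (lazy_cuDriverGetVersion(&version) == 0) {
    std::printf("CUDA driver version: %d\n", version);
  } else {
    std::printf("libcuda not available; the lazy stub would report an error\n");
  }
  // The same lookup works for the symbol under discussion; only the
  // function-pointer type (taken from cuda.h) would differ.
  std::printf("cuTensorMapEncodeTiled -> %p\n",
              resolve_driver_symbol("cuTensorMapEncodeTiled"));
  return 0;
}
```

Keeping the wrapper's name distinct and its linkage internal is also what prevents it from shadowing the driver's exported symbol, which is the problem discussed further down.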
:link: Helpful Links
:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/125204
- :page_facing_up: Preview Python docs built from this PR
- :page_facing_up: Preview C++ docs built from this PR
- :question: Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours
Note: Links to docs will display an error until the docs builds have been completed.
:white_check_mark: You can merge normally! (31 Unrelated Failures)
As of commit 44483972bdd3dcd0c047020694817210846b5d70 with merge base 44483972bdd3dcd0c047020694817210846b5d70:
BROKEN TRUNK - The following jobs failed but were present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures
- windows-binary-conda / conda-py3_10-cuda11_8-test (gh) (trunk failure)
  Process completed with exit code 1.
- windows-binary-conda / conda-py3_10-cuda12_1-test (gh) (trunk failure)
  Process completed with exit code 1.
- windows-binary-conda / conda-py3_10-cuda12_4-test (gh) (trunk failure)
  Process completed with exit code 1.
- windows-binary-conda / conda-py3_11-cuda11_8-test (gh) (trunk failure)
  Process completed with exit code 1.
- windows-binary-conda / conda-py3_11-cuda12_1-test (gh) (trunk failure)
  Process completed with exit code 1.
- windows-binary-conda / conda-py3_11-cuda12_4-test (gh) (trunk failure)
  Process completed with exit code 1.
- windows-binary-conda / conda-py3_12-cuda11_8-test (gh) (trunk failure)
  Process completed with exit code 1.
- windows-binary-conda / conda-py3_12-cuda12_1-test (gh) (trunk failure)
  Process completed with exit code 1.
- windows-binary-conda / conda-py3_12-cuda12_4-test (gh) (trunk failure)
  Process completed with exit code 1.
- windows-binary-conda / conda-py3_8-cuda11_8-test (gh) (trunk failure)
  Process completed with exit code 1.
- windows-binary-conda / conda-py3_8-cuda12_1-test (gh) (trunk failure)
  Process completed with exit code 1.
- windows-binary-conda / conda-py3_8-cuda12_4-test (gh) (trunk failure)
  Process completed with exit code 1.
- windows-binary-conda / conda-py3_9-cuda11_8-test (gh) (trunk failure)
  Process completed with exit code 1.
- windows-binary-conda / conda-py3_9-cuda12_1-test (gh) (trunk failure)
  Process completed with exit code 1.
- windows-binary-conda / conda-py3_9-cuda12_4-test (gh) (trunk failure)
  Process completed with exit code 1.
- windows-binary-wheel / wheel-py3_10-cuda12_1-test (gh) (trunk failure)
  RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM
- windows-binary-wheel / wheel-py3_10-cuda12_4-test (gh) (trunk failure)
  RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM
- windows-binary-wheel / wheel-py3_11-cuda12_1-test (gh) (trunk failure)
  RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM
- windows-binary-wheel / wheel-py3_11-cuda12_4-test (gh) (trunk failure)
  RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM
- windows-binary-wheel / wheel-py3_12-cuda12_1-test (gh) (trunk failure)
- windows-binary-wheel / wheel-py3_12-cuda12_4-test (gh) (trunk failure)
  RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM
- windows-binary-wheel / wheel-py3_8-cuda12_1-build (gh) (trunk failure)
  No files were found with the provided path: C:\actions-runner\_work\_temp/artifacts. No artifacts will be uploaded.
- windows-binary-wheel / wheel-py3_8-cuda12_4-test (gh) (trunk failure)
  RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM
- windows-binary-wheel / wheel-py3_9-cuda12_1-test (gh) (trunk failure)
  RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM
- windows-binary-wheel / wheel-py3_9-cuda12_4-test (gh) (trunk failure)
  RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM
UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and have been marked as unstable:
- inductor / cuda12.1-py3.10-gcc9-sm86 / test (aot_inductor_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu, unstable) (gh) (#128903)
  ImportError: attempted relative import with no known parent package
- inductor / cuda12.1-py3.10-gcc9-sm86 / test (aot_inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu, unstable) (gh) (#128903)
  ImportError: attempted relative import with no known parent package
- inductor / cuda12.1-py3.10-gcc9-sm86 / test (dynamic_inductor_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu, unstable) (gh) (#128902)
  ImportError: attempted relative import with no known parent package
- inductor / cuda12.1-py3.10-gcc9-sm86 / test (dynamic_inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu, unstable) (gh) (#128902)
  ImportError: attempted relative import with no known parent package
- inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu, unstable) (gh) (#128901)
  ImportError: attempted relative import with no known parent package
- inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu, unstable) (gh) (#128901)
  ImportError: attempted relative import with no known parent package
This comment was automatically generated by Dr. CI and updates every 15 minutes.
❯ nm -C /home/drisspg/meta/pytorch/torch/lib/libtorch_cuda.so | grep cuT
0000000002561b90 T cuTensorMapEncodeTiled
0000000000ef1110 t at::cuda::detail::_stubs::cuTensorMapEncodeTiled(CUtensorMap_st*, CUtensorMapDataType_enum, unsigned int, void*, unsigned long const*, unsigned long const*, unsigned int const*, unsigned int const*, CUtensorMapInterleave_enum, CUtensorMapSwizzle_enum, CUtensorMapL2promotion_enum, CUtensorMapFloatOOBfill_enum)
0000000000cf4fa7 t at::cuda::detail::_stubs::cuTensorMapEncodeTiled(CUtensorMap_st*, CUtensorMapDataType_enum, unsigned int, void*, unsigned long const*, unsigned long const*, unsigned int const*, unsigned int const*, CUtensorMapInterleave_enum, CUtensorMapSwizzle_enum, CUtensorMapL2promotion_enum, CUtensorMapFloatOOBfill_enum) [clone .cold]
This symbol shadowing doesn't seem right.
After some preprocessor shenanigans I think I have it in a state that seems better, but I would love some feedback from packaging experts:
❯ nm -C /home/drisspg/meta/pytorch/torch/lib/libtorch_cuda.so | grep cuT;
0000000002561680 t nvrtc_cuTensorMapEncodeTiled(CUtensorMap_st*, CUtensorMapDataType_enum, unsigned int, void*, unsigned long const*, unsigned long const*, unsigned int const*, unsigned int const*, CUtensorMapInterleave_enum, CUtensorMapSwizzle_enum, CUtensorMapL2promotion_enum, CUtensorMapFloatOOBfill_enum) [clone .constprop.1]
0000000000ef10c0 t at::cuda::detail::_stubs::cuTensorMapEncodeTiled(CUtensorMap_st*, CUtensorMapDataType_enum, unsigned int, void*, unsigned long const*, unsigned long const*, unsigned int const*, unsigned int const*, CUtensorMapInterleave_enum, CUtensorMapSwizzle_enum, CUtensorMapL2promotion_enum, CUtensorMapFloatOOBfill_enum)
0000000000cf4f57 t at::cuda::detail::_stubs::cuTensorMapEncodeTiled(CUtensorMap_st*, CUtensorMapDataType_enum, unsigned int, void*, unsigned long const*, unsigned long const*, unsigned int const*, unsigned int const*, CUtensorMapInterleave_enum, CUtensorMapSwizzle_enum, CUtensorMapL2promotion_enum, CUtensorMapFloatOOBfill_enum) [clone .cold]
This is great! The API looks good; I'll defer to others for the cutlass part.
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours).
Learn more about merging in the wiki.
Questions? Feedback? Please reach out to the PyTorch DevX Team
Advanced Debugging
Check the merge workflow status here
@pytorchmergebot revert -c nosignal -m "Broke nightlies and internal tests"
Observing failure on Windows builds: https://github.com/pytorch/pytorch/actions/runs/9346105085/job/25720265925
C:/cb/pytorch_1000000000000/work/aten/src/ATen/../../../third_party/cutlass/include\cutlass/uint128.h(189): error: calling a __host__ function("_udiv128") from a __host__ __device__ function("cutlass::uint128_t::operator / const") is not allowed
1 error detected in the compilation of "C:/cb/pytorch_1000000000000/work/aten/src/ATen/native/cuda/RowwiseScaledMM.cu".
@pytorchbot successfully started a revert job. Check the current status here. Questions? Feedback? Please reach out to the PyTorch DevX Team
@drisspg your PR has been successfully reverted.
@Skylion007 I wonder if this was also updated in the newer version of Cutlass
C:/cb/pytorch_1000000000000/work/aten/src/ATen/../../../third_party/cutlass/include\cutlass/uint128.h(189): error: calling a __host__ function("_udiv128") from a __host__ __device__ function("cutlass::uint128_t::operator / const") is not allowed
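For context, the nvcc error above is the classic pattern of a `__host__ __device__` function unconditionally calling a host-only MSVC intrinsic (`_udiv128`); the usual fix is to keep the intrinsic on the host compilation path only. A rough illustration of the pattern and the guard follows (compiled as a .cu file with nvcc); this is not the actual cutlass/uint128.h code, and the fallback branch is deliberately simplified.

```cpp
// Illustration only -- not cutlass/uint128.h. _udiv128 is a host intrinsic on
// MSVC, so a __host__ __device__ function may only reach it on the host path.
#include <cstdint>
#if defined(_MSC_VER) && !defined(__CUDA_ARCH__)
#include <intrin.h>
#endif

struct u128 {
  uint64_t lo;
  uint64_t hi;
};

__host__ __device__ inline uint64_t divide_by_u64(u128 x, uint64_t divisor) {
#if defined(_MSC_VER) && !defined(__CUDA_ARCH__)
  // Host path on MSVC: the intrinsic is allowed because this branch is never
  // compiled for the device.
  uint64_t remainder = 0;
  return _udiv128(x.hi, x.lo, divisor, &remainder);
#else
  // Device path (and non-MSVC hosts): a real implementation would do full
  // 128-bit division; this sketch only handles the hi == 0 case.
  return x.hi == 0 ? x.lo / divisor : 0;
#endif
}

int main() {
  u128 x{84, 0};
  return divide_by_u64(x, 2) == 42 ? 0 : 1;  // exits 0 when the division works
}
```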
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours).
Learn more about merging in the wiki.
Questions? Feedback? Please reach out to the PyTorch DevX Team
Advanced Debugging
Check the merge workflow status here
Merge failed
Reason: 5 jobs have failed, first few of them are: linux-aarch64-binary-manywheel / manywheel-py3_11-cuda-aarch64-build / build, linux-aarch64-binary-manywheel / manywheel-py3_12-cuda-aarch64-build / build, linux-aarch64-binary-manywheel / manywheel-py3_9-cuda-aarch64-build / build, linux-aarch64-binary-manywheel / manywheel-py3_10-cuda-aarch64-build / build, linux-aarch64-binary-manywheel / manywheel-py3_8-cuda-aarch64-build / build
Details for Dev Infra team
Raised by workflow job
@pytorchbot -i
❌ 🤖 pytorchbot command failed:
@pytorchbot: error: unrecognized arguments: -i
usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick,close} ...
Try @pytorchbot --help for more info.
@pytorchbot merge -i
Merge started
Your change will be merged while ignoring the following 5 checks: linux-aarch64-binary-manywheel / manywheel-py3_11-cuda-aarch64-build / build, linux-aarch64-binary-manywheel / manywheel-py3_12-cuda-aarch64-build / build, linux-aarch64-binary-manywheel / manywheel-py3_9-cuda-aarch64-build / build, linux-aarch64-binary-manywheel / manywheel-py3_10-cuda-aarch64-build / build, linux-aarch64-binary-manywheel / manywheel-py3_8-cuda-aarch64-build / build
Learn more about merging in the wiki.
Questions? Feedback? Please reach out to the PyTorch DevX Team
Advanced Debugging
Check the merge workflow status here
@pytorchbot merge -f "I don't think these failures are related"
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command. For more information see the pytorch-bot wiki.
Merge started
Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.
Learn more about merging in the wiki.
Questions? Feedback? Please reach out to the PyTorch DevX Team
Advanced Debugging
Check the merge workflow status here
@pytorchmergebot revert -c ghfirst -m "Sorry, need to revert this, it is failing on internal CI. I suggest reimporting this and trying to land it internally, resolving all issues"
@pytorchbot successfully started a revert job. Check the current status here. Questions? Feedback? Please reach out to the PyTorch DevX Team
@drisspg your PR has been successfully reverted.
@drisspg has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@drisspg how should we resolve this for now on the extension side? <ATen/cuda/nvrtc_stub/ATenNVRTC.h> cannot be used by C++ extensions.