FP8 rowwise scaling
Summary
Still need to figure out this symbol.
The current workaround is to set `LD_PRELOAD=/usr/lib64/libcuda.so`. The lazyNVRTC approach should be the correct one, but I'm still hitting the same issue.
Not sure why the base symbol is still being added:
❯ nm /home/drisspg/meta/pytorch/torch/lib/libtorch_cuda.so | grep cuT
U cuTensorMapEncodeTiled
0000000000f6d670 t _ZN2at4cuda6detail6_stubs22cuTensorMapEncodeTiledEP14CUtensorMap_st24CUtensorMapDataType_enumjPvPKmS8_PKjSA_26CUtensorMapInterleave_enum23CUtensorMapSwizzle_enum27CUtensorMapL2promotion_enum28CUtensorMapFloatOOBfill_enum
0000000000d25e47 t _ZN2at4cuda6detail6_stubs22cuTensorMapEncodeTiledEP14CUtensorMap_st24CUtensorMapDataType_enumjPvPKmS8_PKjSA_26CUtensorMapInterleave_enum23CUtensorMapSwizzle_enum27CUtensorMapL2promotion_enum28CUtensorMapFloatOOBfill_enum.cold
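For anyone skimming: the `LD_PRELOAD` workaround above presumably just forces `libcuda.so` to be loaded up front so the undefined `cuTensorMapEncodeTiled` reference resolves, while the lazyNVRTC approach avoids any link-time dependency by resolving the driver entry point at first use. Below is a minimal sketch of that lazy-binding pattern, illustrative only and not PyTorch's actual stub code; `resolve_driver_symbol` and `lazy_cuDriverGetVersion` are made-up names, and a simpler driver function is used so the example compiles without CUDA headers.

```cpp
// Illustrative sketch of a lazily bound driver-API wrapper -- not PyTorch's
// actual lazyNVRTC code. Key properties:
//   * no link-time dependency on libcuda: the symbol is looked up at first use;
//   * the wrapper has its own name and internal linkage, so this library never
//     defines or imports a symbol literally named like the driver entry point.
#include <dlfcn.h>
#include <cstdio>

namespace {

// Look up a driver symbol, preferring an already-loaded libcuda.
void* resolve_driver_symbol(const char* name) {
  void* handle = dlopen("libcuda.so.1", RTLD_LAZY | RTLD_NOLOAD);
  if (handle == nullptr) {
    handle = dlopen("libcuda.so.1", RTLD_LAZY);
  }
  return handle != nullptr ? dlsym(handle, name) : nullptr;
}

// cuDriverGetVersion is used here only so the sketch compiles without cuda.h;
// CUresult is an int-sized enum, so a plain int stands in for it.
using cuDriverGetVersion_t = int (*)(int*);

int lazy_cuDriverGetVersion(int* version) {
  static auto fn = reinterpret_cast<cuDriverGetVersion_t>(
      resolve_driver_symbol("cuDriverGetVersion"));
  return fn != nullptr ? fn(version) : -1;  // a real stub would raise a proper error
}

}  // namespace

int main() {
  int version = 0;
  if (lazy_cuDriverGetVersion(&version) == 0) {
    std::printf("CUDA driver version: %d\n", version);
  } else {
    std::printf("libcuda not available; the lazy stub would report an error\n");
  }
  // The same lookup works for the symbol under discussion; only the
  // function-pointer type (taken from cuda.h) would differ.
  std::printf("cuTensorMapEncodeTiled -> %p\n",
              resolve_driver_symbol("cuTensorMapEncodeTiled"));
  return 0;
}
```

Keeping the wrapper's name distinct and its linkage internal is also what prevents it from shadowing the driver's exported symbol, which is the problem discussed further down.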
:link: Helpful Links
:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/125204
- :page_facing_up: Preview Python docs built from this PR
- :page_facing_up: Preview C++ docs built from this PR
- :question: Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours
Note: Links to docs will display an error until the docs builds have been completed.
:white_check_mark: You can merge normally! (31 Unrelated Failures)
As of commit 44483972bdd3dcd0c047020694817210846b5d70 with merge base 44483972bdd3dcd0c047020694817210846b5d70:
BROKEN TRUNK - The following jobs failed but were present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures
- windows-binary-conda / conda-py3_10-cuda11_8-test (gh) (trunk failure)
  Process completed with exit code 1.
- windows-binary-conda / conda-py3_10-cuda12_1-test (gh) (trunk failure)
  Process completed with exit code 1.
- windows-binary-conda / conda-py3_10-cuda12_4-test (gh) (trunk failure)
  Process completed with exit code 1.
- windows-binary-conda / conda-py3_11-cuda11_8-test (gh) (trunk failure)
  Process completed with exit code 1.
- windows-binary-conda / conda-py3_11-cuda12_1-test (gh) (trunk failure)
  Process completed with exit code 1.
- windows-binary-conda / conda-py3_11-cuda12_4-test (gh) (trunk failure)
  Process completed with exit code 1.
- windows-binary-conda / conda-py3_12-cuda11_8-test (gh) (trunk failure)
  Process completed with exit code 1.
- windows-binary-conda / conda-py3_12-cuda12_1-test (gh) (trunk failure)
  Process completed with exit code 1.
- windows-binary-conda / conda-py3_12-cuda12_4-test (gh) (trunk failure)
  Process completed with exit code 1.
- windows-binary-conda / conda-py3_8-cuda11_8-test (gh) (trunk failure)
  Process completed with exit code 1.
- windows-binary-conda / conda-py3_8-cuda12_1-test (gh) (trunk failure)
  Process completed with exit code 1.
- windows-binary-conda / conda-py3_8-cuda12_4-test (gh) (trunk failure)
  Process completed with exit code 1.
- windows-binary-conda / conda-py3_9-cuda11_8-test (gh) (trunk failure)
  Process completed with exit code 1.
- windows-binary-conda / conda-py3_9-cuda12_1-test (gh) (trunk failure)
  Process completed with exit code 1.
- windows-binary-conda / conda-py3_9-cuda12_4-test (gh) (trunk failure)
  Process completed with exit code 1.
- windows-binary-wheel / wheel-py3_10-cuda12_1-test (gh) (trunk failure)
  RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM
- windows-binary-wheel / wheel-py3_10-cuda12_4-test (gh) (trunk failure)
  RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM
- windows-binary-wheel / wheel-py3_11-cuda12_1-test (gh) (trunk failure)
  RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM
- windows-binary-wheel / wheel-py3_11-cuda12_4-test (gh) (trunk failure)
  RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM
- windows-binary-wheel / wheel-py3_12-cuda12_1-test (gh) (trunk failure)
- windows-binary-wheel / wheel-py3_12-cuda12_4-test (gh) (trunk failure)
  RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM
- windows-binary-wheel / wheel-py3_8-cuda12_1-build (gh) (trunk failure)
  No files were found with the provided path: C:\actions-runner\_work\_temp/artifacts. No artifacts will be uploaded.
- windows-binary-wheel / wheel-py3_8-cuda12_4-test (gh) (trunk failure)
  RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM
- windows-binary-wheel / wheel-py3_9-cuda12_1-test (gh) (trunk failure)
  RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM
- windows-binary-wheel / wheel-py3_9-cuda12_4-test (gh) (trunk failure)
  RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM
UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and have been marked as unstable:
- inductor / cuda12.1-py3.10-gcc9-sm86 / test (aot_inductor_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu, unstable) (gh) (#128903)
  ImportError: attempted relative import with no known parent package
- inductor / cuda12.1-py3.10-gcc9-sm86 / test (aot_inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu, unstable) (gh) (#128903)
  ImportError: attempted relative import with no known parent package
- inductor / cuda12.1-py3.10-gcc9-sm86 / test (dynamic_inductor_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu, unstable) (gh) (#128902)
  ImportError: attempted relative import with no known parent package
- inductor / cuda12.1-py3.10-gcc9-sm86 / test (dynamic_inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu, unstable) (gh) (#128902)
  ImportError: attempted relative import with no known parent package
- inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu, unstable) (gh) (#128901)
  ImportError: attempted relative import with no known parent package
- inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu, unstable) (gh) (#128901)
  ImportError: attempted relative import with no known parent package
This comment was automatically generated by Dr. CI and updates every 15 minutes.
❯ nm -C /home/drisspg/meta/pytorch/torch/lib/libtorch_cuda.so | grep cuT
0000000002561b90 T cuTensorMapEncodeTiled
0000000000ef1110 t at::cuda::detail::_stubs::cuTensorMapEncodeTiled(CUtensorMap_st*, CUtensorMapDataType_enum, unsigned int, void*, unsigned long const*, unsigned long const*, unsigned int const*, unsigned int const*, CUtensorMapInterleave_enum, CUtensorMapSwizzle_enum, CUtensorMapL2promotion_enum, CUtensorMapFloatOOBfill_enum)
0000000000cf4fa7 t at::cuda::detail::_stubs::cuTensorMapEncodeTiled(CUtensorMap_st*, CUtensorMapDataType_enum, unsigned int, void*, unsigned long const*, unsigned long const*, unsigned int const*, unsigned int const*, CUtensorMapInterleave_enum, CUtensorMapSwizzle_enum, CUtensorMapL2promotion_enum, CUtensorMapFloatOOBfill_enum) [clone .cold]
This symbol shadowing doesn't seem right.
After some preprocessor shenanigans I think I have it in a state that seems better, but I would love some feedback from packaging experts:
❯ nm -C /home/drisspg/meta/pytorch/torch/lib/libtorch_cuda.so | grep cuT;
0000000002561680 t nvrtc_cuTensorMapEncodeTiled(CUtensorMap_st*, CUtensorMapDataType_enum, unsigned int, void*, unsigned long const*, unsigned long const*, unsigned int const*, unsigned int const*, CUtensorMapInterleave_enum, CUtensorMapSwizzle_enum, CUtensorMapL2promotion_enum, CUtensorMapFloatOOBfill_enum) [clone .constprop.1]
0000000000ef10c0 t at::cuda::detail::_stubs::cuTensorMapEncodeTiled(CUtensorMap_st*, CUtensorMapDataType_enum, unsigned int, void*, unsigned long const*, unsigned long const*, unsigned int const*, unsigned int const*, CUtensorMapInterleave_enum, CUtensorMapSwizzle_enum, CUtensorMapL2promotion_enum, CUtensorMapFloatOOBfill_enum)
0000000000cf4f57 t at::cuda::detail::_stubs::cuTensorMapEncodeTiled(CUtensorMap_st*, CUtensorMapDataType_enum, unsigned int, void*, unsigned long const*, unsigned long const*, unsigned int const*, unsigned int const*, CUtensorMapInterleave_enum, CUtensorMapSwizzle_enum, CUtensorMapL2promotion_enum, CUtensorMapFloatOOBfill_enum) [clone .cold]
This is great! The API looks good; I'll defer to others for the cutlass part.
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours).
Learn more about merging in the wiki.
Questions? Feedback? Please reach out to the PyTorch DevX Team
Advanced Debugging
Check the merge workflow status here
@pytorchmergebot revert -c nosignal -m "Broke nightlies and internal tests"
Observing failure on Windows builds: https://github.com/pytorch/pytorch/actions/runs/9346105085/job/25720265925
C:/cb/pytorch_1000000000000/work/aten/src/ATen/../../../third_party/cutlass/include\cutlass/uint128.h(189): error: calling a __host__ function("_udiv128") from a __host__ __device__ function("cutlass::uint128_t::operator / const") is not allowed
1 error detected in the compilation of "C:/cb/pytorch_1000000000000/work/aten/src/ATen/native/cuda/RowwiseScaledMM.cu".
@pytorchbot successfully started a revert job. Check the current status here. Questions? Feedback? Please reach out to the PyTorch DevX Team
@drisspg your PR has been successfully reverted.
@Skylion007 I wonder if this was also updated in the newer version of Cutlass
C:/cb/pytorch_1000000000000/work/aten/src/ATen/../../../third_party/cutlass/include\cutlass/uint128.h(189): error: calling a __host__ function("_udiv128") from a __host__ __device__ function("cutlass::uint128_t::operator / const") is not allowed
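For context, the nvcc error above is the classic pattern of a `__host__ __device__` function unconditionally calling a host-only MSVC intrinsic (`_udiv128`); the usual fix is to keep the intrinsic on the host compilation path only. A rough illustration of the pattern and the guard follows (compiled as a .cu file with nvcc); this is not the actual cutlass/uint128.h code, and the fallback branch is deliberately simplified.

```cpp
// Illustration only -- not cutlass/uint128.h. _udiv128 is a host intrinsic on
// MSVC, so a __host__ __device__ function may only reach it on the host path.
#include <cstdint>
#if defined(_MSC_VER) && !defined(__CUDA_ARCH__)
#include <intrin.h>
#endif

struct u128 {
  uint64_t lo;
  uint64_t hi;
};

__host__ __device__ inline uint64_t divide_by_u64(u128 x, uint64_t divisor) {
#if defined(_MSC_VER) && !defined(__CUDA_ARCH__)
  // Host path on MSVC: the intrinsic is allowed because this branch is never
  // compiled for the device.
  uint64_t remainder = 0;
  return _udiv128(x.hi, x.lo, divisor, &remainder);
#else
  // Device path (and non-MSVC hosts): a real implementation would do full
  // 128-bit division; this sketch only handles the hi == 0 case.
  return x.hi == 0 ? x.lo / divisor : 0;
#endif
}

int main() {
  u128 x{84, 0};
  return divide_by_u64(x, 2) == 42 ? 0 : 1;  // exits 0 when the division works
}
```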
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours).
Learn more about merging in the wiki.
Questions? Feedback? Please reach out to the PyTorch DevX Team
Advanced Debugging
Check the merge workflow status here
Merge failed
Reason: 5 jobs have failed, first few of them are: linux-aarch64-binary-manywheel / manywheel-py3_11-cuda-aarch64-build / build, linux-aarch64-binary-manywheel / manywheel-py3_12-cuda-aarch64-build / build, linux-aarch64-binary-manywheel / manywheel-py3_9-cuda-aarch64-build / build, linux-aarch64-binary-manywheel / manywheel-py3_10-cuda-aarch64-build / build, linux-aarch64-binary-manywheel / manywheel-py3_8-cuda-aarch64-build / build
Details for Dev Infra team
Raised by workflow job
@pytorchbot -i
❌ 🤖 pytorchbot command failed:
@pytorchbot: error: unrecognized arguments: -i
usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick,close} ...
Try @pytorchbot --help for more info.
@pytorchbot merge -i
Merge started
Your change will be merged while ignoring the following 5 checks: linux-aarch64-binary-manywheel / manywheel-py3_11-cuda-aarch64-build / build, linux-aarch64-binary-manywheel / manywheel-py3_12-cuda-aarch64-build / build, linux-aarch64-binary-manywheel / manywheel-py3_9-cuda-aarch64-build / build, linux-aarch64-binary-manywheel / manywheel-py3_10-cuda-aarch64-build / build, linux-aarch64-binary-manywheel / manywheel-py3_8-cuda-aarch64-build / build
Learn more about merging in the wiki.
Questions? Feedback? Please reach out to the PyTorch DevX Team
Advanced Debugging
Check the merge workflow status here
@pytorchbot merge -f "I don't think these failures are related"
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command. For more information see the pytorch-bot wiki.
Merge started
Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.
Learn more about merging in the wiki.
Questions? Feedback? Please reach out to the PyTorch DevX Team
Advanced Debugging
Check the merge workflow status here
@pytorchmergebot revert -c ghfirst -m "Sorry, need to revert this, it is failing on internal CI. I suggest reimporting this and trying to land it internally, resolving all issues"
@pytorchbot successfully started a revert job. Check the current status here. Questions? Feedback? Please reach out to the PyTorch DevX Team
@drisspg your PR has been successfully reverted.
@drisspg has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@drisspg how should we resolve this for now on the extension side? <ATen/cuda/nvrtc_stub/ATenNVRTC.h> cannot be used by C++ extensions.