
Backport: Enable MI355X PyTorch CI testing (#158889)

Open lamikr opened this issue 5 months ago • 1 comments

Original patch from saienduri.

This PR consists of all the changes required to enable PyTorch ROCm CI on MI355X nodes.

  • Rework the aotriton CMake configuration to rely on HIP_VERSION instead of ROCM_VERSION, since aotriton depends on HIP. The HIP version loosely tracks the ROCm major version, but the two are not actually synchronized, as observed in the ROCm 7 alpha build.
  • Bump the composable-kernel submodule to df6023e305f389bbf7249b0c4414e649f3ad6598 for MI350 compatibility.
  • Extend the "change docker permissions" step to the MI355X runners as well. This step applies the required permission change to the test folder so that artifacts upload successfully in k8s Docker.
  • Create a new rocm-mi355 workflow to trigger core PyTorch tests nightly at 2:30 am PST.
  • Successfully tested running the test suites listed in rocm-mi355.yml on MI355 runners by temporarily hacking rocm-mi300.yml: https://hud.pytorch.org/pytorch/pytorch/commit/ca7d5fae112558ee3dde7ec3ce32e94b13f877fd#rocm-mi300
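
For reference on the nightly trigger listed above: GitHub Actions `schedule` triggers take cron expressions in UTC, so a 2:30 am PST start has to be converted before it goes into the workflow file. The following is a minimal sketch of that conversion, assuming a fixed UTC-8 offset for PST (no daylight saving). The helper name `nightly_cron_utc` is hypothetical, and the actual expression used in rocm-mi355.yml is not shown in this description.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "PST" means a fixed UTC-8 offset, with no daylight-saving shift.
PST = timezone(timedelta(hours=-8), "PST")


def nightly_cron_utc(hour, minute, tz=PST):
    """Return a daily cron expression in UTC for a local wall-clock time.

    Hypothetical helper for illustration only; it is not part of the PR.
    """
    # The date is arbitrary because the offset is fixed.
    local = datetime(2025, 1, 1, hour, minute, tzinfo=tz)
    utc = local.astimezone(timezone.utc)
    # Cron field order: minute hour day-of-month month day-of-week
    return f"{utc.minute} {utc.hour} * * *"


if __name__ == "__main__":
    print(nightly_cron_utc(2, 30))
```

With a fixed UTC-8 offset, 2:30 am local time corresponds to 10:30 UTC. During daylight saving time the same cron entry would fire at 3:30 am Pacific local time.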

Unlike the original patch, this version does not change __AOTRITON_SHA256_LIST for ROCm 6.5. (Changing it would cause a SHA256 error at build time.)
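
To illustrate why a changed hash list breaks the build: a pinned SHA256 is compared against the digest of the downloaded archive, and any mismatch aborts the step. The sketch below shows that kind of check in Python. It is an illustration only; the real check presumably happens inside the CMake build, and the function names `sha256_of` and `verify_archive` are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected_sha256):
    """Raise ValueError if the file's digest differs from the pinned value.

    Hypothetical helper: it mirrors what a pinned-hash download check does,
    not the actual aotriton build logic.
    """
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(
            f"SHA256 mismatch for {path}: expected {expected_sha256}, got {actual}"
        )
    return actual
```

If the pinned list is updated but the archive that actually gets downloaded for a given ROCm version is unchanged, the comparison fails in exactly this way.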

Fixes #2411

lamikr, Jul 25 '25 02:07

Jenkins build for commit a777f31d8b922de52af9a2a55075ca517d901846 finished as FAILURE. Links: Blue Ocean view / Build artifacts