pytorch
Backport: Enable MI355X PyTorch CI testing (#158889)
Original patch from saienduri.
This PR consists of all the changes required to enable PyTorch ROCm CI on MI355X nodes.
- Rework aotriton cmake configuration to rely on HIP_VERSION instead of ROCM_VERSION, as aotriton depends on HIP. HIP loosely tracks the ROCm major version, but the two are not actually synchronized, as observed in the ROCm 7 alpha build.
- Bump composable-kernel submodule to df6023e305f389bbf7249b0c4414e649f3ad6598 for MI350 compatibility.
- Extend the "change docker permissions" step to the MI355X runners as well. This step applies the required permission change to the test folder so that artifacts upload successfully in k8s docker.
- Create new rocm-mi355 workflow to trigger core PyTorch tests on a nightly basis at 2:30 am PST.
- Successfully tested running the test suites listed in rocm-mi355.yml on MI355 runners by temporarily hacking rocm-mi300.yml: https://hud.pytorch.org/pytorch/pytorch/commit/ca7d5fae112558ee3dde7ec3ce32e94b13f877fd#rocm-mi300
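GitHub Actions evaluates `schedule` cron expressions in UTC, so a nightly 2:30 am PST trigger has to be converted before it is written into a workflow file. The following is a minimal sketch of that conversion, assuming PST means a fixed UTC-8 offset with no daylight-saving adjustment. The helper name `utc_cron_for_daily_time` is hypothetical, and the value actually committed in rocm-mi355.yml is not shown in this description.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "PST" is a fixed UTC-8 offset; daylight saving time is ignored.
PST = timezone(timedelta(hours=-8), name="PST")


def utc_cron_for_daily_time(hour: int, minute: int, tz: timezone) -> str:
    """Return a five-field daily cron expression in UTC for a local time of day."""
    # The calendar date is arbitrary; only the time of day matters for a daily job.
    local = datetime(2025, 1, 1, hour, minute, tzinfo=tz)
    utc = local.astimezone(timezone.utc)
    return f"{utc.minute} {utc.hour} * * *"


print(utc_cron_for_daily_time(2, 30, PST))  # prints: 30 10 * * *
```

Under this assumption the job fires at 10:30 UTC every day. During daylight saving time the same expression corresponds to 3:30 am Pacific.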
Unlike the original patch, this version does not change the __AOTRITON_SHA256_LIST entry for ROCm 6.5. (Changing it would cause a sha256 error at build time.)
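The sha256 concern can be illustrated with a small, generic sketch: a build that pins a digest for a downloaded archive fails as soon as the pinned value and the archive contents disagree. This is not the aotriton cmake logic itself. The function `sha256_matches` and the byte string standing in for an archive are hypothetical, chosen only to show the comparison.

```python
import hashlib


def sha256_matches(payload: bytes, pinned_hex: str) -> bool:
    """Compare the SHA-256 digest of payload against a pinned hex digest."""
    return hashlib.sha256(payload).hexdigest() == pinned_hex.lower()


# Stand-in for a downloaded release archive (hypothetical contents).
archive = b"example archive contents"

# The digest recorded when the archive was pinned.
pinned = hashlib.sha256(archive).hexdigest()
print(sha256_matches(archive, pinned))  # True: the pinned digest still matches

# Editing the pinned entry without changing the archive breaks verification.
edited_pin = "0" * 64
print(sha256_matches(archive, edited_pin))  # False: a build would stop here
```

Leaving the pinned entry untouched keeps the digest consistent with the archive the build actually downloads.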
Fixes #2411
Jenkins build for commit a777f31d8b922de52af9a2a55075ca517d901846 finished as FAILURE.