[Build] MIGraphX CI completely gone for MIGraphX EP builds
Describe the issue
All of the related MIGraphX Linux/Windows CI has been removed from your coverage.
This was caught as part of: https://github.com/microsoft/onnxruntime/pull/25516
This was the result of the changes found here: https://github.com/microsoft/onnxruntime/pull/25418
Subsequent discussions mentioned the need for an AlmaLinux image to use. ROCm actively maintains one, found here:
https://hub.docker.com/r/rocm/dev-almalinux-8/tags
This image does not use Conda, in case that is required as a starting point for security reasons.
Urgency
Extremely urgent. The mainline build breaks for the MIGraphX EP during Windows-side integration.
We require test coverage. Linux integration for AMD systems into main will be halted.
Target platform
MI series cards
Build script
Internal build scripts. MIGraphX CI has been removed.
Error / output
Builds work on Windows but will break in Linux environments.
Visual Studio Version
No response
GCC / Compiler Version
No response
@snnn @nieubank if GPUs are an issue, we can arrange further GPUs for your environment and testing.
Please reach out if this is the issue. Let us know how we can mitigate things further and if you require additional pieces from AMD.
cc @jeffdaily - Something we should be raising as well. This affects our release and upstreaming of features/integration with Windows-side changes.
I am not able to get any AMD GPU currently, because AMD GPUs are in high demand. I will try to raise the request again.
I applied for https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/ndmi300xv5-series?tabs=sizebasic . Please let me know if there are other choices.
We also lack instructions on how to set up a ROCm Windows build.
Why do you need a Windows build for Linux CI?
Windows support should be here - https://rocm.docs.amd.com/projects/install-on-windows/en/latest/
Can we get the Linux CI up and running? Why was that removed?
I am not able to get any AMD GPU currently, because AMD GPUs are in high demand. I will try to raise the request again.
Right, but this CI was working until last week, when support was pulled?
@jeffmend is there anything further AMD needs so we can get CI back up and running?
I can't seem to find Meng Tang's handle to ping him as well. Looking to get a timeline and any other resources so we can get support back before reintroducing Windows builds.
Mainly, there are two things:
- We do not have AMD GPU machines. The only SKU I found in Azure is an 8-GPU MI300X that is for training purposes only. It's very popular, and I cannot get it. Even if we got some, they would be very expensive and we could not make good use of them, since ORT's build pipelines only use one GPU at a time. There might be some other SKUs that could work, but we will need your help to confirm/verify.
- The Dockerfile we used for the Linux build has several security and license issues. We need people to fix them. I have put the details in your last PR at https://github.com/microsoft/onnxruntime/pull/25338#issuecomment-3076634856

If we could get a Windows build to work, we wouldn't need to bother with Docker. And I think right now the integration work is mainly for Windows.
Right, but again, we still require the Linux CI build for Onnxruntime, and AMD has a publicly available AlmaLinux build hosted on Docker Hub, similar to the Nvidia cases you use. I still don't see why we've completely lost support for Linux builds here. We have all the components, and this affects customers we're currently supporting.
> Mainly, there are two things:
>
> 1. We do not have AMD GPU machines. The only SKU I found in Azure is an 8-GPU MI300X that is for training purposes only. It's very popular, and I cannot get it. Even if we got some, they would be very expensive and we could not make good use of them, since ORT's build pipelines only use one GPU at a time. There might be some other SKUs that could work, but we will need your help to confirm/verify.
> 2. The Dockerfile we used for the Linux build has several security and license issues. We need people to fix them. I have put the details in your last PR at [[MIGraphx EP] Sync AMD changes upstream #25338 (comment)](https://github.com/microsoft/onnxruntime/pull/25338#issuecomment-3076634856)
>
> If we could get a Windows build to work, we wouldn't need to bother with Docker. And I think right now the integration work is mainly for Windows.
We require both Windows and Linux builds if we want to ensure proper integration, as the Linux environments were supported previously. What you're essentially saying by removing CI is that *Nix doesn't matter anymore in favor of Windows, which disregards previous releases.
cc @skottmckay
@snnn any update on this?
Hey guys any update here?
I am still in an active discussion with our capacity team and engineering system team. They are helping me. But it seems that we are the first at Microsoft to have this need, so there is no paved road.
Is there anything further you require from AMD? We did have MIGraphX EP CI running before... I'm still wrapping my head around everything, to be honest. We're still not getting any CI coverage for AMD-related contributions.
Hey @snnn did you get any updates on this? Any way we can discuss or get a list of what's required here to get our CI reinstated?
Hey, just asking again, what's the status on this? Anything further needed from AMD? Is the message here that you've just removed CI without rhyme or reason? It's been since July and we haven't gotten any clarity or answer on this.
Hi @TedThemistokleous,
Thank you for your continued contributions; your partnership is genuinely appreciated.
Following up on the CI build issue, the ONNX Runtime team discussed this again last week. We are still finalizing the best approach and will let you know when we have an update.
Meanwhile, my request for MI300X GPUs was rejected as there is no capacity. As an alternative, does MIGraphX support the AMD Radeon™ Pro V710? That could be a more accessible option for us. If that is also not possible, we could add a pipeline that only compiles the code but does not run the tests.
One critical point I need to reiterate is that Windows support is our top priority, and we will focus more on the consumer market than on data center GPUs.
I will be on vacation till the end of this month. To make sure you get the support you need, please direct any future questions or PRs that need review to @devang-ml. He is the PoC for all EP-related work.
We look forward to continuing our collaboration.
P.S. I love AMD hardware. I personally have an AMD Radeon™ AI PRO R9700. I am eager to see ONNX Runtime unleash beast performance on AMD GPUs.
> Hi @TedThemistokleous,
> Thank you for your continued contributions; your partnership is genuinely appreciated.

Likewise to the feedback here.

> Following up on the CI build issue, the ONNX Runtime team discussed this again last week. We are still finalizing the best approach and will let you know when we have an update.
> Meanwhile, my request for MI300X GPUs was rejected as there is no capacity. As an alternative, does MIGraphX support the AMD Radeon™ Pro V710? That could be a more accessible option for us. If that is also not possible, we could add a pipeline that only compiles the code but does not run the tests.

Yes, actually we have an effort for ROCm on Radeon - https://rocm.docs.amd.com/projects/radeon-ryzen/en/latest/index.html
V710 is an absolutely acceptable use case here for development/testing/CI. We have various customers using V710 for their production work and bring-up of systems/clusters. If a feature works on the V710s, it is certainly required to function on MI series cards as well. They should be interchangeable.

> One critical point I need to reiterate is that Windows support is our top priority, and we will focus more on the consumer market than on data center GPUs.

No problem at all, we do have an effort that supports Windows. Our UAI work internally tests and confirms that for Windows Onnxruntime builds, and grabs MIGraphX-side changes to support the EP.
If you're building off MIGraphX develop or the appropriate release branches + Onnxruntime, that should give you sufficient coverage. MIGraphX builds are cut from our release/rocm-rel-X.Y branches, and all changes for OnnxRT builds are tested and pushed into rocmX.Y_internal_testing in rocm/Onnxruntime before I upstream them to your mainline. So if you select a ROCm version as your base container, apt install migraphx-dev will give you the blessed QA code from the corresponding release/rocm-rel-X.Y branch.
These should all build and work for Windows/Linux, as the code was previously refactored to support both. Same for the Onnxruntime side.
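
To make the branch naming above concrete, here is a minimal Python sketch that expands the X.Y placeholder into the two branch names mentioned. The function name `rocm_branch_names`, the dictionary keys, and the example version "6.4" are illustrative assumptions and not part of any AMD tooling; only the branch name patterns come from the comment above.

```python
def rocm_branch_names(rocm_version: str) -> dict:
    """Build the branch names for a ROCm release given as 'X.Y'.

    Hypothetical helper for illustration only. The two patterns
    (release/rocm-rel-X.Y and rocmX.Y_internal_testing) are taken from
    the discussion; everything else here is an assumption.
    """
    parts = rocm_version.split(".")
    # Accept only a plain major.minor version such as "6.4".
    if len(parts) != 2 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected a version of the form X.Y, got {rocm_version!r}")
    return {
        # MIGraphX release branch that the packaged builds are cut from.
        "migraphx_release_branch": f"release/rocm-rel-{rocm_version}",
        # Testing branch in rocm/Onnxruntime used before upstreaming.
        "onnxruntime_testing_branch": f"rocm{rocm_version}_internal_testing",
    }


if __name__ == "__main__":
    # "6.4" is only an example version, not a recommendation.
    for key, branch in rocm_branch_names("6.4").items():
        print(f"{key}: {branch}")
```

A CI job could use a mapping like this to pick the matching pair of branches for whichever ROCm base container it runs in.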

> I will be on vacation till the end of this month. To make sure you get the support you need, please direct any future questions or PRs that need review to @devang-ml. He is the PoC for all EP-related work.

Will do, appreciate the coverage. I have a few more changes coming down the pipe for us. I'll make sure to tag @devang-ml

> P.S. I love AMD hardware. I personally have an AMD Radeon™ AI PRO R9700. I am eager to see ONNX Runtime unleash beast performance on AMD GPUs.

<3
@tianleiwu @snnn @skottmckay @devang-ml @hisham-hchowdhu
I've opened this issue recently - https://github.com/microsoft/onnxruntime/issues/26801
This is directly related to not having any CI coverage.
These are changes from the Microsoft side that are continually breaking our builds when we take changes off mainline.