
[Build] MIGraphX CI completely gone for MIGraphX EP builds

Open TedThemistokleous opened this issue 5 months ago • 21 comments

Describe the issue

All of the related MIGraphX Linux/Windows CI has been removed from your coverage.

This was caught as part of: https://github.com/microsoft/onnxruntime/pull/25516

This was the result of the changes found here: https://github.com/microsoft/onnxruntime/pull/25418

Subsequent discussions mentioned the need for an AlmaLinux image. ROCm actively maintains one, found here:

https://hub.docker.com/r/rocm/dev-almalinux-8/tags

Conda is not used in this image, in case that is required as a starting point for security purposes.

Urgency

Extremely urgent. The mainline build breaks for the MIGraphX EP during Windows-side integration.

We require test coverage. Without it, Linux integration for AMD systems into main will be halted.

Target platform

MI series cards

Build script

Internal build scripts. MIGraphX CI has been removed.

Error / output

Builds work on Windows but will break in Linux environments.

Visual Studio Version

No response

GCC / Compiler Version

No response

TedThemistokleous avatar Jul 25 '25 02:07 TedThemistokleous

@snnn @nieubank if GPUs are an issue, we can arrange further GPUs for your environment and testing.

Please reach out if this is the issue. Let us know how we can mitigate things further and if you require additional pieces from AMD.

TedThemistokleous avatar Jul 25 '25 02:07 TedThemistokleous

cc @jeffdaily - Something we should be raising as well. This affects our release and the upstreaming of features/integration with Windows-side changes.

TedThemistokleous avatar Jul 25 '25 02:07 TedThemistokleous

I am not able to get any AMD GPU currently, because AMD GPUs are in high demand. I will try to raise the request again.

snnn avatar Jul 25 '25 02:07 snnn

I applied for https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/ndmi300xv5-series?tabs=sizebasic . Please let me know if there are other choices.

snnn avatar Jul 25 '25 02:07 snnn

And we lack instructions on how to set up a ROCm Windows build.

snnn avatar Jul 25 '25 03:07 snnn

Why do you need a Windows build for Linux CI?

TedThemistokleous avatar Jul 25 '25 18:07 TedThemistokleous

Windows support should be here - https://rocm.docs.amd.com/projects/install-on-windows/en/latest/

Can we get the Linux CI up and running? Why was that removed?

> I am not able to get any AMD GPU currently, because AMD GPUs are in high demand. I will try to raise the request again.

Right, but this CI was working until last week, when support was pulled?

TedThemistokleous avatar Jul 25 '25 18:07 TedThemistokleous

@jeffmend is there anything further AMD needs so we can get CI back up and running?

I can't seem to find Meng Tang's handle to ping him as well. Looking to get a timeline and any other resources so we can get support back before reintroducing Windows builds.

TedThemistokleous avatar Jul 25 '25 20:07 TedThemistokleous

Mainly, there are two things:

  1. We do not have AMD GPU machines. The only SKU I found in Azure is an 8-GPU MI300X that is for training purposes only. It's very popular. I cannot get it. Even if we got some, they would be very expensive and we could not make good use of them, since ORT's build pipelines only use one GPU at a time. There might be some other SKUs that could work, but we will need your help to confirm/verify.

  2. The Dockerfile we used for the Linux build has several security and license issues. We need people to fix them. I have put the details in your last PR at https://github.com/microsoft/onnxruntime/pull/25338#issuecomment-3076634856

If we could get a Windows build to work, we wouldn't need to bother with Docker. And I think right now the integration work is mainly for Windows.

snnn avatar Jul 25 '25 21:07 snnn

Right, but again, we still require the Linux CI build for ONNX Runtime, and AMD has a publicly available AlmaLinux image hosted on Docker Hub, similar to the NVIDIA ones you use. I still don't see why we've completely lost support for Linux builds here. We have all the components, and this affects customers we're currently supporting.

TedThemistokleous avatar Jul 28 '25 15:07 TedThemistokleous

> Mainly, there are two things:
>
> 1. We do not have AMD GPU machines. The only SKU I found in Azure is an 8-GPU MI300X that is for training purposes only. It's very popular. I cannot get it. Even if we got some, they would be very expensive and we could not make good use of them, since ORT's build pipelines only use one GPU at a time. There might be some other SKUs that could work, but we will need your help to confirm/verify.
>
> 2. The Dockerfile we used for the Linux build has several security and license issues. We need people to fix them. I have put the details in your last PR at [[MIGraphx EP] Sync AMD changes upstream #25338 (comment)](https://github.com/microsoft/onnxruntime/pull/25338#issuecomment-3076634856)
>
> If we could get a Windows build to work, we wouldn't need to bother with Docker. And I think right now the integration work is mainly for Windows.

We require both Windows and Linux builds if we want to ensure proper integration, as the Linux environments were supported previously. What you're essentially saying by removing CI is that *Nix no longer matters now that Windows is the focus, which disregards previous releases.

TedThemistokleous avatar Jul 28 '25 16:07 TedThemistokleous

cc @skottmckay

TedThemistokleous avatar Aug 07 '25 23:08 TedThemistokleous

@snnn any update on this?

TedThemistokleous avatar Aug 15 '25 22:08 TedThemistokleous

Hey guys any update here?

TedThemistokleous avatar Sep 19 '25 21:09 TedThemistokleous

I am still in active discussion with our capacity team and engineering system team. They are helping me. But it seems that we are the first at Microsoft to have this need, so there is no paved road.

snnn avatar Sep 19 '25 21:09 snnn

Is there anything further you require from AMD? We did have MIGraphX EP CI running before... I'm still wrapping my head around everything, to be honest. We're still not getting any CI coverage for AMD-related contributions.

TedThemistokleous avatar Sep 19 '25 22:09 TedThemistokleous

Hey @snnn, did you get any updates on this? Any way we can discuss or get a list of what's required here to get our CI reinstated?

TedThemistokleous avatar Sep 30 '25 02:09 TedThemistokleous

Hey, just asking again: what's the status on this? Anything further needed from AMD? Is the message here that you've just removed CI without rhyme or reason? It's been since July and we haven't gotten any clarity or an answer on this.

TedThemistokleous avatar Oct 28 '25 19:10 TedThemistokleous

Hi @TedThemistokleous ,

Thank you for your continued contributions; your partnership is genuinely appreciated.

Following up on the CI build issue, the ONNX Runtime team discussed this again last week. We are still finalizing the best approach and will let you know when we have an update.

Meanwhile, my request for MI300X GPUs was rejected as there is no capacity. As an alternative, does MIGraphX support the AMD Radeon™ Pro V710? That could be a more accessible option for us. If that is also not possible, we could add a pipeline that only compiles the code but does not run the tests.

One critical point I need to reiterate is that Windows support is our top priority, and we will focus more on the consumer market than on data center GPUs.

I will be on vacation till the end of this month. To make sure you get the support you need, please direct any future questions or PRs that need review to @devang-ml. He is the PoC for all EP-related work.

We look forward to continuing our collaboration.

P.S. I love AMD hardware. I personally have an AMD Radeon™ AI PRO R9700. I would love to see ONNX Runtime unleash beast-level performance on AMD GPUs.

snnn avatar Nov 10 '25 17:11 snnn
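
For reference, the compile-only fallback suggested in the comment above could look roughly like the sketch below. It assumes ONNX Runtime's `tools/ci_build/build.py` entry point and its `--use_migraphx` / `--migraphx_home` flags as described in the project's build documentation. The helper name, the build directory, and the `/opt/rocm` path are hypothetical and should be checked against the actual CI image.

```python
import shlex


def compile_only_command(build_dir="build/migraphx", migraphx_home="/opt/rocm"):
    """Return a build.py invocation that configures and compiles but never tests.

    Assumption: build.py runs only the phases it is asked for, so passing
    --update and --build without --test skips test execution.
    """
    return [
        "python3", "tools/ci_build/build.py",
        "--build_dir", build_dir,          # hypothetical output directory
        "--config", "Release",
        "--parallel",
        "--use_migraphx",
        "--migraphx_home", migraphx_home,  # hypothetical install prefix
        "--update",  # generate the CMake build tree
        "--build",   # compile only; no --test phase is requested
    ]


if __name__ == "__main__":
    # Print the command so it can be pasted into a pipeline step.
    print(shlex.join(compile_only_command()))
```

A pipeline like this would catch compile breaks on machines without an AMD GPU, but it would not catch runtime regressions in the EP.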

> Hi @TedThemistokleous ,
>
> Thank you for your continued contributions; your partnership is genuinely appreciated.

Likewise to the feedback here.

> Following up on the CI build issue, the ONNX Runtime team discussed this again last week. We are still finalizing the best approach and will let you know when we have an update.
>
> Meanwhile, my request for MI300X GPUs was rejected as there is no capacity. As an alternative, does MIGraphX support the AMD Radeon™ Pro V710? That could be a more accessible option for us. If that is also not possible, we could add a pipeline that only compiles the code but does not run the tests.

Yes, actually, we have an effort for ROCm on Radeon - https://rocm.docs.amd.com/projects/radeon-ryzen/en/latest/index.html

The V710 is an absolutely acceptable use case here for development/testing/CI. We have various customers using the V710 for their production work and for bringing up systems/clusters. If a feature is on the V710, it is certainly required to function on MI series cards as well. They should be interchangeable.

> One critical point I need to reiterate is that Windows support is our top priority, and we will focus more on the consumer market than on data center GPUs.

No problem at all, we do have an effort that supports Windows. Our UAI work internally tests and confirms that for Windows ONNX Runtime builds and grabs MIGraphX-side changes to support the EP.

If you're building off MIGraphX develop or the appropriate release branches plus ONNX Runtime, that should give you sufficient coverage. MIGraphX builds are cut from our release/rocm-rel-X.Y branches, and all changes for ONNX Runtime builds are tested and pushed into rocmX.Y_internal_testing in rocm/Onnxruntime before I upstream them to your mainline. So if you select a ROCm version as your base container, apt install migraphx-dev will give you the blessed QA code from the corresponding release/rocm-rel-X.Y branch.

These should all build and work on Windows/Linux, as the code was previously refactored to support both. The same goes for the ONNX Runtime side.

> I will be on vacation till the end of this month. To make sure you get the support you need, please direct any future questions or PRs that need review to @devang-ml. He is the PoC for all EP-related work.

Will do, appreciate the coverage. I have a few more changes coming down the pipe for us. I'll make sure to tag @devang-ml.

> P.S. I love AMD hardware. I personally have an AMD Radeon™ AI PRO R9700. I would love to see ONNX Runtime unleash beast-level performance on AMD GPUs.

<3

TedThemistokleous avatar Nov 10 '25 19:11 TedThemistokleous
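
The branch layout described in the comment above can be summarized in a small helper. This is only an illustration of the naming pattern given there (`release/rocm-rel-X.Y` for MIGraphX, `rocmX.Y_internal_testing` in rocm/Onnxruntime). The function name and the example version number are hypothetical.

```python
import re


def branches_for_rocm(version):
    """Map a ROCm version 'X.Y' to the branch names described in the thread."""
    if not re.fullmatch(r"\d+\.\d+", version):
        raise ValueError(f"expected a version like 'X.Y', got {version!r}")
    return {
        # MIGraphX release branch the packaged migraphx-dev is cut from
        "migraphx": f"release/rocm-rel-{version}",
        # staging branch in rocm/Onnxruntime, tested before upstreaming
        "onnxruntime": f"rocm{version}_internal_testing",
    }


if __name__ == "__main__":
    # "6.4" is an arbitrary example version, not one named in the thread.
    print(branches_for_rocm("6.4"))
```

A CI job could use a mapping like this to pin the ONNX Runtime staging branch that matches the ROCm version of its base container.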

@tianleiwu @snnn @skottmckay @devang-ml @hisham-hchowdhu

I've opened this issue recently - https://github.com/microsoft/onnxruntime/issues/26801

This is directly related to not having any CI coverage.

These are changes from the Microsoft side that are continually breaking our builds when we take changes off mainline.

TedThemistokleous avatar Dec 15 '25 19:12 TedThemistokleous