Add OpenCL EP
Description
This PR adds OpenCL execution provider (EP) support, enabling ONNX model execution across a variety of accelerators by leveraging OpenCL. Specifically:
1. Implement OpenCL EP support based on the OpenCL 1.2 standard.
2. Add basic operator support for LLMs, such as Qwen2_5 and Llama2_7B_Chat.
3. Add C/C++ and Python API support.
Motivation and Context
Can any maintainer approve the CI workflows? Thanks!
/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,ONNX Runtime Web CI Pipeline,Windows ARM64 QNN CI Pipeline
Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.
/azp run Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,Windows x64 QNN CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,Windows GPU Doc Gen CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU CUDA CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Linux Android Emulator QNN CI Pipeline,Big Models
Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.
Could you rebase from main?
@luyhcsu Are you still aiming to merge this in? If not, do you mind if someone else picks it up?
Appreciate you checking in! I do plan to finish this—just need a bit more time to tidy things up. Totally open to others jumping in with suggestions or improvements too.
It should be pointed out that this is a continuation of the effort in #10830. I was the main developer of the original feature branch during my time at Microsoft (I am no longer there).
Let me add some context for this.
- Almost every OpenCL C compiler is a bug-ridden C compiler embedded in the kernel driver.
- except for the compiler from NVIDIA, not /s!
- be prepared for the workarounds!
- The pointer and opaque Texture/Image type design was never finished. There is no proper caching allocator for Texture/Image-backed opaque tensors, only a very simple one with an LRU cache.
- This is the main pain point.
- Some hardware vendors do not know how to properly design a GPU (ARM! the Mali GPU). Their shared memory (in CUDA terms; local memory in OpenCL terms) is backed by DRAM, so programmers lose the programmer-controlled cache, which makes it impossible to write high-performance kernels against flat pointers. To regain some caching, some implementations cleverly use the texture cache of the texture unit, but encoding and decoding the tensor coordinates is awful.
- The original branch was not merged because we came to the conclusion that there was not enough man-power (FTE) to maintain another EP.
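To make the allocator pain point above concrete, here is a minimal sketch of the kind of "very simple LRU cache" pool described: images are recycled only on an exact (width, height) match, and the least recently released image is destroyed once the pool exceeds a fixed capacity. The `Image` struct and `ImagePool` class are hypothetical stand-ins, not the PR's actual types; real code would wrap `clCreateImage`/`clReleaseMemObject` handles instead.

```cpp
#include <cstddef>
#include <list>
#include <map>
#include <memory>
#include <utility>

// Placeholder for a cl_mem image handle; real code would wrap clCreateImage.
struct Image {
  size_t width, height;
};

// Minimal LRU-style pool for 2D images, keyed by exact (width, height).
// Once more than `capacity_` images sit in the cache, the least recently
// released one is destroyed instead of being kept for reuse.
class ImagePool {
 public:
  explicit ImagePool(size_t capacity) : capacity_(capacity) {}

  std::unique_ptr<Image> Acquire(size_t w, size_t h) {
    auto key = std::make_pair(w, h);
    auto it = cache_.find(key);
    if (it != cache_.end() && !it->second.empty()) {
      auto img = std::move(it->second.front());  // reuse a cached image
      it->second.pop_front();
      --cached_;
      return img;
    }
    return std::make_unique<Image>(Image{w, h});  // cache miss: allocate
  }

  void Release(std::unique_ptr<Image> img) {
    auto key = std::make_pair(img->width, img->height);
    lru_.push_back(key);
    cache_[key].push_back(std::move(img));
    ++cached_;
    if (cached_ > capacity_) Evict();
  }

  size_t CachedCount() const { return cached_; }

 private:
  void Evict() {
    // Walk from the oldest release; skip stale entries whose image was
    // already re-acquired, and destroy the first image actually cached.
    auto it = lru_.begin();
    while (it != lru_.end()) {
      auto& bucket = cache_[*it];
      if (bucket.empty()) {  // stale: image was reused since this release
        it = lru_.erase(it);
        continue;
      }
      bucket.pop_front();  // unique_ptr destructor frees the image
      --cached_;
      lru_.erase(it);
      return;
    }
  }

  size_t capacity_, cached_ = 0;
  std::map<std::pair<size_t, size_t>, std::list<std::unique_ptr<Image>>> cache_;
  std::list<std::pair<size_t, size_t>> lru_;
};
```

A production allocator would additionally bucket by pixel format, tolerate "fits within" matches rather than exact sizes, and bound total bytes rather than image count, which is roughly the gap being described.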
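As for the "awful to encode and decode the tensor coords" remark: a sketch of one common channel-packed convention (used by several mobile GPU inference engines; the layout in this branch may differ) maps an NCHW tensor onto an `image2d_t` by packing four consecutive channels into one RGBA texel, with image width `ceil(C/4) * W` and image height `N * H`. The `TexCoord` type and both helpers are illustrative, not code from the PR.

```cpp
#include <cstddef>

// Hypothetical texel address: (x, y) picks the RGBA texel, `component`
// selects .x/.y/.z/.w within it.
struct TexCoord {
  size_t x, y, component;
};

// Encode an NCHW index into the channel-packed 2D image layout:
//   x = (c / 4) * W + w   (which 4-channel slice, then the column)
//   y = n * H + h         (batch-major rows)
inline TexCoord EncodeNCHW(size_t n, size_t c, size_t h, size_t w,
                           size_t H, size_t W) {
  return TexCoord{(c / 4) * W + w, n * H + h, c % 4};
}

// Invert the mapping: every kernel reading such an image has to do this
// arithmetic per element, which is the ergonomic cost being complained about.
inline void DecodeNCHW(const TexCoord& t, size_t H, size_t W,
                       size_t* n, size_t* c, size_t* h, size_t* w) {
  *w = t.x % W;
  *c = (t.x / W) * 4 + t.component;
  *h = t.y % H;
  *n = t.y / H;
}
```

The payoff for this bookkeeping is that neighboring reads hit the texture unit's dedicated cache, which is the only hardware-managed cache worth having on the Mali-style designs discussed above.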