Add OpenCL EP
Description
This PR adds OpenCL execution provider (EP) support, enabling ONNX model execution across a variety of accelerators by leveraging OpenCL. Specifically:
1. Implement OpenCL EP support based on the OpenCL 1.2 standard.
2. Add basic operator support for LLMs, such as Qwen2_5 and Llama2_7B_Chat.
3. Add C/C++ and Python API support.
Motivation and Context
Can any maintainer approve the CI workflows? Thanks!
/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,ONNX Runtime Web CI Pipeline,Windows ARM64 QNN CI Pipeline
Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.
/azp run Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,Windows x64 QNN CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,Windows GPU Doc Gen CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU CUDA CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Linux Android Emulator QNN CI Pipeline,Big Models
Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.
Could you rebase from main?
@luyhcsu Are you still aiming to merge this in? If not, do you mind if someone else picks it up?
Appreciate you checking in! I do plan to finish this—just need a bit more time to tidy things up. Totally open to others jumping in with suggestions or improvements too.
It should be pointed out that this is a continuation of the effort in #10830. I was the main developer of the original feature branch during my time at Microsoft (I am no longer there).
Let me add some context for this.
- Almost every OpenCL C compiler is a bug-ridden C compiler embedded in the kernel driver.
- except for the compiler from NVIDIA, not /s!
- be prepared for the workarounds!
- The pointer and opaque Texture/Image type design was never finished. There is no proper caching allocator for Texture/Image-backed opaque tensors, only a very simple one with an LRU cache.
- This is the main pain point.
- Some hardware vendors do not know how to properly design a GPU (ARM! the Mali GPU). Their shared memory (in CUDA terms; local memory in OpenCL terms) is backed by DRAM, so programmers lose the programmer-controlled cache, which makes it impossible to write high-performance kernels against flat pointers. To regain some caching, some implementations cleverly use the texture cache of the texture unit, but encoding and decoding the tensor coordinates is awful.
- The original branch was not merged because we came to the conclusion that there was not enough man-power (FTE) to maintain another EP.
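To make the allocator pain point above concrete, here is a minimal sketch of the kind of "very simple LRU cache" pool described: images are recycled only on an exact (width, height) match, and the least recently released image is destroyed once the pool exceeds a fixed capacity. The `Image` struct and `ImagePool` class are hypothetical stand-ins, not the PR's actual types; real code would wrap `clCreateImage`/`clReleaseMemObject` handles instead.

```cpp
#include <cstddef>
#include <list>
#include <map>
#include <memory>
#include <utility>

// Placeholder for a cl_mem image handle; real code would wrap clCreateImage.
struct Image {
  size_t width, height;
};

// Minimal LRU-style pool for 2D images, keyed by exact (width, height).
// Once more than `capacity_` images sit in the cache, the least recently
// released one is destroyed instead of being kept for reuse.
class ImagePool {
 public:
  explicit ImagePool(size_t capacity) : capacity_(capacity) {}

  std::unique_ptr<Image> Acquire(size_t w, size_t h) {
    auto key = std::make_pair(w, h);
    auto it = cache_.find(key);
    if (it != cache_.end() && !it->second.empty()) {
      auto img = std::move(it->second.front());  // reuse a cached image
      it->second.pop_front();
      --cached_;
      return img;
    }
    return std::make_unique<Image>(Image{w, h});  // cache miss: allocate
  }

  void Release(std::unique_ptr<Image> img) {
    auto key = std::make_pair(img->width, img->height);
    lru_.push_back(key);
    cache_[key].push_back(std::move(img));
    ++cached_;
    if (cached_ > capacity_) Evict();
  }

  size_t CachedCount() const { return cached_; }

 private:
  void Evict() {
    // Walk from the oldest release; skip stale entries whose image was
    // already re-acquired, and destroy the first image actually cached.
    auto it = lru_.begin();
    while (it != lru_.end()) {
      auto& bucket = cache_[*it];
      if (bucket.empty()) {  // stale: image was reused since this release
        it = lru_.erase(it);
        continue;
      }
      bucket.pop_front();  // unique_ptr destructor frees the image
      --cached_;
      lru_.erase(it);
      return;
    }
  }

  size_t capacity_, cached_ = 0;
  std::map<std::pair<size_t, size_t>, std::list<std::unique_ptr<Image>>> cache_;
  std::list<std::pair<size_t, size_t>> lru_;
};
```

A production allocator would additionally bucket by pixel format, tolerate "fits within" matches rather than exact sizes, and bound total bytes rather than image count, which is roughly the gap being described.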
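As for the "awful to encode and decode the tensor coords" remark: a sketch of one common channel-packed convention (used by several mobile GPU inference engines; the layout in this branch may differ) maps an NCHW tensor onto an `image2d_t` by packing four consecutive channels into one RGBA texel, with image width `ceil(C/4) * W` and image height `N * H`. The `TexCoord` type and both helpers are illustrative, not code from the PR.

```cpp
#include <cstddef>

// Hypothetical texel address: (x, y) picks the RGBA texel, `component`
// selects .x/.y/.z/.w within it.
struct TexCoord {
  size_t x, y, component;
};

// Encode an NCHW index into the channel-packed 2D image layout:
//   x = (c / 4) * W + w   (which 4-channel slice, then the column)
//   y = n * H + h         (batch-major rows)
inline TexCoord EncodeNCHW(size_t n, size_t c, size_t h, size_t w,
                           size_t H, size_t W) {
  return TexCoord{(c / 4) * W + w, n * H + h, c % 4};
}

// Invert the mapping: every kernel reading such an image has to do this
// arithmetic per element, which is the ergonomic cost being complained about.
inline void DecodeNCHW(const TexCoord& t, size_t H, size_t W,
                       size_t* n, size_t* c, size_t* h, size_t* w) {
  *w = t.x % W;
  *c = (t.x / W) * 4 + t.component;
  *h = t.y % H;
  *n = t.y / H;
}
```

The payoff for this bookkeeping is that neighboring reads hit the texture unit's dedicated cache, which is the only hardware-managed cache worth having on the Mali-style designs discussed above.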