7950X results: Intel OpenCL CPU runtime with integer dot product extension support is much slower than without it
Hi, similar situation to M4: an OpenCL runtime that supports cl_khr_integer_dot_product produces slower INT8 results. The new Intel OpenCL runtime for CPU 2025.1 supports cl_khr_integer_dot_product (https://www.intel.com/content/www/us/en/developer/articles/release-notes/opencl-runtime-release-notes.html). Result on a 7950X with 2025.1:
| INT8 compute 0.079 TIOPs/s (1/64) |
vs. the older 2024 runtime, which does not support it:
| INT8 compute 0.588 TIOPs/s (1/64) |
Full results with 2025.1:
|----------------.------------------------------------------------------------|
| Device ID | 1 |
| Device Name | AMD Ryzen 9 7950X 16-Core Processor |
| Device Vendor | Intel(R) Corporation |
| Device Driver | 2025.19.3.0.17_230222 (Windows) |
| OpenCL Version | OpenCL C 3.0 |
| Compute Units | 32 at 0 MHz (16 cores, 0.000 TFLOPs/s) |
| Memory, Cache | 98026 MB RAM, 1024 KB global / 256 KB local |
| Buffer Limits | 98026 MB global, 128 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| FP64 compute 1.100 TFLOPs/s (1/64) |
| FP32 compute 1.309 TFLOPs/s (1/64) |
| FP16 compute 0.244 TFLOPs/s (1/64) |
| INT64 compute 0.538 TIOPs/s (1/64) |
| INT32 compute 1.270 TIOPs/s (1/64) |
| INT16 compute 2.589 TIOPs/s (1/64) |
| INT8 compute 0.079 TIOPs/s (1/64) |
| Memory Bandwidth ( coalesced read ) 50.96 GB/s |
| Memory Bandwidth ( coalesced write) 27.70 GB/s |
| Memory Bandwidth (misaligned read ) 60.71 GB/s |
| Memory Bandwidth (misaligned write) 30.80 GB/s |
|-----------------------------------------------------------------------------|
2024.x:
|----------------.------------------------------------------------------------|
| Device ID | 1 |
| Device Name | AMD Ryzen 9 7950X 16-Core Processor |
| Device Vendor | Intel(R) Corporation |
| Device Driver | 2024.17.3.0.08_160000 (Windows) |
| OpenCL Version | OpenCL C 3.0 |
| Compute Units | 32 at 0 MHz (16 cores, 0.000 TFLOPs/s) |
| Memory, Cache | 98026 MB RAM, 1024 KB global / 32 KB local |
| Buffer Limits | 98026 MB global, 128 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| FP64 compute 0.979 TFLOPs/s (1/64) |
| FP32 compute 1.175 TFLOPs/s (1/64) |
| FP16 compute not supported |
| INT64 compute 0.303 TIOPs/s (1/64) |
| INT32 compute 1.225 TIOPs/s (1/64) |
| INT16 compute 2.323 TIOPs/s (1/64) |
| INT8 compute 0.588 TIOPs/s (1/64) |
| Memory Bandwidth ( coalesced read ) 49.40 GB/s |
| Memory Bandwidth ( coalesced write) 27.05 GB/s |
| Memory Bandwidth (misaligned read ) 59.32 GB/s |
| Memory Bandwidth (misaligned write) 30.57 GB/s |
|-----------------------------------------------------------------------------|
Hi @oscarbg,
thanks for sharing this finding! I can reproduce the poor native dp4a performance with CPU Runtime release 2025.1 on my i7-8700K system. The newly added native dp4a instruction is much slower than the fallback emulation, and on Windows it even fails to compile the dot(char4, char4) function on my system. I have reported this in the CPU Runtime repository: https://github.com/intel/llvm/issues/18212#issue-3022801845
Will keep you updated here.
Kind regards, Moritz
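For readers unfamiliar with the term: dp4a is a dot product of four 8-bit integer lanes, accumulated into a 32-bit integer. The following is a minimal sketch in plain C of what the emulated path computes, not the benchmark's actual OpenCL kernel code. The function name `dp4a_emulated` is hypothetical, and the sketch assumes the non-saturating, signed variant.

```c
#include <stdint.h>

/* Hypothetical reference for an emulated dp4a (assumption: signed lanes,
   no saturation). Multiplies four 8-bit lanes pairwise and adds the sum
   of the products to the 32-bit accumulator c. */
int32_t dp4a_emulated(const int8_t a[4], const int8_t b[4], int32_t c) {
    int32_t acc = c;
    for (int i = 0; i < 4; i++) {
        /* widen each lane to 32 bits before multiplying */
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}
```

A native dp4a instruction performs the same arithmetic in a single operation, which is why it is normally expected to be at least as fast as this kind of emulation.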
Hi @ProjectPhysX, thanks for taking the time to verify. Good to know I'm not the only one having issues (being on an AMD platform, I thought it could be AMD-specific). Many thanks for reporting it appropriately to the Intel OpenCL devs. I see you are having "hot discussions" :-) over there. Once they fix it and I can verify, I will close this issue.
Kind regards, Oscar
Hi @oscarbg,
I have now disabled the native dp4a for the Intel CPU Runtime for OpenCL on the application side :) It will now use the faster dp4a emulation again on Intel/AMD CPUs.
Kind regards, Moritz
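An application-side switch of this kind could look roughly like the sketch below. This is an illustration under stated assumptions, not the actual change made in the benchmark: the function `use_native_dp4a` and its policy are hypothetical. In a real OpenCL program, the strings and the device type would come from `clGetDeviceInfo` with `CL_DEVICE_EXTENSIONS`, `CL_DEVICE_VENDOR`, and `CL_DEVICE_TYPE`. Note that the Intel CPU Runtime reports "Intel(R) Corporation" as vendor even on the AMD 7950X, as the logs above show.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical policy (assumption, not the benchmark's real code):
   use the native integer dot product only if the extension is reported
   AND the device is not a CPU driven by the Intel runtime. */
bool use_native_dp4a(const char *extensions, const char *vendor, bool is_cpu) {
    /* the runtime advertises the extension in its extension string */
    bool has_ext = strstr(extensions, "cl_khr_integer_dot_product") != NULL;
    /* CPU device exposed by the Intel runtime (covers Intel and AMD CPUs) */
    bool intel_cpu_runtime = is_cpu && strstr(vendor, "Intel") != NULL;
    return has_ext && !intel_cpu_runtime;
}
```

When the function returns false, the application would compile its kernels with the emulated dot product instead of the native one.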