7950X results: Intel OpenCL runtime with dot product extension support is much slower than without it

Open oscarbg opened this issue 8 months ago • 3 comments

Hi, this is a similar situation to the M4, i.e. a CL runtime that supports cl_khr_integer_dot_product produces slower INT8 results. The new Intel OpenCL runtime for CPU 2025.1 supports cl_khr_integer_dot_product (https://www.intel.com/content/www/us/en/developer/articles/release-notes/opencl-runtime-release-notes.html). Results on the 7950X with 2025.1:

| INT8 compute 0.079 TIOPs/s (1/64) |

vs. the older 2024 runtime, which does not support it:

| INT8 compute 0.588 TIOPs/s (1/64) |

Full results:

2025.1:

|----------------.------------------------------------------------------------|
| Device ID      | 1                                                          |
| Device Name    | AMD Ryzen 9 7950X 16-Core Processor                        |
| Device Vendor  | Intel(R) Corporation                                       |
| Device Driver  | 2025.19.3.0.17_230222 (Windows)                            |
| OpenCL Version | OpenCL C 3.0                                               |
| Compute Units  | 32 at 0 MHz (16 cores, 0.000 TFLOPs/s)                     |
| Memory, Cache  | 98026 MB RAM, 1024 KB global / 256 KB local                |
| Buffer Limits  | 98026 MB global, 128 KB constant                           |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64  compute                                         1.100 TFLOPs/s (1/64) |
| FP32  compute                                         1.309 TFLOPs/s (1/64) |
| FP16  compute                                         0.244 TFLOPs/s (1/64) |
| INT64 compute                                         0.538  TIOPs/s (1/64) |
| INT32 compute                                         1.270  TIOPs/s (1/64) |
| INT16 compute                                         2.589  TIOPs/s (1/64) |
| INT8  compute                                         0.079  TIOPs/s (1/64) |
| Memory Bandwidth ( coalesced read      )                         50.96 GB/s |
| Memory Bandwidth ( coalesced      write)                         27.70 GB/s |
| Memory Bandwidth (misaligned read      )                         60.71 GB/s |
| Memory Bandwidth (misaligned      write)                         30.80 GB/s |
|-----------------------------------------------------------------------------|

2024.x:

|----------------.------------------------------------------------------------|
| Device ID      | 1                                                          |
| Device Name    | AMD Ryzen 9 7950X 16-Core Processor                        |
| Device Vendor  | Intel(R) Corporation                                       |
| Device Driver  | 2024.17.3.0.08_160000 (Windows)                            |
| OpenCL Version | OpenCL C 3.0                                               |
| Compute Units  | 32 at 0 MHz (16 cores, 0.000 TFLOPs/s)                     |
| Memory, Cache  | 98026 MB RAM, 1024 KB global / 32 KB local                 |
| Buffer Limits  | 98026 MB global, 128 KB constant                           |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64  compute                                         0.979 TFLOPs/s (1/64) |
| FP32  compute                                         1.175 TFLOPs/s (1/64) |
| FP16  compute                                          not supported        |
| INT64 compute                                         0.303  TIOPs/s (1/64) |
| INT32 compute                                         1.225  TIOPs/s (1/64) |
| INT16 compute                                         2.323  TIOPs/s (1/64) |
| INT8  compute                                         0.588  TIOPs/s (1/64) |
| Memory Bandwidth ( coalesced read      )                         49.40 GB/s |
| Memory Bandwidth ( coalesced      write)                         27.05 GB/s |
| Memory Bandwidth (misaligned read      )                         59.32 GB/s |
| Memory Bandwidth (misaligned      write)                         30.57 GB/s |
|-----------------------------------------------------------------------------|

oscarbg avatar Apr 10 '25 09:04 oscarbg

Hi @oscarbg,

thanks for sharing this finding! I can reproduce the bad native dp4a performance with CPU Runtime release 2025.1 on my i7-8700K system. The newly added native dp4a instruction performs much slower than the fallback emulation. And on Windows it even fails to compile the dot(char4, char4) function on my system. I have reported this issue in the CPU Runtime repository: https://github.com/intel/llvm/issues/18212#issue-3022801845

Will keep you updated here.

Kind regards, Moritz

ProjectPhysX avatar Apr 27 '25 06:04 ProjectPhysX
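
For readers unfamiliar with the term: dp4a is a four-lane signed 8-bit dot product with a 32-bit accumulator, and the "fallback emulation" mentioned above simply spells out the four multiply-adds. The following is a minimal host-side C++ sketch of that arithmetic, not the benchmark's actual kernel code; the `Char4` type and the `dp4a_emulated` name are illustrative assumptions standing in for OpenCL's `char4` and whatever helper the kernel uses.

```cpp
#include <array>
#include <cstdint>

// Four signed 8-bit lanes, standing in for OpenCL's char4 (illustrative type).
using Char4 = std::array<std::int8_t, 4>;

// Emulated dp4a: returns c + a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3].
// Each lane is widened to 32 bits before multiplying, so the products
// cannot overflow the 8-bit input type.
inline std::int32_t dp4a_emulated(const Char4& a, const Char4& b, std::int32_t c) {
    std::int32_t acc = c;
    for (int i = 0; i < 4; ++i) {
        acc += static_cast<std::int32_t>(a[i]) * static_cast<std::int32_t>(b[i]);
    }
    return acc;
}
```

A native implementation would compute the same value, ideally in one instruction; the results in this thread suggest that the 2025.1 CPU runtime's native path is slower than this kind of plain multiply-add sequence.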

Hi @ProjectPhysX, thanks for taking the time to verify. Good to know I'm not the only one having issues (being on an AMD platform, I thought it could be AMD-specific). Many thanks for reporting it appropriately to the Intel OpenCL devs. I see you are having "hot discussions" :-) over there. Once "they" fix it and I can verify, I will close the issue.

Kind regards, Oscar

oscarbg avatar May 01 '25 17:05 oscarbg

Hi @oscarbg,

I have now disabled the native dp4a for the Intel CPU Runtime for OpenCL on the application side :) It will now use the faster dp4a emulation again on Intel/AMD CPUs.

Kind regards, Moritz

ProjectPhysX avatar May 17 '25 07:05 ProjectPhysX
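
An application-side workaround of this kind can be sketched as a small decision function: use the native dot product only when the extension is reported and the device is not a CPU served by Intel's runtime. This is a hedged illustration under stated assumptions, not the actual change made in OpenCL-Benchmark; `DeviceInfo`, `has_extension` and `use_native_dp4a` are hypothetical names, and the fields are assumed to be filled from `clGetDeviceInfo` queries (`CL_DEVICE_VENDOR`, `CL_DEVICE_EXTENSIONS`, `CL_DEVICE_TYPE`).

```cpp
#include <cstddef>
#include <string>

// Hypothetical summary of a device, as an application might collect it
// from clGetDeviceInfo. Not a type from OpenCL or from OpenCL-Benchmark.
struct DeviceInfo {
    std::string vendor;      // e.g. "Intel(R) Corporation"
    std::string extensions;  // space-separated extension names
    bool is_cpu;             // true if the device type is CL_DEVICE_TYPE_CPU
};

// Whole-token search in a space-separated extension string, so that a
// longer name sharing the same prefix does not produce a false match.
inline bool has_extension(const std::string& extensions, const std::string& name) {
    std::size_t pos = 0;
    while ((pos = extensions.find(name, pos)) != std::string::npos) {
        const std::size_t end = pos + name.size();
        const bool start_ok = (pos == 0) || (extensions[pos - 1] == ' ');
        const bool end_ok = (end == extensions.size()) || (extensions[end] == ' ');
        if (start_ok && end_ok) return true;
        pos = end;
    }
    return false;
}

// Assumption for illustration: fall back to emulation on any CPU device
// whose vendor string contains "Intel", even if the extension is reported.
inline bool use_native_dp4a(const DeviceInfo& d) {
    const bool supported = has_extension(d.extensions, "cl_khr_integer_dot_product");
    const bool intel_cpu_runtime = d.is_cpu && d.vendor.find("Intel") != std::string::npos;
    return supported && !intel_cpu_runtime;
}
```

Note that the Ryzen 9 7950X in the tables above reports "Intel(R) Corporation" as its device vendor when running under Intel's CPU runtime, so a vendor-plus-device-type check like this one would cover AMD CPUs as well.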