OpenCL-Benchmark Question: How to get good AMD CPU results?

Hi,

I REALLY like this benchmark.

So much so that I plan to (most likely) use its results to make roofline plots in an upcoming paper (I will cite it as shown in README).

However, I am having issues getting proper results on AMD CPUs.

I have seen that AMD dropped all official OpenCL support for their CPUs.

I am able to still run the benchmark if I load the Intel OneAPI environment, but I get funky CPU info and the results do not seem right compared to other similar Intel CPUs.

For example, on an EPYC 7742 dual-socket system, it only detects one of the CPUs and says:

.-----------------------------------------------------------------------------.
|----------------.------------------------------------------------------------|
| Device ID    0 | AMD EPYC 7742 64-Core Processor                            |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | AMD EPYC 7742 64-Core Processor                            |
| Device Vendor  | Intel(R) Corporation                                       |
| Device Driver  | 2024.18.6.0.02_160000 (Linux)                              |
| OpenCL Version | OpenCL C 3.0                                               |
| Compute Units  | 64 at 0 MHz (32 cores, 0.000 TFLOPs/s)                     |
| Memory, Cache  | 127842 MB, 512 KB global / 32 KB local                     |
| Buffer Limits  | 63921 MB global, 128 KB constant                           |

The 0 MHz is concerning.

Then, the results seem quite a bit slower than they should be: FP64 compute 0.022 TFLOPs/s (1/64).

For example, on the EPYC 7702P (a slower CPU) with the Ubuntu OpenCL runtime, I get | FP64 compute 1.111 TFLOPs/s (1/64) |, but it still reports 0 MHz in the info.

I really like the suggestions for installing the OpenCL runtime that the compilation spits out, but on the supercomputer I cannot install those packages to try the open-source OpenCL. Are there pre-built OpenCL runtime binaries that I could point to and that work well on AMD CPUs?

Is there a way to fix the CPU identification so that it reports AMD rather than Intel, and to get the correct MHz?

Thanks!

Ron

Jul 26 '24 18:07 sumseq

Hi @sumseq,

The 0 MHz is purely cosmetic. The Intel CPU Runtime for OpenCL internally uses a lookup table to report CL_DEVICE_MAX_CLOCK_FREQUENCY, and for AMD CPUs there is simply no data in there.
The Intel(R) Corporation returned by CL_DEVICE_VENDOR is also just purely cosmetic.
Both 64-Core CPUs should be detected on a dual-socket system, and show up as a single OpenCL device with 256 compute units (2 CPUs * 2 threads/core * 64 cores). Check that your slurm reservation allocates the full node with both CPUs, and check if SMT is enabled. Don't forget the --exclusive flag for slurm reservation.
```
srun --nodes=1 --exclusive --time=01:00:00 --pty bash
```
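The expected compute-unit count above is just sockets times cores times hardware threads. A minimal Python sketch of that arithmetic follows; the function name `expected_compute_units` is made up for illustration, and the affinity check at the end is only a rough way to see how many logical CPUs the current job may use on Linux.

```python
import os


def expected_compute_units(sockets: int, cores_per_socket: int, threads_per_core: int) -> int:
    """Logical CPUs a CPU OpenCL runtime would be expected to expose as compute units."""
    return sockets * cores_per_socket * threads_per_core


# Dual-socket EPYC 7742 with SMT enabled, as described above.
print(expected_compute_units(sockets=2, cores_per_socket=64, threads_per_core=2))  # 256

# Same node with SMT disabled.
print(expected_compute_units(sockets=2, cores_per_socket=64, threads_per_core=1))  # 128

# Logical CPUs the current process is allowed to run on (Linux only).
# In a non-exclusive allocation this can be smaller than the node total.
if hasattr(os, "sched_getaffinity"):
    print(len(os.sched_getaffinity(0)))
```

If the last number is well below the expected count, the reservation probably does not cover the full node.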
I can reproduce the poor performance behavior on dual EPYC 7302, 7313, and 7352 systems. The kernels are vectorized to AVX2, which is good. Manually turning off vectorization with `export CL_CONFIG_USE_VECTORIZER=false` reduces performance by ~7.9x, so the vectorization is also working as intended.
It's possible that there are special optimizations for AMD's microarchitecture that the Intel Runtime does not fully exploit. An alternative here is to use PoCL. On all of the Intel CPUs I've tested, the Intel Runtime is a lot faster than PoCL, and PoCL itself is transitioning from its in-house threading library to Intel TBB, which the Intel Runtime uses. It's possible that PoCL is faster on AMD systems. But all the modern AMD EPYC systems I have access to at university unfortunately don't have PoCL installed, and I don't have sudo permissions, so I can't test whether PoCL is faster. However, in the coming weeks I'll get access to a dual EPYC 9754 system with sudo permissions to test this. I'll keep you updated.

Kind regards, Moritz

Jul 27 '24 13:07 ProjectPhysX

Update:

I have access to the 7742 on another supercomputer that seems to have an OpenCL runtime installed.

The benchmark seems to be using the CUDA x86 OpenCL library:

/nasa/nvidia/hpc_sdk/toss4/Linux_x86_64/24.1/cuda/lib64/libOpenCL.so.1 (0x0000145e3d859000)

However, when I try using the CUDA library on the other supercomputer, the benchmark still says it cannot find the device, so I think PoCL is still needed for device identification?

Anyways, I get the following result on the machine that worked:

|----------------.------------------------------------------------------------|
| Device ID    0 | AMD EPYC 7742 64-Core Processor                            |
| Device ID    1 | Intel(R) FPGA Emulation Device                             |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | AMD EPYC 7742 64-Core Processor                            |
| Device Vendor  | Intel(R) Corporation                                       |
| Device Driver  | 2023.16.7.0.21_160000 (Linux)                              |
| OpenCL Version | OpenCL C 3.0                                               |
| Compute Units  | 128 at 0 MHz (64 cores, 0.000 TFLOPs/s)                    |
| Memory, Cache  | 515280 MB, 512 KB global / 32 KB local                     |
| Buffer Limits  | 257640 MB global, 128 KB constant                          |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64  compute                                         1.370 TFLOPs/s (1/64) |
| FP32  compute                                         1.379 TFLOPs/s (1/64) |
| FP16  compute                                          not supported        |
| INT64 compute                                         0.101  TIOPs/s (1/64) |
| INT32 compute                                         1.541  TIOPs/s (1/64) |
| INT16 compute                                         2.892  TIOPs/s (1/64) |
| INT8  compute                                         2.848  TIOPs/s (1/64) |
| Memory Bandwidth ( coalesced read      )                         14.36 GB/s |
| Memory Bandwidth ( coalesced      write)                         17.94 GB/s |
| Memory Bandwidth (misaligned read      )                         33.05 GB/s |
| Memory Bandwidth (misaligned      write)                         20.66 GB/s |
| PCIe   Bandwidth (send                 )                         16.30 GB/s |
| PCIe   Bandwidth (   receive           )                         17.46 GB/s |
| PCIe   Bandwidth (        bidirectional)            (Gen4 x16)    9.14 GB/s |
|-----------------------------------------------------------------------------|

This is on a dual-socket node with hyper-threading disabled.

The results for the "FPGA" device are identical to those above, leading me to think that it is the other CPU socket being misidentified?

The TFLOPs look a lot better but I was expecting more bandwidth (since the peak is 208 GB/s).
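As a rough cross-check of the peak figure, theoretical DRAM bandwidth is channels times transfer rate times bus width. The sketch below assumes eight channels of DDR4-3200 per socket with a 64-bit (8-byte) bus per channel; that memory configuration is an assumption about this node, not something the benchmark reports, and the function name is hypothetical.

```python
def peak_dram_bandwidth_gbs(channels: int, transfers_mt_s: float, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth of one socket in GB/s (10^9 bytes/s)."""
    return channels * transfers_mt_s * 1e6 * bus_bytes / 1e9


# Assumed configuration: 8 channels of DDR4-3200 per socket.
per_socket = peak_dram_bandwidth_gbs(channels=8, transfers_mt_s=3200)
print(per_socket)  # 204.8

# Best bandwidth figure from the table above (misaligned read), in GB/s.
measured = 33.05
print(f"{measured / per_socket:.0%} of one socket's theoretical peak")  # 16% ...
```

Under that assumption one socket peaks at 204.8 GB/s, close to the 208 GB/s quoted above, so the best measured value reaches only about 16% of a single socket.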

Ron

Jul 29 '24 23:07 sumseq

I was able to run it with PoCL using a Singularity container. It now detects the CPU correctly, but the results are still not great:

.-----------------------------------------------------------------------------.
|----------------.------------------------------------------------------------|
| Device ID    0 | pthread-AMD EPYC 7742 64-Core Processor                    |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | pthread-AMD EPYC 7742 64-Core Processor                    |
| Device Vendor  | AuthenticAMD                                               |
| Device Driver  | 1.4 (Linux)                                                |
| OpenCL Version | OpenCL C 1.2 pocl                                          |
| Compute Units  | 128 at 2245 MHz (64 cores, 4.598 TFLOPs/s)                 |
| Memory, Cache  | 255437 MB, 16384 KB global / 8192 KB local                 |
| Buffer Limits  | 65536 MB global, 8192 KB constant                          |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64  compute                                         0.104 TFLOPs/s (1/64) |
| FP32  compute                                         0.105 TFLOPs/s (1/64) |
| FP16  compute                                          not supported        |
| INT64 compute                                         0.199  TIOPs/s (1/24) |
| INT32 compute                                         0.217  TIOPs/s (1/24) |
| INT16 compute                                         0.444  TIOPs/s (1/12) |
| INT8  compute                                         0.741  TIOPs/s (1/8 ) |
| Memory Bandwidth ( coalesced read      )                         16.99 GB/s |
| Memory Bandwidth ( coalesced      write)                         23.69 GB/s |
| Memory Bandwidth (misaligned read      )                         91.84 GB/s |
| Memory Bandwidth (misaligned      write)                         49.60 GB/s |
| PCIe   Bandwidth (send                 )                         16.17 GB/s |
| PCIe   Bandwidth (   receive           )                         13.19 GB/s |
| PCIe   Bandwidth (        bidirectional)            (Gen4 x16)   14.69 GB/s |
|-----------------------------------------------------------------------------|
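For reference, the 4.598 TFLOPs/s in the "Compute Units" row looks like a nominal figure derived from core count and clock rather than a measurement. The sketch below reproduces it with 32 FLOPs per cycle per core; that factor is inferred from the numbers in this thread, not taken from the benchmark source, and the function name is made up.

```python
def nominal_tflops(cores: int, clock_mhz: float, flops_per_cycle: int = 32) -> float:
    """Nominal peak in TFLOPs/s: cores * clock * FLOPs per cycle per core.

    flops_per_cycle=32 is an assumption inferred from the reported numbers.
    """
    return cores * clock_mhz * 1e6 * flops_per_cycle / 1e12


# PoCL reports 64 cores at 2245 MHz.
print(round(nominal_tflops(cores=64, clock_mhz=2245), 3))  # 4.598

# With a reported clock of 0 MHz the nominal figure collapses to zero,
# which matches the "0.000 TFLOPs/s" shown with the Intel runtime.
print(nominal_tflops(cores=64, clock_mhz=0))  # 0.0

# Measured FP32 result from the table above, relative to the nominal figure.
print(f"{0.105 / nominal_tflops(64, 2245):.1%}")
```

So the measured 0.105 TFLOPs/s is only a small fraction of the nominal value, whichever FLOPs-per-cycle factor the benchmark actually uses.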

Jul 31 '24 01:07 sumseq

Hi @sumseq,

I've tested a 2x EPYC 9754 system today. The Intel CPU Runtime for OpenCL is way faster than PoCL on this system too.

Kind regards, Moritz

Aug 23 '24 10:08 ProjectPhysX

Results stitched together for two different versions of the Windows AMD runtime, using OpenCL-Benchmark v1.8:

.-----------------------------------------------------------------------------.
|----------------.------------------------------------------------------------|
| Device ID    0 | NVIDIA GeForce RTX 2060                                    |
| Device ID    1 | AMD Ryzen 5 3600 6-Core Processor                          |
| Device ID    2 | AMD Ryzen 5 3600 6-Core Processor                          |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 1                                                          |
| Device Name    | AMD Ryzen 5 3600 6-Core Processor                          |
| Device Vendor  | Intel(R) Corporation                                       |
| Device Driver  | 2022.14.8.0.04_160000 (Windows)                            |
| OpenCL Version | OpenCL C 3.0                                               |
| Compute Units  | 12 at 0 MHz (6 cores, 0.000 TFLOPs/s)                      |
| Memory, Cache  | 24501 MB RAM, 512 KB global / 32 KB local                  |
| Buffer Limits  | 24501 MB global, 128 KB constant                           |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64  compute                                         0.146 TFLOPs/s (1/64) |
| FP32  compute                                         0.146 TFLOPs/s (1/64) |
| FP16  compute                                          not supported        |
| INT64 compute                                         0.059  TIOPs/s (1/64) |
| INT32 compute                                         0.145  TIOPs/s (1/64) |
| INT16 compute                                         0.369  TIOPs/s (1/64) |
| INT8  compute                                         0.084  TIOPs/s (1/64) |
| Memory Bandwidth ( coalesced read      )                         33.79 GB/s |
| Memory Bandwidth ( coalesced      write)                         15.37 GB/s |
| Memory Bandwidth (misaligned read      )                         34.63 GB/s |
| Memory Bandwidth (misaligned      write)                         16.07 GB/s |
|-----------------------------------------------------------------------------|
|-----------------------------------------------------------------------------|
| Done. Press Enter to exit.                                                  |
'-----------------------------------------------------------------------------'
|----------------.------------------------------------------------------------|
| Device ID      | 2                                                          |
| Device Name    | AMD Ryzen 5 3600 6-Core Processor                          |
| Device Vendor  | Intel(R) Corporation                                       |
| Device Driver  | 2025.20.6.0.04_224945 (Windows)                            |
| OpenCL Version | OpenCL C 3.0                                               |
| Compute Units  | 12 at 0 MHz (6 cores, 0.000 TFLOPs/s)                      |
| Memory, Cache  | 24501 MB RAM, 512 KB global / 256 KB local                 |
| Buffer Limits  | 24501 MB global, 128 KB constant                           |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64  compute                                         0.146 TFLOPs/s (1/64) |
| FP32  compute                                         0.144 TFLOPs/s (1/64) |
| FP16  compute                                         0.047 TFLOPs/s (1/64) |
| INT64 compute                                         0.059  TIOPs/s (1/64) |
| INT32 compute                                         0.162  TIOPs/s (1/64) |
| INT16 compute                                         0.422  TIOPs/s (1/64) |
| INT8  compute                                         0.022  TIOPs/s (1/64) |
| Memory Bandwidth ( coalesced read      )                          1.75 GB/s |
| Memory Bandwidth ( coalesced      write)                          2.96 GB/s |
| Memory Bandwidth (misaligned read      )                        129.34 GB/s |
| Memory Bandwidth (misaligned      write)                         18.58 GB/s |
|-----------------------------------------------------------------------------|
|-----------------------------------------------------------------------------|
| Done. Press Enter to exit.                                                  |
'-----------------------------------------------------------------------------'

The low INT8 result (the dp4a thing) is a known issue. The memory bandwidth... no idea what happened there. I guess you can play around with the runtime environment variables documented on the download page to make sure the vectorizer is doing its job and using the right target arch, but I wouldn't expect too much improvement.

If you want to get the correct "vendor" and frequency, my recommendation would be to mess with Intel's open-source LLVM code and add a bit that uses the cpuid instruction to get the vendor and the max frequency, like what PoCL is probably doing.
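Short of patching the runtime, the same vendor string can be read outside OpenCL on Linux, since /proc/cpuinfo exposes the cpuid vendor as `vendor_id`. This is only a sketch for labelling results, not the runtime fix itself; `parse_cpuinfo` is a hypothetical helper, and the sample text is illustrative rather than captured output. Note that the `cpu MHz` field is the current clock, not the maximum.

```python
from pathlib import Path


def parse_cpuinfo(text: str) -> dict:
    """Return the key/value pairs of the first processor entry in /proc/cpuinfo text."""
    info = {}
    for line in text.splitlines():
        if not line.strip():
            if info:
                break  # a blank line ends the first processor entry
            continue
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


# Illustrative sample in the /proc/cpuinfo layout (not captured output).
sample = (
    "processor\t: 0\n"
    "vendor_id\t: AuthenticAMD\n"
    "model name\t: AMD EPYC 7742 64-Core Processor\n"
    "\n"
    "processor\t: 1\n"
    "vendor_id\t: AuthenticAMD\n"
)

cpuinfo = Path("/proc/cpuinfo")
text = cpuinfo.read_text() if cpuinfo.exists() else sample
info = parse_cpuinfo(text)
print(info.get("vendor_id"), "|", info.get("model name"))
```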

Sep 02 '25 09:09 Artoria2e5
