OpenCL-Benchmark
Question: How to get good AMD CPU results?
Hi,
I REALLY like this benchmark.
So much so that I plan to (most likely) use its results to make roofline plots in an upcoming paper (I will cite it as shown in README).
However, I am having issues getting proper results on AMD CPUs.
I have seen that AMD dropped all official OpenCL support for their CPUs.
I am able to still run the benchmark if I load the Intel OneAPI environment, but I get funky CPU info and the results do not seem right compared to other similar Intel CPUs.
For example, on an EPYC 7742 dual-socket system, it only detects one of the CPUs and says:
.-----------------------------------------------------------------------------.
|----------------.------------------------------------------------------------|
| Device ID 0 | AMD EPYC 7742 64-Core Processor |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | AMD EPYC 7742 64-Core Processor |
| Device Vendor | Intel(R) Corporation |
| Device Driver | 2024.18.6.0.02_160000 (Linux) |
| OpenCL Version | OpenCL C 3.0 |
| Compute Units | 64 at 0 MHz (32 cores, 0.000 TFLOPs/s) |
| Memory, Cache | 127842 MB, 512 KB global / 32 KB local |
| Buffer Limits | 63921 MB global, 128 KB constant |
The 0 MHz is concerning.
Then, the results seem quite a bit slower than they should be: FP64 compute 0.022 TFLOPs/s (1/64)
For example, on the EPYC 7702P (a slower CPU) with the Ubuntu opencl runtime I get: | FP64 compute 1.111 TFLOPs/s (1/64) | but it still reports 0 MHz in the info.
I really like the suggestions for installing the OpenCL runtime that the compilation prints, but on the supercomputer I cannot install those packages to try the open-source OpenCL runtime. Are there pre-built OpenCL runtime binaries I could point to that work well on AMD CPUs?
Is there a way to fix the CPU identification so that it reports AMD rather than Intel and shows the correct MHz?
Thanks!
- Ron
Hi @sumseq,
- The `0 MHz` is purely cosmetic. The Intel CPU Runtime for OpenCL internally uses a lookup table to report `CL_DEVICE_MAX_CLOCK_FREQUENCY`, and for AMD CPUs there is simply no data in it.
- The `Intel(R) Corporation` returned by `CL_DEVICE_VENDOR` is also purely cosmetic.
- Both 64-core CPUs should be detected on a dual-socket system and show up as a single OpenCL device with 256 compute units (2 CPUs * 2 threads/core * 64 cores). Check that your slurm reservation allocates the full node with both CPUs, and check whether SMT is enabled. Don't forget the `--exclusive` flag for the slurm reservation: `srun --nodes=1 --exclusive --time=01:00:00 --pty bash`
- I can reproduce the poor performance behavior on dual EPYC 7302, 7313, and 7352 systems. The kernels are vectorized to AVX2, which is good. Manually turning off vectorization with `export CL_CONFIG_USE_VECTORIZER=false` reduces performance by ~7.9x, so the vectorization is also working as intended.
- It's possible that there are special optimizations for AMD's microarchitecture that the Intel Runtime does not fully exploit. An alternative here is to use PoCL. On all of the Intel CPUs I've tested, the Intel Runtime is a lot faster than PoCL, and PoCL itself is transitioning from their in-house threading library to Intel TBB, which the Intel Runtime uses. It's possible that on AMD systems, PoCL might be faster. But all the modern AMD EPYC systems I have access to at university unfortunately don't have PoCL installed and I don't have `sudo` permissions, so I can't test if PoCL is faster. However, in the coming weeks I'll get access to a dual EPYC 9754 system with `sudo` permissions to test this. I'll keep you updated.
Kind regards, Moritz
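The compute-unit arithmetic above (2 CPUs * 2 threads/core * 64 cores) can be compared with what the job actually sees. This is a minimal Python sketch, assuming Linux, where `os.sched_getaffinity` is available; the helper name `expected_compute_units` is made up for illustration, and whether a given OpenCL runtime follows the process affinity mask is an assumption to verify on the target system.

```python
import os


def expected_compute_units(sockets: int, cores_per_socket: int, threads_per_core: int) -> int:
    """Logical CPUs a full node should expose as OpenCL compute units."""
    return sockets * cores_per_socket * threads_per_core


# Dual EPYC 7742 with SMT enabled, as in the reply above.
full_node = expected_compute_units(sockets=2, cores_per_socket=64, threads_per_core=2)

# Logical CPUs this process is allowed to run on. A restricted Slurm
# allocation usually shows up here as a smaller number than the full node.
if hasattr(os, "sched_getaffinity"):
    visible = len(os.sched_getaffinity(0))
else:  # non-Linux fallback: total logical CPUs, ignoring affinity
    visible = os.cpu_count()

print(f"expected on a full node: {full_node}, visible to this process: {visible}")
```

Running this inside the same `srun` shell as the benchmark gives a quick hint whether a low compute-unit count comes from the allocation or from the runtime.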
Update:
I have access to the 7742 on another supercomputer that seems to have an OpenCL runtime installed.
The benchmark seems to be using the CUDA x86 OpenCL library:
/nasa/nvidia/hpc_sdk/toss4/Linux_x86_64/24.1/cuda/lib64/libOpenCL.so.1 (0x0000145e3d859000)
However, when I try using the CUDA library on the other supercomputer, the benchmark still says it cannot find the device, so I think PoCL is still needed for device identification?
Anyways, I get the following result on the machine that worked:
|----------------.------------------------------------------------------------|
| Device ID 0 | AMD EPYC 7742 64-Core Processor |
| Device ID 1 | Intel(R) FPGA Emulation Device |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | AMD EPYC 7742 64-Core Processor |
| Device Vendor | Intel(R) Corporation |
| Device Driver | 2023.16.7.0.21_160000 (Linux) |
| OpenCL Version | OpenCL C 3.0 |
| Compute Units | 128 at 0 MHz (64 cores, 0.000 TFLOPs/s) |
| Memory, Cache | 515280 MB, 512 KB global / 32 KB local |
| Buffer Limits | 257640 MB global, 128 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| FP64 compute 1.370 TFLOPs/s (1/64) |
| FP32 compute 1.379 TFLOPs/s (1/64) |
| FP16 compute not supported |
| INT64 compute 0.101 TIOPs/s (1/64) |
| INT32 compute 1.541 TIOPs/s (1/64) |
| INT16 compute 2.892 TIOPs/s (1/64) |
| INT8 compute 2.848 TIOPs/s (1/64) |
| Memory Bandwidth ( coalesced read ) 14.36 GB/s |
| Memory Bandwidth ( coalesced write) 17.94 GB/s |
| Memory Bandwidth (misaligned read ) 33.05 GB/s |
| Memory Bandwidth (misaligned write) 20.66 GB/s |
| PCIe Bandwidth (send ) 16.30 GB/s |
| PCIe Bandwidth ( receive ) 17.46 GB/s |
| PCIe Bandwidth ( bidirectional) (Gen4 x16) 9.14 GB/s |
|-----------------------------------------------------------------------------|
This is on a dual-socket node with hyper-threading disabled.
The results for the "FPGA" device are identical to those above, leading me to think that it is the other CPU socket, but being misidentified?
The TFLOPs look a lot better but I was expecting more bandwidth (since the peak is 208 GB/s).
- Ron
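Since the goal is a roofline plot, here is a rough sketch of how the numbers above would feed into one. It uses the standard roofline formula (attainable performance is the minimum of peak compute and arithmetic intensity times bandwidth). The function names are illustrative, and which bandwidth figure is the right one for the plot is left open: the sketch simply compares the best measured value above (33.05 GB/s) with the expected 208 GB/s.

```python
def attainable_tflops(intensity_flop_per_byte: float, peak_tflops: float, bandwidth_gbs: float) -> float:
    """Roofline model: performance is capped by compute or by memory traffic."""
    memory_bound_tflops = intensity_flop_per_byte * bandwidth_gbs / 1000.0  # GFLOPs/s -> TFLOPs/s
    return min(peak_tflops, memory_bound_tflops)


def ridge_point(peak_tflops: float, bandwidth_gbs: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which a kernel stops being memory-bound."""
    return peak_tflops * 1000.0 / bandwidth_gbs


PEAK_FP64_TFLOPS = 1.370       # measured FP64 compute from the report above
MEASURED_BANDWIDTH_GBS = 33.05  # best measured bandwidth (misaligned read)
EXPECTED_BANDWIDTH_GBS = 208.0  # peak bandwidth expected by the poster

for label, bw in (("measured", MEASURED_BANDWIDTH_GBS), ("expected", EXPECTED_BANDWIDTH_GBS)):
    print(f"{label:>8}: ridge point at {ridge_point(PEAK_FP64_TFLOPS, bw):.2f} FLOP/byte")

# Attainable FP64 performance for a kernel doing 1 FLOP per byte moved.
print(attainable_tflops(1.0, PEAK_FP64_TFLOPS, MEASURED_BANDWIDTH_GBS), "TFLOPs/s")
```

With the measured bandwidth the ridge point sits at roughly 41 FLOP/byte, against roughly 6.6 FLOP/byte with 208 GB/s, so the low bandwidth readings would shift most kernels into the memory-bound region of the plot.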
I was able to run it with PoCL using a singularity container. It now detects the CPU correctly but the results are still not great:
.-----------------------------------------------------------------------------.
|----------------.------------------------------------------------------------|
| Device ID 0 | pthread-AMD EPYC 7742 64-Core Processor |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | pthread-AMD EPYC 7742 64-Core Processor |
| Device Vendor | AuthenticAMD |
| Device Driver | 1.4 (Linux) |
| OpenCL Version | OpenCL C 1.2 pocl |
| Compute Units | 128 at 2245 MHz (64 cores, 4.598 TFLOPs/s) |
| Memory, Cache | 255437 MB, 16384 KB global / 8192 KB local |
| Buffer Limits | 65536 MB global, 8192 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| FP64 compute 0.104 TFLOPs/s (1/64) |
| FP32 compute 0.105 TFLOPs/s (1/64) |
| FP16 compute not supported |
| INT64 compute 0.199 TIOPs/s (1/24) |
| INT32 compute 0.217 TIOPs/s (1/24) |
| INT16 compute 0.444 TIOPs/s (1/12) |
| INT8 compute 0.741 TIOPs/s (1/8 ) |
| Memory Bandwidth ( coalesced read ) 16.99 GB/s |
| Memory Bandwidth ( coalesced write) 23.69 GB/s |
| Memory Bandwidth (misaligned read ) 91.84 GB/s |
| Memory Bandwidth (misaligned write) 49.60 GB/s |
| PCIe Bandwidth (send ) 16.17 GB/s |
| PCIe Bandwidth ( receive ) 13.19 GB/s |
| PCIe Bandwidth ( bidirectional) (Gen4 x16) 14.69 GB/s |
|-----------------------------------------------------------------------------|
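The "4.598 TFLOPs/s" in the Compute Units line of this report is a theoretical figure derived from the reported clock, not a measurement, which is also why the Intel-runtime reports show 0.000 TFLOPs/s next to 0 MHz. The sketch below reproduces the printed value. The two constants (cores = compute units / 2, and 32 FLOPs per core per cycle) are inferred from the printed output in this thread and are an assumption, not taken from the benchmark's source.

```python
def header_tflops(compute_units: int, clock_mhz: float,
                  flops_per_core_per_cycle: int = 32,   # assumption inferred from the output
                  compute_units_per_core: int = 2) -> float:  # assumption: 128 CUs shown as 64 cores
    """Theoretical peak as it appears in the 'Compute Units' line of the report."""
    cores = compute_units // compute_units_per_core
    return cores * flops_per_core_per_cycle * clock_mhz * 1e-6  # MFLOPs/s -> TFLOPs/s


print(f"{header_tflops(128, 2245):.3f} TFLOPs/s")  # PoCL report: clock is known
print(f"{header_tflops(128, 0):.3f} TFLOPs/s")     # Intel runtime report: clock reported as 0 MHz
```

The measured rows below the header (FP64, FP32, and so on) do not depend on this estimate.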
Hi @sumseq,
I've tested a 2x EPYC 9754 system today. The Intel CPU Runtime for OpenCL is way faster than PoCL on this system too.
Kind regards, Moritz
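To put a number on the gap for the two EPYC 7742 reports posted earlier in this thread, the compute rows can be parsed and divided. This is only a sketch: the rows are copied from the Intel-runtime and PoCL reports above, the regular expression is written for this benchmark's table layout as it appears here, and the two runs may come from differently configured nodes.

```python
import re

# Matches rows such as "| FP64 compute 1.370 TFLOPs/s (1/64) |".
ROW = re.compile(r"\|\s*((?:FP|INT)\d+)\s+compute\s+([0-9.]+)\s+T(?:FL|I)OPs/s")

INTEL_RUNTIME = """
| FP64 compute 1.370 TFLOPs/s (1/64) |
| FP32 compute 1.379 TFLOPs/s (1/64) |
| FP16 compute not supported |
| INT64 compute 0.101 TIOPs/s (1/64) |
| INT32 compute 1.541 TIOPs/s (1/64) |
| INT16 compute 2.892 TIOPs/s (1/64) |
| INT8 compute 2.848 TIOPs/s (1/64) |
"""

POCL = """
| FP64 compute 0.104 TFLOPs/s (1/64) |
| FP32 compute 0.105 TFLOPs/s (1/64) |
| FP16 compute not supported |
| INT64 compute 0.199 TIOPs/s (1/24) |
| INT32 compute 0.217 TIOPs/s (1/24) |
| INT16 compute 0.444 TIOPs/s (1/12) |
| INT8 compute 0.741 TIOPs/s (1/8 ) |
"""


def parse_compute(report: str) -> dict:
    """Map data type -> measured throughput; 'not supported' rows are skipped."""
    return {name: float(value) for name, value in ROW.findall(report)}


intel, pocl = parse_compute(INTEL_RUNTIME), parse_compute(POCL)
ratios = {name: intel[name] / pocl[name] for name in intel if name in pocl}

for name, ratio in ratios.items():
    print(f"{name:>5}: Intel runtime / PoCL = {ratio:.2f}x")
```

On these two reports the Intel runtime comes out about 13x ahead for FP64 and FP32, while INT64 is the one row where PoCL is faster.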
Results stitched together for two different versions of the Windows AMD runtime, using OpenCL-Benchmark v1.8:
.-----------------------------------------------------------------------------.
|----------------.------------------------------------------------------------|
| Device ID 0 | NVIDIA GeForce RTX 2060 |
| Device ID 1 | AMD Ryzen 5 3600 6-Core Processor |
| Device ID 2 | AMD Ryzen 5 3600 6-Core Processor |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 1 |
| Device Name | AMD Ryzen 5 3600 6-Core Processor |
| Device Vendor | Intel(R) Corporation |
| Device Driver | 2022.14.8.0.04_160000 (Windows) |
| OpenCL Version | OpenCL C 3.0 |
| Compute Units | 12 at 0 MHz (6 cores, 0.000 TFLOPs/s) |
| Memory, Cache | 24501 MB RAM, 512 KB global / 32 KB local |
| Buffer Limits | 24501 MB global, 128 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| FP64 compute 0.146 TFLOPs/s (1/64) |
| FP32 compute 0.146 TFLOPs/s (1/64) |
| FP16 compute not supported |
| INT64 compute 0.059 TIOPs/s (1/64) |
| INT32 compute 0.145 TIOPs/s (1/64) |
| INT16 compute 0.369 TIOPs/s (1/64) |
| INT8 compute 0.084 TIOPs/s (1/64) |
| Memory Bandwidth ( coalesced read ) 33.79 GB/s |
| Memory Bandwidth ( coalesced write) 15.37 GB/s |
| Memory Bandwidth (misaligned read ) 34.63 GB/s |
| Memory Bandwidth (misaligned write) 16.07 GB/s |
|-----------------------------------------------------------------------------|
|-----------------------------------------------------------------------------|
| Done. Press Enter to exit. |
'-----------------------------------------------------------------------------'
|----------------.------------------------------------------------------------|
| Device ID | 2 |
| Device Name | AMD Ryzen 5 3600 6-Core Processor |
| Device Vendor | Intel(R) Corporation |
| Device Driver | 2025.20.6.0.04_224945 (Windows) |
| OpenCL Version | OpenCL C 3.0 |
| Compute Units | 12 at 0 MHz (6 cores, 0.000 TFLOPs/s) |
| Memory, Cache | 24501 MB RAM, 512 KB global / 256 KB local |
| Buffer Limits | 24501 MB global, 128 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| FP64 compute 0.146 TFLOPs/s (1/64) |
| FP32 compute 0.144 TFLOPs/s (1/64) |
| FP16 compute 0.047 TFLOPs/s (1/64) |
| INT64 compute 0.059 TIOPs/s (1/64) |
| INT32 compute 0.162 TIOPs/s (1/64) |
| INT16 compute 0.422 TIOPs/s (1/64) |
| INT8 compute 0.022 TIOPs/s (1/64) |
| Memory Bandwidth ( coalesced read ) 1.75 GB/s |
| Memory Bandwidth ( coalesced write) 2.96 GB/s |
| Memory Bandwidth (misaligned read ) 129.34 GB/s |
| Memory Bandwidth (misaligned write) 18.58 GB/s |
|-----------------------------------------------------------------------------|
|-----------------------------------------------------------------------------|
| Done. Press Enter to exit. |
'-----------------------------------------------------------------------------'
The dp4a thing is a known issue. The memory bandwidth... no idea what happened there. I guess you can play around with the runtime environment variables documented on the download page to make sure the vectorizer is doing its job and using the right target arch, but I wouldn't expect too much improvement.
If you want to get the correct "vendor" and frequency my recommendation would be to mess with Intel's open source LLVM code and add a bit that uses the cpuid instruction to get the vendor and the max freq, like what pocl is probably doing.
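If patching the runtime is not an option, the vendor and clock can at least be read in user space and recorded next to the benchmark output. The sketch below is a different, simpler technique than the cpuid patch suggested above: it parses `/proc/cpuinfo` on Linux and does not change what the Intel runtime reports. The function name and the sample text are illustrative, and note that the `cpu MHz` field is the current frequency, not necessarily the maximum.

```python
def parse_cpuinfo(text: str):
    """Return (vendor_id, highest 'cpu MHz' value) from /proc/cpuinfo-style text."""
    vendor = None
    clocks_mhz = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between logical CPUs
        key, value = key.strip(), value.strip()
        if key == "vendor_id" and vendor is None:
            vendor = value
        elif key == "cpu MHz":
            clocks_mhz.append(float(value))
    return vendor, (max(clocks_mhz) if clocks_mhz else None)


# Illustrative sample only, built from values that appear in this thread.
SAMPLE = (
    "vendor_id\t: AuthenticAMD\n"
    "model name\t: AMD EPYC 7742 64-Core Processor\n"
    "cpu MHz\t\t: 2245.000\n"
)

try:
    with open("/proc/cpuinfo") as f:  # Linux only
        print(parse_cpuinfo(f.read()))
except OSError:
    print(parse_cpuinfo(SAMPLE))  # fall back to the sample on other systems
```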