Benchmark for single-socket EPYC 9755
Maybe it's interesting to you.
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 368904
NB : 456
PMAP : Row-major process mapping
P : 1
Q : 1
PFACT : Crout
NBMIN : 38
NDIV : 3
RFACT : Right
BCAST : IBcast
DEPTH : 0
SWAP : (211) Spread-roll
L1 : transposed form
A : no-transposed form (ColMajor)
EQUIL : no
ALIGN : 8 double precision words
MXSWP : (1/c) Collective
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V SWP N NB P Q Time Gflops
--------------------------------------------------------------------------------
WRC06R3C38c 211 368904 456 1 1 5777.05 5.7936e+03
HPL_pdgesv() start time Fri Sep 19 13:49:11 2025
HPL_pdgesv() end time Fri Sep 19 15:25:28 2025
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 4.89993517e-03 ...... PASSED
================================================================================
Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
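For anyone curious what the "scaled residual" line above actually measures, here's a minimal pure-Python sketch on a toy 2×2 system (the `eps` value is taken from the log; HPL does the same computation with the full N×N system):

```python
# Sketch of HPL's scaled residual check: ||Ax-b||_oo / (eps * (||x||_oo * ||A||_oo + ||b||_oo) * N)
eps = 1.110223e-16  # relative machine precision, as printed in the log

def inf_norm_vec(v):
    # Infinity norm of a vector: largest absolute entry.
    return max(abs(e) for e in v)

def inf_norm_mat(A):
    # Infinity norm of a matrix: largest absolute row sum.
    return max(sum(abs(e) for e in row) for row in A)

def scaled_residual(A, x, b):
    n = len(b)
    # Residual vector r = Ax - b
    r = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
    denom = eps * (inf_norm_vec(x) * inf_norm_mat(A) + inf_norm_vec(b)) * n
    return inf_norm_vec(r) / denom

# Toy system: A = [[2, 1], [1, 3]], b = [3, 4], exact solution x = [1, 1]
A = [[2.0, 1.0], [1.0, 3.0]]
b = [3.0, 4.0]
x = [1.0, 1.0]
print(scaled_residual(A, x, b))  # exact solution -> 0.0, well under the 16.0 threshold
```

The tiny denominator (it scales with `eps`) is why residuals like 4.9e-03 count as a pass: anything under 16.0 means the error is within a modest multiple of what floating-point rounding alone would produce.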
The measured power draw from the wall is pretty stable at 800W
Wow, 5.8 Tflops! A few questions:
- Did you run this from this repository's playbook, or with your own tuned installation?
- Have you tried any other P/Q configurations? It seems like different NUMA layouts might get you better (or worse) performance?
- Is this running in a cloud environment, or on a system in your facility (e.g., a Gigabyte or similar rackmount server)?
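On the P/Q question, the usual HPL heuristic is a near-square grid with P ≤ Q. A quick way to enumerate the candidates for a given rank count (128 here is just an assumption matching the core count, not something from the log above, which ran with P=Q=1):

```python
# Enumerate candidate HPL process grids (P x Q) for a given MPI rank count.
# Conventional wisdom: prefer grids as close to square as possible, with P <= Q.
def candidate_grids(ranks):
    return [(p, ranks // p) for p in range(1, int(ranks ** 0.5) + 1)
            if ranks % p == 0]

print(candidate_grids(128))  # [(1, 128), (2, 64), (4, 32), (8, 16)]
```

For 128 ranks, (8, 16) is the squarest option, which is presumably why it shows up in tuned configs for this core count.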
I ran it from a tuned installation, the one AMD provides: https://www.amd.com/en/developer/zen-software-studio/applications/pre-built-applications.html
I'll also try to do it with yours!
That'd be awesome! I haven't had much time to test on AMD CPUs, so I haven't looked into optimizing the setup for them.
To answer your other questions: it's running locally (the power measurement is from my UPS), and I haven't tried other P or Q values or NUMA configurations, just the default NPS=1.
Big fan of your videos, btw! :)
Thanks! That's a beast of a machine; fun to have local access to it. It'll be interesting to see how poorly the defaults in this repo do; I'm pretty sure AMD tunes a bit for AVX-512 and the other extensions enabled on that chip. Ampere also spent some time tweaking HPL, and I'd like to figure out how to get some of the vendor-specific tweaks documented better.
(Adding planned label as I'd like to add it to the result data later.)
Have you tried benchmarking on Graviton 4, btw? I'd expect its SVE implementation to be very good
OK, I tried this repository's setup, using BLIS with the skx configuration (there's no zen4 or zen5 configuration; without one, the build falls back to a generic code path), and with P=8, Q=16:
================================================================================
HPLinpack 2.3 -- High-Performance Linpack benchmark -- December 2, 2018
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 368904
NB : 456
PMAP : Row-major process mapping
P : 8
Q : 16
PFACT : Crout
NBMIN : 38
NDIV : 3
RFACT : Right
BCAST : 1ringM
DEPTH : 0
SWAP : Spread-roll (long)
L1 : transposed form
U : transposed form
EQUIL : no
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR01R3C38 368904 456 8 16 10937.40 3.0601e+03
HPL_pdgesv() start time Fri Sep 19 20:26:49 2025
HPL_pdgesv() end time Fri Sep 19 23:29:06 2025
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 1.61175442e-03 ...... PASSED
================================================================================
Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
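For anyone wanting to reproduce this run: the logged parameters map onto an HPL.dat roughly like the sketch below. The code numbers follow the standard HPL input-file conventions (e.g. Crout=1 and Right=2 for PFACT/RFACT, 1rM=1 for BCAST, long=1 for SWAP); the output-file, device, and swapping-threshold lines aren't in the log, so those are defaults/assumptions — check them against your local HPL.dat.

```
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)  -- assumed, not in log
6            device out (6=stdout,7=stderr,file)  -- assumed
1            # of problems sizes (N)
368904       Ns
1            # of NBs
456          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
8            Ps
16           Qs
16.0         threshold
1            # of panel fact
1            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
38           NBMINs (>= 1)
1            # of panels in recursion
3            NDIVs
1            # of recursive panel fact.
2            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
1            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold  -- default; unused with SWAP=long
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
0            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
```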
So almost twice as slow :( I examined the AMD-provided sample more closely and it looks to be using AMD's fork of blis, spending the vast majority of its time in https://github.com/amd/blis/blob/master/kernels/zen4/3/bli_dgemm_avx512_asm_8x24.c . I'm guessing that's the biggest cause of the performance difference, but I'm not really sure.
Interesting, I wonder if their AVX-512 optimizations could land in upstream BLIS? That seems to be the same issue I had with Ampere: they made some tweaks to help their NEON implementation speed up the runs, but it was all in an Oracle fork of OpenBLAS (IIRC)... and a bit hard to decipher exactly what was changed.
I'm tempted to put the number from this repo's setup as the 'official' result in the README, but I used the optimized run for Ampere's servers, so I guess we could just use the optimized result for AMD too.
(Could also be useful to modify the Ansible playbooks to allow for easier switching to the vendor-supplied libraries, like for AMD's BLIS fork in this case).
Yeah, I'm a little confused. My understanding is that on a single node, HPL figures should be close to the theoretical maximum FLOPS, since it's mostly just spamming FMAs in level-3 BLAS rather than stressing the memory subsystem. That said, even AMD's tuned implementation is well short of the theoretical maximum. For your Apple benchmarks, did you try using their Accelerate framework?
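As a rough sanity check on "well short of the theoretical maximum", here's the back-of-envelope math. The per-core throughput and the sustained clock are assumptions (Zen 5 with two 512-bit FMA pipes, and a guessed all-core frequency under AVX-512 load), not values from the log:

```python
# Back-of-envelope peak DP FLOPS for a 128-core Zen 5 part vs. the tuned run above.
cores = 128
# Assumed: 2 FMA pipes/core x 512-bit vectors = 8 doubles x 2 (FMA) x 2 pipes
flop_per_cycle = 32
ghz = 2.7          # guessed sustained all-core clock under load; actual value unknown
peak = cores * flop_per_cycle * ghz * 1e9

measured = 5.7936e12  # 5.7936e+03 Gflops from the AMD-tuned run
print(f"peak ~{peak / 1e12:.1f} TFLOPS, efficiency ~{measured / peak:.0%}")
```

Under those assumptions the tuned run lands at roughly half of peak, which would be unusually low for single-node HPL (well-tuned runs are often 70–90% of peak), so either the sustained clock guess is off or there's still headroom in the setup.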
@anematode - No, for Macs, I am running in a VM, so missing some optimizations. See https://github.com/geerlingguy/top500-benchmark/issues/58