Benchmark for single-socket EPYC 9755
Maybe it's interesting to you.
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 368904
NB : 456
PMAP : Row-major process mapping
P : 1
Q : 1
PFACT : Crout
NBMIN : 38
NDIV : 3
RFACT : Right
BCAST : IBcast
DEPTH : 0
SWAP : (211) Spread-roll
L1 : transposed form
A : no-transposed form (ColMajor)
EQUIL : no
ALIGN : 8 double precision words
MXSWP : (1/c) Collective
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V SWP N NB P Q Time Gflops
--------------------------------------------------------------------------------
WRC06R3C38c 211 368904 456 1 1 5777.05 5.7936e+03
HPL_pdgesv() start time Fri Sep 19 13:49:11 2025
HPL_pdgesv() end time Fri Sep 19 15:25:28 2025
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 4.89993517e-03 ...... PASSED
================================================================================
Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
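For anyone curious what the "scaled residual" line above actually measures, here's a minimal pure-Python sketch on a toy 2×2 system (the `eps` value is taken from the log; HPL does the same computation with the full N×N system):

```python
# Sketch of HPL's scaled residual check: ||Ax-b||_oo / (eps * (||x||_oo * ||A||_oo + ||b||_oo) * N)
eps = 1.110223e-16  # relative machine precision, as printed in the log

def inf_norm_vec(v):
    # Infinity norm of a vector: largest absolute entry.
    return max(abs(e) for e in v)

def inf_norm_mat(A):
    # Infinity norm of a matrix: largest absolute row sum.
    return max(sum(abs(e) for e in row) for row in A)

def scaled_residual(A, x, b):
    n = len(b)
    # Residual vector r = Ax - b
    r = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
    denom = eps * (inf_norm_vec(x) * inf_norm_mat(A) + inf_norm_vec(b)) * n
    return inf_norm_vec(r) / denom

# Toy system: A = [[2, 1], [1, 3]], b = [3, 4], exact solution x = [1, 1]
A = [[2.0, 1.0], [1.0, 3.0]]
b = [3.0, 4.0]
x = [1.0, 1.0]
print(scaled_residual(A, x, b))  # exact solution -> 0.0, well under the 16.0 threshold
```

The tiny denominator (it scales with `eps`) is why residuals like 4.9e-03 count as a pass: anything under 16.0 means the error is within a modest multiple of what floating-point rounding alone would produce.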
The measured power draw from the wall is pretty stable at 800W
Wow, 5.8 Tflops! A few questions:
- Did you run this from this repository's playbook, or with your own tuned installation?
- Have you tried any other P/Q configurations? It seems like different NUMA layouts might get you better (or worse) performance?
- Is this running in a cloud environment, or on a system in your facility (e.g., a Gigabyte or similar rackmount server)?
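On the P/Q question, the usual HPL heuristic is a near-square grid with P ≤ Q. A quick way to enumerate the candidates for a given rank count (128 here is just an assumption matching the core count, not something from the log above, which ran with P=Q=1):

```python
# Enumerate candidate HPL process grids (P x Q) for a given MPI rank count.
# Conventional wisdom: prefer grids as close to square as possible, with P <= Q.
def candidate_grids(ranks):
    return [(p, ranks // p) for p in range(1, int(ranks ** 0.5) + 1)
            if ranks % p == 0]

print(candidate_grids(128))  # [(1, 128), (2, 64), (4, 32), (8, 16)]
```

For 128 ranks, (8, 16) is the squarest option, which is presumably why it shows up in tuned configs for this core count.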
I ran it from a tuned installation, the one AMD provides: https://www.amd.com/en/developer/zen-software-studio/applications/pre-built-applications.html
I'll also try to do it with yours!
That'd be awesome! I haven't had much time to test on AMD CPUs, so I haven't looked into optimizing the setup for them.
To answer your other questions: it's running locally (the power measurement is from my UPS), and I haven't tried other P or Q values or NUMA configurations, just the default NPS=1.
Big fan of your videos, btw! :)
Thanks! That's a beast of a machine; fun to have local access to it. It'll be interesting to see how poorly the defaults in this repo do; I'm pretty sure AMD tunes a bit for AVX-512 and the other extensions enabled on that chip. Ampere also spent some time tweaking HPL, and I'd like to figure out how to get some of the vendor-specific tweaks documented better.
(Adding planned label as I'd like to add it to the result data later.)
Have you tried benchmarking on Graviton 4, btw? I'd expect its SVE implementation to be very good
OK, I tried this repository's setup, using BLIS with the skx configuration (there's no zen4 or zen5 configuration; without one, the build falls back to a generic code path), and with P=8, Q=16:
================================================================================
HPLinpack 2.3 -- High-Performance Linpack benchmark -- December 2, 2018
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 368904
NB : 456
PMAP : Row-major process mapping
P : 8
Q : 16
PFACT : Crout
NBMIN : 38
NDIV : 3
RFACT : Right
BCAST : 1ringM
DEPTH : 0
SWAP : Spread-roll (long)
L1 : transposed form
U : transposed form
EQUIL : no
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR01R3C38 368904 456 8 16 10937.40 3.0601e+03
HPL_pdgesv() start time Fri Sep 19 20:26:49 2025
HPL_pdgesv() end time Fri Sep 19 23:29:06 2025
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 1.61175442e-03 ...... PASSED
================================================================================
Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
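For anyone wanting to reproduce this run: the logged parameters map onto an HPL.dat roughly like the sketch below. The code numbers follow the standard HPL input-file conventions (e.g. Crout=1 and Right=2 for PFACT/RFACT, 1rM=1 for BCAST, long=1 for SWAP); the output-file, device, and swapping-threshold lines aren't in the log, so those are defaults/assumptions — check them against your local HPL.dat.

```
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)  -- assumed, not in log
6            device out (6=stdout,7=stderr,file)  -- assumed
1            # of problems sizes (N)
368904       Ns
1            # of NBs
456          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
8            Ps
16           Qs
16.0         threshold
1            # of panel fact
1            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
38           NBMINs (>= 1)
1            # of panels in recursion
3            NDIVs
1            # of recursive panel fact.
2            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
1            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold  -- default; unused with SWAP=long
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
0            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
```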
So almost twice as slow :( I examined the AMD-provided sample more closely and it looks to be using AMD's fork of blis, spending the vast majority of its time in https://github.com/amd/blis/blob/master/kernels/zen4/3/bli_dgemm_avx512_asm_8x24.c . I'm guessing that's the biggest cause of the performance difference, but I'm not really sure.
Interesting, I wonder if their AVX-512 optimizations could land in upstream BLIS? That seems to be the same issue I had with Ampere: they made some tweaks to help their NEON implementation speed up the runs, but it was all in an Oracle fork of OpenBLAS (IIRC)... and a bit hard to decipher exactly what was changed.
I'm tempted to put the number from this repo's setup as the 'official' result in the README, but I used the optimized run for Ampere's servers, so I guess we could just use the optimized result for AMD too.
(Could also be useful to modify the Ansible playbooks to allow for easier switching to the vendor-supplied libraries, like for AMD's BLIS fork in this case).
Yeah, I'm a little confused. My understanding is that on a single node, HPL figures should be close to the theoretical maximum FLOPS, since it's mostly just spamming FMAs in level-3 BLAS rather than stressing the memory subsystem. That said, even AMD's tuned implementation is well short of the theoretical maximum. For your Apple benchmarks, did you try using their Accelerate framework?
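As a rough sanity check on "well short of the theoretical maximum", here's the back-of-envelope math. The per-core throughput and the sustained clock are assumptions (Zen 5 with two 512-bit FMA pipes, and a guessed all-core frequency under AVX-512 load), not values from the log:

```python
# Back-of-envelope peak DP FLOPS for a 128-core Zen 5 part vs. the tuned run above.
cores = 128
# Assumed: 2 FMA pipes/core x 512-bit vectors = 8 doubles x 2 (FMA) x 2 pipes
flop_per_cycle = 32
ghz = 2.7          # guessed sustained all-core clock under load; actual value unknown
peak = cores * flop_per_cycle * ghz * 1e9

measured = 5.7936e12  # 5.7936e+03 Gflops from the AMD-tuned run
print(f"peak ~{peak / 1e12:.1f} TFLOPS, efficiency ~{measured / peak:.0%}")
```

Under those assumptions the tuned run lands at roughly half of peak, which would be unusually low for single-node HPL (well-tuned runs are often 70–90% of peak), so either the sustained clock guess is off or there's still headroom in the setup.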
@anematode - No, for Macs, I am running in a VM, so missing some optimizations. See https://github.com/geerlingguy/top500-benchmark/issues/58