Benchmark example using Intel MKL (for history)
(This issue is kept for history and for potential future Laser improvements, especially AVX-512 and dual-port AVX-512.)
After hours of chatting with @mratsim to benchmark Laser on a 72-thread machine and to get a working MKL setup, here is an example benchmark using Intel MKL. We assume multiple MKL installations are present, and we use the specific version stored in /opt/intel/compilers_and_libraries_2019.0.117.

We also assume you do not have any Nim installation; if you do, you know which lines to skip. Change the number of threads (OMP_NUM_THREADS) right at the beginning if needed. We are using commit 990e59f:
```sh
source /opt/intel/mkl/bin/mklvars.sh intel64
export OMP_NUM_THREADS=1
curl https://nim-lang.org/choosenim/init.sh -sSf | sh
git clone --recursive git://github.com/numforge/laser
cd laser
git checkout 990e59f
git submodule init
git submodule update
```
Before compiling, change https://github.com/numforge/laser/blob/990e59fffe50779cdef33aa0b8f22da19e1eb328/benchmarks/blas.nim#L5 to the following (adjust the MKL folders if needed):
```nim
const blas = "libmkl_intel_ilp64.so"
{.passC: "-I'/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/include' -L'/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin'".}
```
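For context, here is a minimal, hypothetical sketch (not Laser's actual binding code) of how a Nim proc resolves a symbol from the MKL library named by that const. With the ILP64 interface, BLAS integers are 64-bit, hence the `int64` parameters:

```nim
# Hypothetical sketch, not Laser's actual code: binding one CBLAS symbol
# from the MKL shared library named by the `blas` const above.
const blas = "libmkl_intel_ilp64.so"

proc cblas_sdot(n: int64; x: ptr float32; incx: int64;
                y: ptr float32; incy: int64): float32
  {.importc: "cblas_sdot", dynlib: blas.}

var
  a = [1'f32, 2, 3]
  b = [4'f32, 5, 6]
echo cblas_sdot(3, a[0].addr, 1, b[0].addr, 1)  # prints 32.0
```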
Then change https://github.com/numforge/laser/blob/990e59fffe50779cdef33aa0b8f22da19e1eb328/benchmarks/gemm/gemm_bench_float32.nim#L53-L55 to:
```nim
M = 2304
K = 2304
N = 2304
```
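As a sanity check, a GEMM performs 2·M·N·K FLOPs (one multiply and one add per inner-product term); this is the "Required number of operations" figure the benchmark prints. A quick check in Nim:

```nim
# FLOP count for C = A*B: 2*M*N*K (one mul + one add per term).
const
  M = 2304
  K = 2304
  N = 2304
echo 2 * M * N * K  # 24_461_180_928 FLOPs, i.e. ~24.5 GFLOP per GEMM
```

For the newer run below with M = K = N = 1920, the same formula gives 14,155,776,000 FLOPs, matching the benchmark output.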
Tune the following to your liking; here I used my dual Xeon Gold 6154 and 100 repeated computations:
```nim
NbSamples = 100   # This might stress the allocator when packing if the matrices are big
CpuGhz = 3.7      # Assuming no turbo
NumCpuCores = 36
CpuFlopCycle = 32 # AVX2: 2 FMA/cycle = 2x8x2, i.e. 2 FMAs x 8 floats x (1 add + 1 mul)
```
For the CpuFlopCycle value, check which instructions the micro-kernel actually implements:
https://github.com/numforge/laser/blob/990e59fffe50779cdef33aa0b8f22da19e1eb328/laser/primitives/matrix_multiplication/gemm_ukernel_avx_fma.nim#L10-L23
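These constants only feed the theoretical-peak figures the benchmark reports; a minimal sketch of the derivation, using the AVX2 values above:

```nim
# Theoretical peak derived from the constants above (AVX2 figures).
const
  CpuGhz = 3.7
  NumCpuCores = 36
  CpuFlopCycle = 32
let
  peakSingle = CpuGhz * CpuFlopCycle.float  # 118.4 GFLOP/s per core
  peakMulti  = peakSingle * NumCpuCores.float
echo peakMulti  # ~4262.4 GFLOP/s for the whole machine with AVX2
```

With dual-port AVX-512 the figure would be 64 FLOP/cycle instead of 32; the newer run below reports 179.2 GFLOP/s single-core, consistent with 64 FLOP/cycle at a 2.8 GHz AVX-512 clock.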
Also, tune https://github.com/numforge/laser/blob/990e59fffe50779cdef33aa0b8f22da19e1eb328/laser/primitives/matrix_multiplication/gemm_tiling.nim#L234-L235 to your preference (again tuned here for my dual Xeon Gold 6154):
```nim
result.mc = min(768 div T.sizeof, M)
result.kc = min(4096 div T.sizeof, K)
```
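These sizes bound the packed panels; a rough sanity check, assuming as a rule of thumb (not Laser's exact heuristic) that the packed mc×kc panel of A should fit in Skylake-SP's 1 MiB per-core L2 cache:

```nim
# Rough sanity check (rule-of-thumb assumption, not Laser's exact heuristic):
# the packed mc*kc panel of A should fit in the 1 MiB per-core L2 cache.
type T = float32
const
  mc = 768 div sizeof(T)   # 192 rows
  kc = 4096 div sizeof(T)  # 1024 columns
echo mc * kc * sizeof(T) div 1024, " KiB"  # 768 KiB, under the 1024 KiB L2
```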
And now you can compile with MKL (change the MKL folders if needed):
```sh
mkdir build
LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin nim cpp --dynlibOverride:libmkl_intel_ilp64 --passL:"/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin/libmkl_intel_ilp64.a -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl" --passC:"-D_GNU_SOURCE -L$MKLROOT/lib/intel64_lin -DMKL_ILP64 -m64" -r -d:release -d:openmp -o:build/bench_gemm benchmarks/gemm/gemm_bench_float32.nim
```
On a dual Xeon Gold 6154 setup (36 physical cores / 72 logical threads, 3.7 GHz all-core turbo), you should get the following:
| Tool | Performance |
|---|---|
| Intel MKL | 4 TFLOPS |
| Laser | 600 GFLOPS |
| PyTorch Glow | 60 GFLOPS |
As you can see, MKL nearly reaches the machine's theoretical peak performance: 3.7 GHz × 36 cores × 32 FLOP/cycle ≈ 4.26 TFLOPS with AVX2.

Newer results (commit dbfb31d):
```sh
cd Downloads/Nim
rm -rf laser
source /opt/intel/mkl/bin/mklvars.sh intel64
export OMP_NUM_THREADS=1
git clone --recursive git://github.com/numforge/laser
cd laser
git checkout dbfb31d
git submodule init
git submodule update
# Apply the same edits as above to these three files
# (blas.nim moved to benchmarks/third_party in this commit):
gedit benchmarks/third_party/blas.nim
gedit benchmarks/gemm/gemm_bench_float32.nim
gedit laser/primitives/matrix_multiplication/gemm_tiling.nim
export OMP_NUM_THREADS=72
rm -rf build
mkdir build
LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin nim cpp --dynlibOverride:libmkl_intel_ilp64 --passL:"/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin/libmkl_intel_ilp64.a -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl" --passC:"-D_GNU_SOURCE -L$MKLROOT/lib/intel64_lin -DMKL_ILP64 -m64" -r -d:release -d:openmp -o:build/bench_gemm benchmarks/gemm/gemm_bench_float32.nim
```
Results (the "OpenBLAS" benchmark below is actually running MKL, because of --dynlibOverride):
```
Hint: /home/laurae/Downloads/Nim/laser/build/bench_gemm [Exec]

A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes: 29.491 MB
Arithmetic intensity: 480.000 FLOP/byte
Theoretical peak single-core: 179.200 GFLOP/s
Theoretical peak multi: 6451.200 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.

OpenBLAS benchmark
Collected 10000 samples in 52.753 seconds
Average time: 3.492 ms
Stddev time: 0.736 ms
Min time: 3.162 ms
Max time: 31.532 ms
Perf: 4053.552 GFLOP/s

Laser production implementation
Collected 10000 samples in 131.145 seconds
Average time: 11.152 ms
Stddev time: 8.307 ms
Min time: 7.360 ms
Max time: 132.827 ms
Perf: 1269.368 GFLOP/s

PyTorch Glow: libjit matmul implementation (with AVX+FMA)
Collected 10000 samples in 2277.353 seconds
Average time: 227.735 ms
Stddev time: 4.743 ms
Min time: 224.159 ms
Max time: 249.707 ms
Perf: 62.159 GFLOP/s

MKL-DNN reference GEMM benchmark
Collected 10000 samples in 268.515 seconds
Average time: 24.684 ms
Stddev time: 6.915 ms
Min time: 21.277 ms
Max time: 86.476 ms
Perf: 573.477 GFLOP/s

MKL-DNN JIT AVX benchmark
Collected 10000 samples in 89.331 seconds
Average time: 6.755 ms
Stddev time: 4.773 ms
Min time: 5.215 ms
Max time: 77.110 ms
Perf: 2095.728 GFLOP/s

MKL-DNN JIT AVX512 benchmark
Collected 10000 samples in 61.314 seconds
Average time: 4.260 ms
Stddev time: 4.065 ms
Min time: 3.071 ms
Max time: 60.712 ms
Perf: 3322.757 GFLOP/s

Mean Relative Error compared to vendor BLAS: 4.056792022311129e-06
```
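The last line compares Laser's output against the vendor BLAS result. A minimal sketch of a mean-relative-error check (an illustrative definition, not necessarily the benchmark's exact formula):

```nim
# Illustrative mean relative error between a result and a reference matrix,
# both flattened; not necessarily the benchmark's exact formula.
proc meanRelativeError(res, reference: openArray[float32]): float64 =
  assert res.len == reference.len
  for i in 0 ..< res.len:
    result += abs(res[i] - reference[i]).float64 /
              max(abs(reference[i]).float64, 1e-20)  # avoid division by zero
  result /= res.len.float64
```

An error around 4e-6 indicates agreement at roughly float32 precision over 1920-term dot products.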