Benchmark example using Intel MKL (for history)
(This issue is kept for history and for potential future Laser improvements, especially AVX-512 and dual-port AVX-512.)
After hours of chatting with @mratsim to benchmark Laser on a 72-thread machine and to get a working MKL setup, here is an example benchmark using Intel MKL. We assume multiple MKL installations are present, and we use the specific version stored in /opt/intel/compilers_and_libraries_2019.0.117.

We also assume you do not have any Nim installation; if you do, you know which lines to skip. Change the number of threads (OMP_NUM_THREADS) right at the beginning if needed. We are using commit 990e59f:
```sh
source /opt/intel/mkl/bin/mklvars.sh intel64
export OMP_NUM_THREADS=1
curl https://nim-lang.org/choosenim/init.sh -sSf | sh
git clone --recursive git://github.com/numforge/laser
cd laser
git checkout 990e59f
git submodule init
git submodule update
```
Before compiling, change https://github.com/numforge/laser/blob/990e59fffe50779cdef33aa0b8f22da19e1eb328/benchmarks/blas.nim#L5 to the following (adjust the MKL folders if needed):
```nim
const blas = "libmkl_intel_ilp64.so"
{.passC: "-I'/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/include' -L'/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin'".}
```
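For context, here is a minimal, hypothetical sketch (not Laser's actual binding code) of how a Nim proc resolves a symbol from the MKL library named by that const. With the ILP64 interface, BLAS integers are 64-bit, hence the `int64` parameters:

```nim
# Hypothetical sketch, not Laser's actual code: binding one CBLAS symbol
# from the MKL shared library named by the `blas` const above.
const blas = "libmkl_intel_ilp64.so"

proc cblas_sdot(n: int64; x: ptr float32; incx: int64;
                y: ptr float32; incy: int64): float32
  {.importc: "cblas_sdot", dynlib: blas.}

var
  a = [1'f32, 2, 3]
  b = [4'f32, 5, 6]
echo cblas_sdot(3, a[0].addr, 1, b[0].addr, 1)  # prints 32.0
```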
Then change https://github.com/numforge/laser/blob/990e59fffe50779cdef33aa0b8f22da19e1eb328/benchmarks/gemm/gemm_bench_float32.nim#L53-L55 to:
```nim
M = 2304
K = 2304
N = 2304
```
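As a sanity check, a GEMM performs 2·M·N·K FLOPs (one multiply and one add per inner-product term); this is the "Required number of operations" figure the benchmark prints. A quick check in Nim:

```nim
# FLOP count for C = A*B: 2*M*N*K (one mul + one add per term).
const
  M = 2304
  K = 2304
  N = 2304
echo 2 * M * N * K  # 24_461_180_928 FLOPs, i.e. ~24.5 GFLOP per GEMM
```

For the newer run below with M = K = N = 1920, the same formula gives 14,155,776,000 FLOPs, matching the benchmark output.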
Tune the following to your liking; here I used my dual Xeon Gold 6154 and 100 repeated computations:
```nim
NbSamples = 100   # This might stress the allocator when packing if the matrices are big
CpuGhz = 3.7      # Assuming no turbo
NumCpuCores = 36
CpuFlopCycle = 32 # AVX2: 2 FMA/cycle = 2x8x2, i.e. 2 FMAs x 8 floats x (1 add + 1 mul)
```
For the CpuFlopCycle value, check which instructions the micro-kernel actually implements:
https://github.com/numforge/laser/blob/990e59fffe50779cdef33aa0b8f22da19e1eb328/laser/primitives/matrix_multiplication/gemm_ukernel_avx_fma.nim#L10-L23
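These constants only feed the theoretical-peak figures the benchmark reports; a minimal sketch of the derivation, using the AVX2 values above:

```nim
# Theoretical peak derived from the constants above (AVX2 figures).
const
  CpuGhz = 3.7
  NumCpuCores = 36
  CpuFlopCycle = 32
let
  peakSingle = CpuGhz * CpuFlopCycle.float  # 118.4 GFLOP/s per core
  peakMulti  = peakSingle * NumCpuCores.float
echo peakMulti  # ~4262.4 GFLOP/s for the whole machine with AVX2
```

With dual-port AVX-512 the figure would be 64 FLOP/cycle instead of 32; the newer run below reports 179.2 GFLOP/s single-core, consistent with 64 FLOP/cycle at a 2.8 GHz AVX-512 clock.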
Also, tune https://github.com/numforge/laser/blob/990e59fffe50779cdef33aa0b8f22da19e1eb328/laser/primitives/matrix_multiplication/gemm_tiling.nim#L234-L235 to your preference (again tuned here for my dual Xeon Gold 6154):
```nim
result.mc = min(768 div T.sizeof, M)
result.kc = min(4096 div T.sizeof, K)
```
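These sizes bound the packed panels; a rough sanity check, assuming as a rule of thumb (not Laser's exact heuristic) that the packed mc×kc panel of A should fit in Skylake-SP's 1 MiB per-core L2 cache:

```nim
# Rough sanity check (rule-of-thumb assumption, not Laser's exact heuristic):
# the packed mc*kc panel of A should fit in the 1 MiB per-core L2 cache.
type T = float32
const
  mc = 768 div sizeof(T)   # 192 rows
  kc = 4096 div sizeof(T)  # 1024 columns
echo mc * kc * sizeof(T) div 1024, " KiB"  # 768 KiB, under the 1024 KiB L2
```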
And now you can compile with MKL (change the MKL folders if needed):
```sh
mkdir build
LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin nim cpp --dynlibOverride:libmkl_intel_ilp64 --passL:"/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin/libmkl_intel_ilp64.a -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl" --passC:"-D_GNU_SOURCE -L$MKLROOT/lib/intel64_lin -DMKL_ILP64 -m64" -r -d:release -d:openmp -o:build/bench_gemm benchmarks/gemm/gemm_bench_float32.nim
```
On a dual Xeon Gold 6154 setup (36 physical cores / 72 logical threads, 3.7 GHz all-core turbo), you should get the following:
| Tool | Performance |
|---|---|
| Intel MKL | 4 TFLOPS |
| Laser | 600 GFLOPS |
| PyTorch Glow | 60 GFLOPS |
As you can see, MKL nearly reaches the machine's theoretical peak performance: 3.7 GHz × 36 cores × 32 FLOP/cycle ≈ 4.26 TFLOPS with AVX2.

Newer results (commit dbfb31d):
```sh
cd Downloads/Nim
rm -rf laser
source /opt/intel/mkl/bin/mklvars.sh intel64
export OMP_NUM_THREADS=1
git clone --recursive git://github.com/numforge/laser
cd laser
git checkout dbfb31d
git submodule init
git submodule update
# Apply the same edits as above to these three files
# (blas.nim moved to benchmarks/third_party in this commit):
gedit benchmarks/third_party/blas.nim
gedit benchmarks/gemm/gemm_bench_float32.nim
gedit laser/primitives/matrix_multiplication/gemm_tiling.nim
export OMP_NUM_THREADS=72
rm -rf build
mkdir build
LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin nim cpp --dynlibOverride:libmkl_intel_ilp64 --passL:"/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin/libmkl_intel_ilp64.a -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl" --passC:"-D_GNU_SOURCE -L$MKLROOT/lib/intel64_lin -DMKL_ILP64 -m64" -r -d:release -d:openmp -o:build/bench_gemm benchmarks/gemm/gemm_bench_float32.nim
```
Results (the "OpenBLAS" benchmark below is actually running MKL, because of --dynlibOverride):
```
Hint: /home/laurae/Downloads/Nim/laser/build/bench_gemm [Exec]

A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes: 29.491 MB
Arithmetic intensity: 480.000 FLOP/byte
Theoretical peak single-core: 179.200 GFLOP/s
Theoretical peak multi: 6451.200 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.

OpenBLAS benchmark
Collected 10000 samples in 52.753 seconds
Average time: 3.492 ms
Stddev time: 0.736 ms
Min time: 3.162 ms
Max time: 31.532 ms
Perf: 4053.552 GFLOP/s

Laser production implementation
Collected 10000 samples in 131.145 seconds
Average time: 11.152 ms
Stddev time: 8.307 ms
Min time: 7.360 ms
Max time: 132.827 ms
Perf: 1269.368 GFLOP/s

PyTorch Glow: libjit matmul implementation (with AVX+FMA)
Collected 10000 samples in 2277.353 seconds
Average time: 227.735 ms
Stddev time: 4.743 ms
Min time: 224.159 ms
Max time: 249.707 ms
Perf: 62.159 GFLOP/s

MKL-DNN reference GEMM benchmark
Collected 10000 samples in 268.515 seconds
Average time: 24.684 ms
Stddev time: 6.915 ms
Min time: 21.277 ms
Max time: 86.476 ms
Perf: 573.477 GFLOP/s

MKL-DNN JIT AVX benchmark
Collected 10000 samples in 89.331 seconds
Average time: 6.755 ms
Stddev time: 4.773 ms
Min time: 5.215 ms
Max time: 77.110 ms
Perf: 2095.728 GFLOP/s

MKL-DNN JIT AVX512 benchmark
Collected 10000 samples in 61.314 seconds
Average time: 4.260 ms
Stddev time: 4.065 ms
Min time: 3.071 ms
Max time: 60.712 ms
Perf: 3322.757 GFLOP/s

Mean Relative Error compared to vendor BLAS: 4.056792022311129e-06
```
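The last line compares Laser's output against the vendor BLAS result. A minimal sketch of a mean-relative-error check (an illustrative definition, not necessarily the benchmark's exact formula):

```nim
# Illustrative mean relative error between a result and a reference matrix,
# both flattened; not necessarily the benchmark's exact formula.
proc meanRelativeError(res, reference: openArray[float32]): float64 =
  assert res.len == reference.len
  for i in 0 ..< res.len:
    result += abs(res[i] - reference[i]).float64 /
              max(abs(reference[i]).float64, 1e-20)  # avoid division by zero
  result /= res.len.float64
```

An error around 4e-6 indicates agreement at roughly float32 precision over 1920-term dot products.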