kokkos-kernels

Low performance of SpMV on AMD Epyc 7402

Open pmpakos opened this issue 2 years ago • 2 comments

Hello, I have compiled the 'kokkos' project for the AMD ZEN2 architecture (as reported here) (-DKokkos_ARCH_ZEN2=On) with OpenMP enabled (-DKokkos_ENABLE_OPENMP=On). After building successfully, I also compiled the 'kokkos-kernels' project with the MKL third-party library enabled (-DKokkosKernels_ENABLE_TPL_MKL=On).

However, when running the spmv test (./perf_test/sparse/sparse_spmv) linked here, I get lower performance than expected. (The number of iterations is set to 128, and the examined matrix is scircuit.)

For example, when running the omp-static test, the reported average performance is 1.836 GFLOPs (for 24 threads), while a handwritten naive-csr OpenMP implementation achieves 10.5 GFLOPs! Additionally, when MKL is examined, the average performance through kokkos-spmv is reported as 1.76 GFLOPs, while an equivalent handwritten MKL call achieves 31 GFLOPs.
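
For context, the handwritten kernel is not shown in this thread. A naive CSR SpMV with OpenMP usually looks something like the sketch below; the function name `naive_csr_spmv` and the use of `std::vector` are assumptions made for illustration, not the code that was actually benchmarked.

```cpp
#include <cassert>
#include <vector>

// Hypothetical naive CSR kernel computing y = A * x, one row per iteration.
// rowptr has nrows + 1 entries; colidx and values hold one entry per nonzero.
std::vector<double> naive_csr_spmv(const std::vector<int>& rowptr,
                                   const std::vector<int>& colidx,
                                   const std::vector<double>& values,
                                   const std::vector<double>& x) {
  assert(!rowptr.empty());
  assert(colidx.size() == values.size());
  const int nrows = static_cast<int>(rowptr.size()) - 1;
  std::vector<double> y(nrows, 0.0);

  // Rows are independent, so a static schedule over rows is the simplest
  // parallelization. The pragma is ignored if OpenMP is not enabled.
#pragma omp parallel for schedule(static)
  for (int i = 0; i < nrows; ++i) {
    double sum = 0.0;
    for (int k = rowptr[i]; k < rowptr[i + 1]; ++k) {
      sum += values[k] * x[colidx[k]];
    }
    y[i] = sum;
  }
  return y;
}
```

Built with `-fopenmp`, the loop is split across threads by row; without it the same code runs serially.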

I would like to kindly ask whether I have configured the 'kokkos' or 'kokkos-kernels' projects incorrectly. Thank you.

pmpakos, Jun 03 '22 07:06

Hi @pmpakos, thanks for bringing this to our attention. Could you please let me know:

  • which version of Kokkos Kernels you're using
  • which MKL you are using
  • which system you are testing on (e.g. if it's a particular supercomputer, we may have access as well)
  • how you have configured Kokkos / Kokkos Kernels (your full cmake command)

cwpearson, Jun 03 '22 15:06

Notes on reproducing:

Internally, the Kokkos Kernels MKL TPL calls:

  • mkl_scsrmv(&mode, &m, &n, &alpha, "G**C", Avalues, Aentries, Arowptrs, Arowptrs + 1, x, &beta, y);
  • mkl_dcsrmv(&mode, &m, &n, &alpha, "G**C", Avalues, Aentries, Arowptrs, Arowptrs + 1, x, &beta, y);
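
To make the argument layout in those calls easier to read: passing `Arowptrs` and `Arowptrs + 1` supplies the per-row begin and end offsets of a standard CSR row-pointer array, and the `C` in `"G**C"` selects zero-based indexing for a general matrix. The sketch below is a plain reference loop with the same argument roles, written for illustration only; `csrmv_reference` is a hypothetical name and this is not MKL's implementation.

```cpp
#include <cassert>
#include <vector>

// Reference (non-MKL) version of y = alpha * A * x + beta * y using the
// begin/end pointer layout: row i spans [pntrb[i], pntre[i]) in values/indices.
// Zero-based indexing is assumed, matching the 'C' in the "G**C" descriptor.
void csrmv_reference(int m, double alpha, const double* values,
                     const int* indices, const int* pntrb, const int* pntre,
                     const double* x, double beta, double* y) {
  assert(m >= 0);
  for (int i = 0; i < m; ++i) {
    double sum = 0.0;
    for (int k = pntrb[i]; k < pntre[i]; ++k) {
      sum += values[k] * x[indices[k]];
    }
    y[i] = alpha * sum + beta * y[i];
  }
}
```

With a single row-pointer array `rowptr` of length `m + 1`, the call `csrmv_reference(m, alpha, values, indices, rowptr, rowptr + 1, x, beta, y)` mirrors how the two pointer arguments are formed above.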

Built on Perlmutter as follows:

module load PrgEnv-gnu
module load cmake/3.22.0
module load cudatoolkit
module load cpe-cuda
module load fast-mkl-amd/fast-mkl-amd
$ module list

Currently Loaded Modules:
  1) craype-x86-milan                       8) Nsight-Compute/2022.1.1  15) craype/2.7.15
  2) libfabric/1.11.0.4.114                 9) Nsight-Systems/2022.2.1  16) gcc/10.3.0
  3) craype-network-ofi                    10) cudatoolkit/11.5         17) perftools-base/22.04.0
  4) xpmem/2.3.2-2.2_7.5__g93dd7ee.shasta  11) PrgEnv-gnu/8.3.3         18) cpe-cuda/22.04
  5) xalt/2.10.2                           12) cray-dsmml/0.2.2         19) fast-mkl-amd/fast-mkl-amd
  6) darshan/3.3.1                         13) cray-libsci/21.08.1.2
  7) cmake/3.22.0                          14) cray-mpich/8.1.15
export MKLROOT=$HOME/intel/oneapi/mkl/2022.0.2
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$HOME/intel/oneapi/compiler/2022.0.2/linux/compiler/lib/intel64_lin"
cmake .. \
-DCMAKE_CXX_COMPILER=g++ \
-DCMAKE_BUILD_TYPE=Release \
-DKokkos_ENABLE_HWLOC=ON \
-DKokkos_ARCH_ZEN3=ON \
-DKokkosKernels_INST_COMPLEX_FLOAT=OFF \
-DKokkosKernels_INST_DOUBLE=ON \
-DKokkosKernels_INST_FLOAT=ON \
-DKokkosKernels_INST_OFFSET_INT=ON \
-DKokkosKernels_INST_OFFSET_SIZE_T=OFF \
-DKokkosKernels_INST_ORDINAL_INT=ON \
-DKokkosKernels_ENABLE_TESTS=ON \
-DKokkos_ENABLE_OPENMP=ON \
-DTPL_ENABLE_MKL=ON \
-DCMAKE_CXX_FLAGS="-I$HOME/intel/oneapi/mkl/2022.0.2/include -DHAVE_MKL" \
-DCMAKE_CXX_LINK_FLAGS="-L$HOME/intel/oneapi/compiler/2022.0.2/linux/compiler/lib/intel64_lin -liomp5"
salloc --nodes 1 --qos debug --time 00:30:00 --constraint gpu --gpus 4 -A m3953_g --gpu-bind=none
env | grep OMP_
PE_LIBSCI_OMP_REQUIRES=
PE_LIBSCI_OMP_REQUIRES_openmp=_mp
PE_LIBSCI_PKGCONFIG_VARIABLES=PE_LIBSCI_OMP_REQUIRES_@openmp@:PE_SCI_EXT_LIBPATH:PE_SCI_EXT_LIBNAME
srun -c 64 --tasks-per-node 1  kokkos-kernels/perf_test/sparse/sparse_spmv --test mkl  -f ../scircuit/scircuit.mtx
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
running CRS matrix single vec
NNZ NumRows NumCols ProblemSize(MB) AveBandwidth(GB/s) MinBandwidth(GB/s) MaxBandwidth(GB/s) AveGFlop MinGFlop MaxGFlop aveTime(ms) maxTime(ms) minTime(ms) numErrors
958936 110503 110503  13.08 (  56.06  11.92  78.50 ) (  5.676  1.207  7.949 ) (  0.338  1.589  0.241 ) 0 RESULT
Kokkos::MultiVector Test: Passed
srun -c 64 --tasks-per-node 1  kokkos-kernels/perf_test/sparse/sparse_spmv --test kk  -f ../scircuit/scircuit.mtx
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
running CRS matrix single vec
NNZ NumRows NumCols ProblemSize(MB) AveBandwidth(GB/s) MinBandwidth(GB/s) MaxBandwidth(GB/s) AveGFlop MinGFlop MaxGFlop aveTime(ms) maxTime(ms) minTime(ms) numErrors
958936 110503 110503  13.08 (  47.80  32.85  74.72 ) (  4.840  3.327  7.565 ) (  0.396  0.577  0.254 ) 0 RESULT
Kokkos::MultiVector Test: Passed
srun -c 64 --tasks-per-node 1  kokkos-kernels/perf_test/sparse/sparse_spmv --test openmp-static  -f ../scircuit/scircuit.mtx
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
running CRS matrix single vec
NNZ NumRows NumCols ProblemSize(MB) AveBandwidth(GB/s) MinBandwidth(GB/s) MaxBandwidth(GB/s) AveGFlop MinGFlop MaxGFlop aveTime(ms) maxTime(ms) minTime(ms) numErrors
958936 110503 110503  13.08 (  44.29  35.85  62.88 ) (  4.484  3.630  6.367 ) (  0.428  0.528  0.301 ) 0 RESULT
Kokkos::MultiVector Test: Passed
srun -c 64 --tasks-per-node 1  kokkos-kernels/perf_test/sparse/sparse_spmv --test openmp-dynamic  -f ../scircuit/scircuit.mtx
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
running CRS matrix single vec
NNZ NumRows NumCols ProblemSize(MB) AveBandwidth(GB/s) MinBandwidth(GB/s) MaxBandwidth(GB/s) AveGFlop MinGFlop MaxGFlop aveTime(ms) maxTime(ms) minTime(ms) numErrors
958936 110503 110503  13.08 (  51.57  36.15  70.36 ) (  5.222  3.660  7.124 ) (  0.367  0.524  0.269 ) 0 RESULT
Kokkos::MultiVector Test: Passed
srun -c 64 --tasks-per-node 1  kokkos-kernels/perf_test/sparse/sparse_spmv --test kk-kernels  -f ../scircuit/scircuit.mtx
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
running CRS matrix single vec
NNZ NumRows NumCols ProblemSize(MB) AveBandwidth(GB/s) MinBandwidth(GB/s) MaxBandwidth(GB/s) AveGFlop MinGFlop MaxGFlop aveTime(ms) maxTime(ms) minTime(ms) numErrors
958936 110503 110503  13.08 (  72.19  47.45 105.40 ) (  7.309  4.804 10.672 ) (  0.262  0.399  0.180 ) 0 RESULT
Kokkos::MultiVector Test: Passed
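
As a sanity check on the GFlop columns, the reported figures appear consistent with the usual SpMV count of two floating-point operations (one multiply, one add) per stored nonzero. A minimal sketch, assuming that convention; `spmv_gflops` is a hypothetical helper, not part of the benchmark:

```cpp
#include <cassert>

// SpMV does one multiply and one add per stored nonzero: 2 * nnz flops.
// time_ms is the measured time of a single SpMV in milliseconds.
double spmv_gflops(long long nnz, double time_ms) {
  assert(time_ms > 0.0);
  const double flops = 2.0 * static_cast<double>(nnz);
  return flops / (time_ms * 1.0e-3) / 1.0e9;
}
```

For the `--test mkl` run above (NNZ 958936, aveTime 0.338 ms) this gives roughly 5.67, close to the reported AveGFlop of 5.676; the small gap is presumably due to the rounded time in the printout.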

cwpearson, Jun 03 '22 16:06