kokkos-kernels
Low performance of SpMV on AMD Epyc 7402
Hello, I have compiled the 'kokkos' project for the AMD Zen 2 architecture (as reported here) with -DKokkos_ARCH_ZEN2=On and with OpenMP enabled (-DKokkos_ENABLE_OPENMP=On). After building successfully, I also compiled the 'kokkos-kernels' project with the MKL third-party library enabled (-DKokkosKernels_ENABLE_TPL_MKL=On).
However, when running the SpMV test (./perf_test/sparse/sparse_spmv) linked here, I get lower performance than expected. (The number of iterations is set to 128, and the matrix examined is scircuit.)
For example, when running the omp-static test, the reported average performance is 1.836 GFLOPs (for 24 threads), while a handwritten naive CSR OpenMP implementation achieves 10.5 GFLOPs. Additionally, with MKL, the average performance reported through kokkos-spmv is 1.76 GFLOPs, while an equivalent handwritten MKL call achieves 31 GFLOPs.
Could you please tell me whether I have configured the 'kokkos' or 'kokkos-kernels' project incorrectly? Thank you.
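For reference, a naive CSR OpenMP kernel of the kind mentioned above might look like the following sketch. The poster's actual code is not shown in the thread, so the function name `naive_csr_spmv`, the use of `std::vector`, and the static schedule are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical naive CSR SpMV, y = A * x (not the poster's actual code).
// row_ptr has num_rows + 1 entries; col_idx and values hold the nonzeros.
std::vector<double> naive_csr_spmv(const std::vector<int>& row_ptr,
                                   const std::vector<int>& col_idx,
                                   const std::vector<double>& values,
                                   const std::vector<double>& x) {
  assert(!row_ptr.empty());
  assert(col_idx.size() == values.size());
  const int num_rows = static_cast<int>(row_ptr.size()) - 1;
  std::vector<double> y(static_cast<std::size_t>(num_rows), 0.0);

  // One row per loop iteration; rows are independent, so no reduction
  // across threads is needed. Without -fopenmp the pragma is ignored and
  // the loop simply runs serially.
#pragma omp parallel for schedule(static)
  for (int i = 0; i < num_rows; ++i) {
    double sum = 0.0;
    for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j) {
      sum += values[j] * x[col_idx[j]];
    }
    y[i] = sum;
  }
  return y;
}
```

When comparing such a kernel against the perf test, the thread count and thread binding should be the same in both runs; otherwise the GFLOPs figures are not directly comparable.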
Hi @pmpakos, thanks for bringing this to our attention. Could you please let me know:
- which version of Kokkos Kernels you're using
- which MKL you are using
- which system you are testing on (e.g. if it's a particular supercomputer, we may have access as well)
- how you have configured Kokkos / Kokkos Kernels (your full cmake command)
Notes on reproducing:
Internally, the Kokkos Kernels MKL TPL uses:
- mkl_scsrmv(&mode, &m, &n, &alpha, "G**C", Avalues, Aentries, Arowptrs, Arowptrs + 1, x, &beta, y);
- mkl_dcsrmv(&mode, &m, &n, &alpha, "G**C", Avalues, Aentries, Arowptrs, Arowptrs + 1, x, &beta, y);
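The argument pattern in these calls can be read as follows: in the descriptor "G**C", `G` marks a general matrix and `C` selects zero-based indexing, and passing `Arowptrs` and `Arowptrs + 1` maps a standard CSR row-pointer array onto MKL's separate row-begin and row-end arrays. The sketch below is a plain C++ reference for what the double-precision call computes under two assumptions not stated in the thread: `mode` requests the non-transposed product, and `Arowptrs[0]` is 0. The names `csrmv_reference` and `csrmv_from_rowptrs` are hypothetical and are not MKL or Kokkos Kernels functions.

```cpp
#include <cassert>
#include <vector>

// Reference semantics (assumed non-transposed, zero-based):
//   y = alpha * A * x + beta * y
// pntrb[i] / pntre[i] give the first and one-past-last nonzero of row i.
void csrmv_reference(int m, double alpha, const double* val, const int* indx,
                     const int* pntrb, const int* pntre, const double* x,
                     double beta, double* y) {
  for (int i = 0; i < m; ++i) {
    double sum = 0.0;
    for (int j = pntrb[i]; j < pntre[i]; ++j) {
      sum += val[j] * x[indx[j]];
    }
    y[i] = alpha * sum + beta * y[i];
  }
}

// Mirrors the call pattern above: a single CSR row-pointer array serves as
// both arrays, with the row-end array offset by one entry.
void csrmv_from_rowptrs(double alpha, const std::vector<double>& values,
                        const std::vector<int>& entries,
                        const std::vector<int>& rowptrs,
                        const std::vector<double>& x, double beta,
                        std::vector<double>& y) {
  assert(rowptrs.size() == y.size() + 1);
  assert(values.size() == entries.size());
  const int m = static_cast<int>(y.size());
  csrmv_reference(m, alpha, values.data(), entries.data(), rowptrs.data(),
                  rowptrs.data() + 1, x.data(), beta, y.data());
}
```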
Built on Perlmutter as follows:
module load PrgEnv-gnu
module load cmake/3.22.0
module load cudatoolkit
module load cpe-cuda
module load fast-mkl-amd/fast-mkl-amd
$ module list
Currently Loaded Modules:
1) craype-x86-milan 8) Nsight-Compute/2022.1.1 15) craype/2.7.15
2) libfabric/1.11.0.4.114 9) Nsight-Systems/2022.2.1 16) gcc/10.3.0
3) craype-network-ofi 10) cudatoolkit/11.5 17) perftools-base/22.04.0
4) xpmem/2.3.2-2.2_7.5__g93dd7ee.shasta 11) PrgEnv-gnu/8.3.3 18) cpe-cuda/22.04
5) xalt/2.10.2 12) cray-dsmml/0.2.2 19) fast-mkl-amd/fast-mkl-amd
6) darshan/3.3.1 13) cray-libsci/21.08.1.2
7) cmake/3.22.0 14) cray-mpich/8.1.15
export MKLROOT=$HOME/intel/oneapi/mkl/2022.0.2
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$HOME/intel/oneapi/compiler/2022.0.2/linux/compiler/lib/intel64_lin"
cmake .. \
-DCMAKE_CXX_COMPILER=g++ \
-DCMAKE_BUILD_TYPE=Release \
-DKokkos_ENABLE_HWLOC=ON \
-DKokkos_ARCH_ZEN3=ON \
-DKokkosKernels_INST_COMPLEX_FLOAT=OFF \
-DKokkosKernels_INST_DOUBLE=ON \
-DKokkosKernels_INST_FLOAT=ON \
-DKokkosKernels_INST_OFFSET_INT=ON \
-DKokkosKernels_INST_OFFSET_SIZE_T=OFF \
-DKokkosKernels_INST_ORDINAL_INT=ON \
-DKokkosKernels_ENABLE_TESTS=ON \
-DKokkos_ENABLE_OPENMP=ON \
-DTPL_ENABLE_MKL=ON \
-DCMAKE_CXX_FLAGS="-I$HOME/intel/oneapi/mkl/2022.0.2/include -DHAVE_MKL" \
-DCMAKE_CXX_LINK_FLAGS="-L$HOME/intel/oneapi/compiler/2022.0.2/linux/compiler/lib/intel64_lin -liomp5"
salloc --nodes 1 --qos debug --time 00:30:00 --constraint gpu --gpus 4 -A m3953_g --gpu-bind=none
env | grep OMP_
PE_LIBSCI_OMP_REQUIRES=
PE_LIBSCI_OMP_REQUIRES_openmp=_mp
PE_LIBSCI_PKGCONFIG_VARIABLES=PE_LIBSCI_OMP_REQUIRES_@openmp@:PE_SCI_EXT_LIBPATH:PE_SCI_EXT_LIBNAME
srun -c 64 --tasks-per-node 1 kokkos-kernels/perf_test/sparse/sparse_spmv --test mkl -f ../scircuit/scircuit.mtx
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
For unit testing set OMP_PROC_BIND=false
running CRS matrix single vec
NNZ NumRows NumCols ProblemSize(MB) AveBandwidth(GB/s) MinBandwidth(GB/s) MaxBandwidth(GB/s) AveGFlop MinGFlop MaxGFlop aveTime(ms) maxTime(ms) minTime(ms) numErrors
958936 110503 110503 13.08 ( 56.06 11.92 78.50 ) ( 5.676 1.207 7.949 ) ( 0.338 1.589 0.241 ) 0 RESULT
Kokkos::MultiVector Test: Passed
srun -c 64 --tasks-per-node 1 kokkos-kernels/perf_test/sparse/sparse_spmv --test kk -f ../scircuit/scircuit.mtx
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
For unit testing set OMP_PROC_BIND=false
running CRS matrix single vec
NNZ NumRows NumCols ProblemSize(MB) AveBandwidth(GB/s) MinBandwidth(GB/s) MaxBandwidth(GB/s) AveGFlop MinGFlop MaxGFlop aveTime(ms) maxTime(ms) minTime(ms) numErrors
958936 110503 110503 13.08 ( 47.80 32.85 74.72 ) ( 4.840 3.327 7.565 ) ( 0.396 0.577 0.254 ) 0 RESULT
Kokkos::MultiVector Test: Passed
srun -c 64 --tasks-per-node 1 kokkos-kernels/perf_test/sparse/sparse_spmv --test openmp-static -f ../scircuit/scircuit.mtx
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
For unit testing set OMP_PROC_BIND=false
running CRS matrix single vec
NNZ NumRows NumCols ProblemSize(MB) AveBandwidth(GB/s) MinBandwidth(GB/s) MaxBandwidth(GB/s) AveGFlop MinGFlop MaxGFlop aveTime(ms) maxTime(ms) minTime(ms) numErrors
958936 110503 110503 13.08 ( 44.29 35.85 62.88 ) ( 4.484 3.630 6.367 ) ( 0.428 0.528 0.301 ) 0 RESULT
Kokkos::MultiVector Test: Passed
srun -c 64 --tasks-per-node 1 kokkos-kernels/perf_test/sparse/sparse_spmv --test openmp-dynamic -f ../scircuit/scircuit.mtx
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
For unit testing set OMP_PROC_BIND=false
running CRS matrix single vec
NNZ NumRows NumCols ProblemSize(MB) AveBandwidth(GB/s) MinBandwidth(GB/s) MaxBandwidth(GB/s) AveGFlop MinGFlop MaxGFlop aveTime(ms) maxTime(ms) minTime(ms) numErrors
958936 110503 110503 13.08 ( 51.57 36.15 70.36 ) ( 5.222 3.660 7.124 ) ( 0.367 0.524 0.269 ) 0 RESULT
Kokkos::MultiVector Test: Passed
srun -c 64 --tasks-per-node 1 kokkos-kernels/perf_test/sparse/sparse_spmv --test kk-kernels -f ../scircuit/scircuit.mtx
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
For unit testing set OMP_PROC_BIND=false
running CRS matrix single vec
NNZ NumRows NumCols ProblemSize(MB) AveBandwidth(GB/s) MinBandwidth(GB/s) MaxBandwidth(GB/s) AveGFlop MinGFlop MaxGFlop aveTime(ms) maxTime(ms) minTime(ms) numErrors
958936 110503 110503 13.08 ( 72.19 47.45 105.40 ) ( 7.309 4.804 10.672 ) ( 0.262 0.399 0.180 ) 0 RESULT
Kokkos::MultiVector Test: Passed
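The GFlop columns in the tables above appear consistent with counting two floating-point operations (one multiply, one add) per stored nonzero and dividing by the measured time. That convention is an assumption, since the thread does not state it, but it reproduces the reported values to within the rounding of the printed times. A minimal sketch, with the hypothetical helper name `spmv_gflops`:

```cpp
#include <cassert>
#include <cstdint>

// Assumed convention: 2 flops per stored nonzero (multiply + add).
// time_ms is the time of one SpMV in milliseconds.
double spmv_gflops(std::int64_t nnz, double time_ms) {
  assert(time_ms > 0.0);
  const double flops = 2.0 * static_cast<double>(nnz);
  const double seconds = time_ms * 1.0e-3;
  return flops / seconds / 1.0e9;
}
```

For the MKL run, `spmv_gflops(958936, 0.338)` gives roughly 5.67, close to the reported AveGFlop of 5.676. Under the same convention, MinGFlop lines up with maxTime and MaxGFlop with minTime.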