kokkos-kernels
Incomplete capture of Benchmark context on multi-GPU systems
https://github.com/kokkos/kokkos-kernels/blob/b9c1bab7a8ae7c9413ad09f612f9011ff04a819e/perf_test/Benchmark_Context.hpp#L58
On a system with multiple GPUs, the benchmark prints errors like the following:
Failed to add custom context "Kokkos" as it already exists with value "Cuda[ 0 ] Tesla V100-PCIE-16GB capability 7.0, Total Global Memory: 15.77 G, Shared Memory per Block: 48 K : Selected"
Failed to add custom context "Kokkos" as it already exists with value "Cuda[ 1 ] Tesla V100-PCIE-16GB capability 7.0, Total Global Memory: 15.77 G, Shared Memory per Block: 48 K"
This occurs because of how the output of Kokkos::print_config() is parsed. That output looks like this:
Kokkos Version: 4.0.0
Compiler:
KOKKOS_COMPILER_GNU: 1010
KOKKOS_COMPILER_NVCC: 1140
Architecture:
CPU architecture: none
Default Device: N6Kokkos4CudaE
GPU architecture: VOLTA70
Atomics:
KOKKOS_ENABLE_GNU_ATOMICS: no
KOKKOS_ENABLE_INTEL_ATOMICS: no
KOKKOS_ENABLE_WINDOWS_ATOMICS: no
Vectorization:
KOKKOS_ENABLE_PRAGMA_IVDEP: no
KOKKOS_ENABLE_PRAGMA_LOOPCOUNT: no
KOKKOS_ENABLE_PRAGMA_UNROLL: no
KOKKOS_ENABLE_PRAGMA_VECTOR: no
Memory:
KOKKOS_ENABLE_HBWSPACE: no
KOKKOS_ENABLE_INTEL_MM_ALLOC: no
Options:
KOKKOS_ENABLE_ASM: yes
KOKKOS_ENABLE_CXX17: yes
KOKKOS_ENABLE_CXX20: no
KOKKOS_ENABLE_CXX23: no
KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK: no
KOKKOS_ENABLE_HWLOC: no
KOKKOS_ENABLE_LIBDL: yes
KOKKOS_ENABLE_LIBRT: no
Host Parallel Execution Space:
KOKKOS_ENABLE_OPENMP: yes
OpenMP Atomics:
KOKKOS_ENABLE_OPENMP_ATOMICS: no
OpenMP Runtime Configuration:
Kokkos::OpenMP thread_pool_topology[ 1 x 40 x 1 ]
Device Execution Space:
KOKKOS_ENABLE_CUDA: yes
Cuda Atomics:
KOKKOS_ENABLE_CUDA_ATOMICS: no
Cuda Options:
KOKKOS_ENABLE_CUDA_LAMBDA: yes
KOKKOS_ENABLE_CUDA_LDG_INTRINSIC: yes
KOKKOS_ENABLE_CUDA_RELOCATABLE_DEVICE_CODE: no
KOKKOS_ENABLE_CUDA_UVM: no
KOKKOS_ENABLE_CXX11_DISPATCH_LAMBDA: yes
Cuda Runtime Configuration:
macro KOKKOS_ENABLE_CUDA : defined
macro CUDA_VERSION = 11040 = version 11.4
Kokkos::Cuda[ 0 ] Tesla V100-PCIE-16GB capability 7.0, Total Global Memory: 15.77 G, Shared Memory per Block: 48 K : Selected
Kokkos::Cuda[ 1 ] Tesla V100-PCIE-16GB capability 7.0, Total Global Memory: 15.77 G, Shared Memory per Block: 48 K
The parser splits each line on the first ':' and adds the text before and after it to the benchmark context as a key-value pair.
This breaks on the last two lines: the first ':' falls inside the Kokkos::Cuda prefix, so both lines produce the key "Kokkos", which already exists in the context.
A fix may be as simple as using find_last_of instead of find_first_of at
https://github.com/kokkos/kokkos-kernels/blob/ff097ec635752ed73160feba598cdc68c372f4bd/perf_test/Benchmark_Context.hpp#L56
so that each line is split on its last ':' instead of its first.
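To make the difference concrete, here is a minimal sketch of the two splitting strategies. It is not the code in Benchmark_Context.hpp: the helpers trim, split_at, split_on_first_colon, and split_on_last_colon are hypothetical names introduced for illustration, and the sketch assumes every line passed in contains at least one ':' (the caller would have to skip lines such as "macro CUDA_VERSION = 11040 = version 11.4").

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper: strips leading and trailing spaces and tabs.
inline std::string trim(const std::string& s) {
  const auto first = s.find_first_not_of(" \t");
  if (first == std::string::npos) return "";
  const auto last = s.find_last_not_of(" \t");
  return s.substr(first, last - first + 1);
}

// Splits `line` at `pos`, which must be the index of a ':' in `line`,
// and returns the trimmed text before and after it as {key, value}.
inline std::pair<std::string, std::string> split_at(const std::string& line,
                                                    std::size_t pos) {
  assert(pos < line.size() && line[pos] == ':');  // line must contain ':'
  return {trim(line.substr(0, pos)), trim(line.substr(pos + 1))};
}

// Behavior described in this issue: split on the first ':'.
// "Kokkos::Cuda[ 0 ] ..." and "Kokkos::Cuda[ 1 ] ..." both yield the key "Kokkos".
inline std::pair<std::string, std::string> split_on_first_colon(
    const std::string& line) {
  return split_at(line, line.find_first_of(':'));
}

// Proposed alternative: split on the last ':'.
// The device index stays in the key, so the two GPU lines no longer collide.
inline std::pair<std::string, std::string> split_on_last_colon(
    const std::string& line) {
  return split_at(line, line.find_last_of(':'));
}
```

With the two GPU lines from the output above, split_on_first_colon returns the key "Kokkos" for both, while split_on_last_colon returns two distinct keys with the values "Selected" and "48 K". The resulting keys are long and the values are not very descriptive, so this only removes the collision; a cleaner key-value mapping for the device lines would need special handling.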
@meriadegp