kokkos-kernels Incomplete capture of Benchmark context on multi-GPU systems

Open cwpearson opened this issue 2 years ago • 1 comments

https://github.com/kokkos/kokkos-kernels/blob/b9c1bab7a8ae7c9413ad09f612f9011ff04a819e/perf_test/Benchmark_Context.hpp#L58

On a system with multiple GPUs, we'll see something like

Failed to add custom context "Kokkos" as it already exists with value "Cuda[ 0 ] Tesla V100-PCIE-16GB capability 7.0, Total Global Memory: 15.77 G, Shared Memory per Block: 48 K : Selected"
Failed to add custom context "Kokkos" as it already exists with value "Cuda[ 1 ] Tesla V100-PCIE-16GB capability 7.0, Total Global Memory: 15.77 G, Shared Memory per Block: 48 K"

This occurs because of how the output of Kokkos::print_configuration() is parsed. That output looks like this:

  Kokkos Version: 4.0.0
Compiler:
  KOKKOS_COMPILER_GNU: 1010
  KOKKOS_COMPILER_NVCC: 1140
Architecture:
  CPU architecture: none
  Default Device: N6Kokkos4CudaE
  GPU architecture: VOLTA70
Atomics:
  KOKKOS_ENABLE_GNU_ATOMICS: no
  KOKKOS_ENABLE_INTEL_ATOMICS: no
  KOKKOS_ENABLE_WINDOWS_ATOMICS: no
Vectorization:
  KOKKOS_ENABLE_PRAGMA_IVDEP: no
  KOKKOS_ENABLE_PRAGMA_LOOPCOUNT: no
  KOKKOS_ENABLE_PRAGMA_UNROLL: no
  KOKKOS_ENABLE_PRAGMA_VECTOR: no
Memory:
  KOKKOS_ENABLE_HBWSPACE: no
  KOKKOS_ENABLE_INTEL_MM_ALLOC: no
Options:
  KOKKOS_ENABLE_ASM: yes
  KOKKOS_ENABLE_CXX17: yes
  KOKKOS_ENABLE_CXX20: no
  KOKKOS_ENABLE_CXX23: no
  KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK: no
  KOKKOS_ENABLE_HWLOC: no
  KOKKOS_ENABLE_LIBDL: yes
  KOKKOS_ENABLE_LIBRT: no
Host Parallel Execution Space:
  KOKKOS_ENABLE_OPENMP: yes
OpenMP Atomics:
  KOKKOS_ENABLE_OPENMP_ATOMICS: no

OpenMP Runtime Configuration:
Kokkos::OpenMP thread_pool_topology[ 1 x 40 x 1 ]
Device Execution Space:
  KOKKOS_ENABLE_CUDA: yes
Cuda Atomics:
  KOKKOS_ENABLE_CUDA_ATOMICS: no
Cuda Options:
  KOKKOS_ENABLE_CUDA_LAMBDA: yes
  KOKKOS_ENABLE_CUDA_LDG_INTRINSIC: yes
  KOKKOS_ENABLE_CUDA_RELOCATABLE_DEVICE_CODE: no
  KOKKOS_ENABLE_CUDA_UVM: no
  KOKKOS_ENABLE_CXX11_DISPATCH_LAMBDA: yes

Cuda Runtime Configuration:
macro  KOKKOS_ENABLE_CUDA      : defined
macro  CUDA_VERSION          = 11040 = version 11.4
Kokkos::Cuda[ 0 ] Tesla V100-PCIE-16GB capability 7.0, Total Global Memory: 15.77 G, Shared Memory per Block: 48 K : Selected
Kokkos::Cuda[ 1 ] Tesla V100-PCIE-16GB capability 7.0, Total Global Memory: 15.77 G, Shared Memory per Block: 48 K

The parsing splits each line at the first : and adds the text before and after it to the benchmark context as a key-value pair. This breaks on the last two lines: the first : falls inside the Kokkos::Cuda prefix, so both lines produce the same key, Kokkos.
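
To make the failure mode concrete, here is a minimal sketch of first-colon splitting. The helper name split_at_first_colon is hypothetical and this is not the actual Benchmark_Context.hpp code; it only assumes the behavior described above (split at the first : into key and value).

```cpp
#include <string>
#include <utility>

// Hypothetical helper for illustration only (not the kokkos-kernels code):
// split a line at its FIRST ':' into a (key, value) pair.
// Lines without any ':' are returned as (line, "").
std::pair<std::string, std::string> split_at_first_colon(const std::string& line) {
  const std::string::size_type pos = line.find_first_of(':');
  if (pos == std::string::npos) {
    return {line, std::string()};
  }
  // Key is everything before the colon, value is everything after it.
  return {line.substr(0, pos), line.substr(pos + 1)};
}
```

With this splitting, a line such as "KOKKOS_ENABLE_CUDA: yes" yields the key "KOKKOS_ENABLE_CUDA", but every line that starts with "Kokkos::Cuda[" yields the key "Kokkos", so the second and later entries collide.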

A fix may be as simple as using find_last_of instead of find_first_of at https://github.com/kokkos/kokkos-kernels/blob/ff097ec635752ed73160feba598cdc68c372f4bd/perf_test/Benchmark_Context.hpp#L56, thereby splitting on the last : instead of the first one.
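
As a rough sketch of that idea, the snippet below splits at the last : instead. The helper name split_at_last_colon is hypothetical, and this is an illustration under the assumptions in this issue, not a tested patch against Benchmark_Context.hpp.

```cpp
#include <string>
#include <utility>

// Hypothetical helper for illustration only (not the kokkos-kernels code):
// split a line at its LAST ':' into a (key, value) pair.
// Lines without any ':' are returned as (line, "").
std::pair<std::string, std::string> split_at_last_colon(const std::string& line) {
  const std::string::size_type pos = line.find_last_of(':');
  if (pos == std::string::npos) {
    return {line, std::string()};
  }
  // Key is everything before the last colon, value is everything after it.
  return {line.substr(0, pos), line.substr(pos + 1)};
}
```

Applied to the two Kokkos::Cuda lines from the output above, the keys now contain the device index ("Cuda[ 0 ]" and "Cuda[ 1 ]") and no longer collide. One caveat worth checking before adopting this: a line whose only colons are the :: in a prefix, such as the Kokkos::OpenMP thread_pool_topology line, would still split inside the ::, giving the key "Kokkos:".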

Mar 16 '23 22:03 cwpearson

@meriadegp

Mar 16 '23 22:03 cwpearson
