kokkos icon indicating copy to clipboard operation
kokkos copied to clipboard

View initialization doesn't seem to handle first touch correctly

Open mwarusz opened this issue 11 months ago • 5 comments

Describe the bug

Kokkos doesn't seem to handle first touch correctly in view initialization, leading to substantial performance degradation with the OpenMP backend. #7969 seems related, but I think my issue is more concrete.

Please include the following for a minimal reproducer

  1. Compilers (with versions) Intel(R) oneAPI DPC++/C++ Compiler 2024.1.0 (2024.1.0.20240308)
  2. Kokkos release or commit used (i.e., the sha1 number) 4.6.0
  3. Platform, architecture and backend blake, X86, OpenMP
  4. CMake configure command cmake .. -DCMAKE_CXX_COMPILER=icpx -DCMAKE_BUILD_TYPE=Release -DKokkos_ENABLE_SERIAL=OFF -DKokkos_ENABLE_OPENMP=ON -DKokkos_ARCH_NATIVE=ON
  5. Output from CMake configure command
  6. Minimum, complete code needed to reproduce the bug
#include <Kokkos_Core.hpp>
#include <cstdlib>

using Real = double;

using ExecSpace = Kokkos::DefaultHostExecutionSpace;
using MemSpace = ExecSpace::memory_space;
using Layout = ExecSpace::array_layout;
using RangePolicy = Kokkos::RangePolicy<ExecSpace>;
using Real1d = Kokkos::View<Real *, Layout, MemSpace>;

using Kokkos::parallel_for;
using Kokkos::Timer;

void benchmark_default(int n) {
  Real1d a("a", n);
  
  parallel_for(RangePolicy(0, n), KOKKOS_LAMBDA(int i) {
    a(i) = i % 256;
  });
  
  Real1d b("b", n);

  const int nrep = 10;
  Timer timer;
  for (int r = 0; r < nrep; ++r) {
    parallel_for(RangePolicy(0, n), KOKKOS_LAMBDA(int i) {
        b(i) = a(i) * a(i);
    });
  }
  const double sec = timer.seconds() / nrep;
  const double bandwidth = (2 * sizeof(Real) * n) / (sec * 1e9);

  Kokkos::printf("Using Kokkos default init:\n");
  Kokkos::printf("  %12s %12s %12s\n", "n", "bandwidth", "time");
  Kokkos::printf("  %12d %12.2f %12.2e\n", n, bandwidth, sec);
}

void benchmark_manual(int n) {
  Real1d a(Kokkos::view_alloc("a", Kokkos::WithoutInitializing), n);
  
  parallel_for(RangePolicy(0, n), KOKKOS_LAMBDA(int i) {
    a(i) = i % 256;
  });
  
  Real1d b(Kokkos::view_alloc("b", Kokkos::WithoutInitializing), n);
 
  // manually init b in parallel 
  parallel_for(RangePolicy(0, n), KOKKOS_LAMBDA(int i) {
    b(i) = 0;
  });

  const int nrep = 10;
  Timer timer;
  for (int r = 0; r < nrep; ++r) {
    parallel_for(RangePolicy(0, n), KOKKOS_LAMBDA(int i) {
        b(i) = a(i) * a(i);
    });
  }
  const double sec = timer.seconds() / nrep;
  const double bandwidth = (2 * sizeof(Real) * n) / (sec * 1e9);

  Kokkos::printf("Using manual first touch init:\n");
  Kokkos::printf("  %12s %12s %12s\n", "n", "bandwidth", "time");
  Kokkos::printf("  %12d %12.2f %12.2e\n", n, bandwidth, sec);
}

int main(int argc, char* argv[]) {
  auto guard = Kokkos::ScopeGuard(argc, argv);
  
  const int n = argc > 1 ? std::atoi(argv[1]) : 1e9;
  benchmark_default(n);
  benchmark_manual(n);
}
  1. Command line needed to reproduce the bug

On blake SPR node (2 NUMA domains):

$ export OMP_NUM_THREADS=96
$ export OMP_PLACES=threads
$ export OMP_PROC_BIND=spread
$ ./first_touch
Using Kokkos default init:
             n    bandwidth         time
    1000000000        98.14     1.63e-01
Using manual first touch init:
             n    bandwidth         time
    1000000000       338.75     4.72e-02

On blake GR node (6 NUMA domains):

$ export OMP_NUM_THREADS=256
$ export OMP_PLACES=threads
$ export OMP_PROC_BIND=spread
$ ./first_touch
Using Kokkos default init:
             n    bandwidth         time
    1000000000       190.36     8.41e-02
Using manual first touch init:
             n    bandwidth         time
    1000000000      1044.55     1.53e-02
  1. KokkosCore_config.h header file (generated during the build)
  2. Please provide any additional relevant error logs

mwarusz avatar Apr 30 '25 20:04 mwarusz

We might want to look if we should revise not using memset for these large NUMA domains. @PaulGannay did you have time to look at this already? Let me know if I can help

JBludau avatar May 05 '25 14:05 JBludau

Not yet, I'm working on something else right now, but it is the next thing I will tackle.

Having a reproducer immediately available will make it easier, thank you @mwarusz.

PaulGannay avatar May 06 '25 06:05 PaulGannay

I think the issue comes from memset not spreading memory across NUMA nodes efficiently.

It probably only touches memory close to the thread that calls it, while a parallel_for lets each thread access its local memory, which spreads it better across NUMA nodes.

It appears that this is a well-known issue: https://github.com/kokkos/kokkos/blob/develop/core/src/Kokkos_CopyViews.hpp#L861

This behavior has been addressed for copies (see https://github.com/kokkos/kokkos/pull/6573), but it seems it has not yet been resolved for initializations. https://github.com/kokkos/kokkos/blob/develop/core/src/View/Kokkos_ViewAlloc.hpp#L135-L167 which uses std::memset

AdRi1t avatar May 06 '25 07:05 AdRi1t

@AdRi1t can you look at a perf test to see how large of a numa node we need to benefit from using a parallel_for to initialize vs a memset?

Should be sufficient to wrap creating a view with initialization in a timer

JBludau avatar May 13 '25 14:05 JBludau

Yes, I'll take a look at it.

AdRi1t avatar May 13 '25 16:05 AdRi1t