View initialization doesn't seem to handle first touch correctly
Describe the bug
Kokkos doesn't seem to handle first touch correctly in view initialization, leading to substantial performance degradation with the OpenMP backend. #7969 seems related, but I think my issue is more concrete.
Please include the following for a minimal reproducer
- Compilers (with versions)
Intel(R) oneAPI DPC++/C++ Compiler 2024.1.0 (2024.1.0.20240308) - Kokkos release or commit used (i.e., the sha1 number)
4.6.0 - Platform, architecture and backend
blake, X86, OpenMP - CMake configure command
cmake .. -DCMAKE_CXX_COMPILER=icpx -DCMAKE_BUILD_TYPE=Release -DKokkos_ENABLE_SERIAL=OFF -DKokkos_ENABLE_OPENMP=ON -DKokkos_ARCH_NATIVE=ON - Output from CMake configure command
- Minimum, complete code needed to reproduce the bug
#include <Kokkos_Core.hpp>
#include <cstdlib>
using Real = double;
using ExecSpace = Kokkos::DefaultHostExecutionSpace;
using MemSpace = ExecSpace::memory_space;
using Layout = ExecSpace::array_layout;
using RangePolicy = Kokkos::RangePolicy<ExecSpace>;
using Real1d = Kokkos::View<Real *, Layout, MemSpace>;
using Kokkos::parallel_for;
using Kokkos::Timer;
void benchmark_default(int n) {
Real1d a("a", n);
parallel_for(RangePolicy(0, n), KOKKOS_LAMBDA(int i) {
a(i) = i % 256;
});
Real1d b("b", n);
const int nrep = 10;
Timer timer;
for (int r = 0; r < nrep; ++r) {
parallel_for(RangePolicy(0, n), KOKKOS_LAMBDA(int i) {
b(i) = a(i) * a(i);
});
}
const double sec = timer.seconds() / nrep;
const double bandwidth = (2 * sizeof(Real) * n) / (sec * 1e9);
Kokkos::printf("Using Kokkos default init:\n");
Kokkos::printf(" %12s %12s %12s\n", "n", "bandwidth", "time");
Kokkos::printf(" %12d %12.2f %12.2e\n", n, bandwidth, sec);
}
void benchmark_manual(int n) {
Real1d a(Kokkos::view_alloc("a", Kokkos::WithoutInitializing), n);
parallel_for(RangePolicy(0, n), KOKKOS_LAMBDA(int i) {
a(i) = i % 256;
});
Real1d b(Kokkos::view_alloc("b", Kokkos::WithoutInitializing), n);
// manually init b in parallel
parallel_for(RangePolicy(0, n), KOKKOS_LAMBDA(int i) {
b(i) = 0;
});
const int nrep = 10;
Timer timer;
for (int r = 0; r < nrep; ++r) {
parallel_for(RangePolicy(0, n), KOKKOS_LAMBDA(int i) {
b(i) = a(i) * a(i);
});
}
const double sec = timer.seconds() / nrep;
const double bandwidth = (2 * sizeof(Real) * n) / (sec * 1e9);
Kokkos::printf("Using manual first touch init:\n");
Kokkos::printf(" %12s %12s %12s\n", "n", "bandwidth", "time");
Kokkos::printf(" %12d %12.2f %12.2e\n", n, bandwidth, sec);
}
int main(int argc, char* argv[]) {
auto guard = Kokkos::ScopeGuard(argc, argv);
const int n = argc > 1 ? std::atoi(argv[1]) : 1e9;
benchmark_default(n);
benchmark_manual(n);
}
- Command line needed to reproduce the bug
On blake SPR node (2 NUMA domains):
$ export OMP_NUM_THREADS=96
$ export OMP_PLACES=threads
$ export OMP_PROC_BIND=spread
$ ./first_touch
Using Kokkos default init:
n bandwidth time
1000000000 98.14 1.63e-01
Using manual first touch init:
n bandwidth time
1000000000 338.75 4.72e-02
On blake GR node (6 NUMA domains):
$ export OMP_NUM_THREADS=256
$ export OMP_PLACES=threads
$ export OMP_PROC_BIND=spread
$ ./first_touch
Using Kokkos default init:
n bandwidth time
1000000000 190.36 8.41e-02
Using manual first touch init:
n bandwidth time
1000000000 1044.55 1.53e-02
-
KokkosCore_config.hheader file (generated during the build) - Please provide any additional relevant error logs
We might want to look if we should revise not using memset for these large NUMA domains. @PaulGannay did you have time to look at this already? Let me know if I can help
Not yet, I'm working on something else right now, but it is the next thing I will tackle.
Having a reproducer immediately available will make it easier, thank you @mwarusz.
I think the issue comes from memset not spreading memory across NUMA nodes efficiently.
It probably only touches memory close to the thread that calls it, while a parallel_for lets each thread access its local memory, which spreads it better across NUMA nodes.
It appears that this is a well-known issue: https://github.com/kokkos/kokkos/blob/develop/core/src/Kokkos_CopyViews.hpp#L861
This behavior has been addressed for copies (see https://github.com/kokkos/kokkos/pull/6573), but it seems it has not yet been resolved for initializations.
https://github.com/kokkos/kokkos/blob/develop/core/src/View/Kokkos_ViewAlloc.hpp#L135-L167
which uses std::memset
@AdRi1t can you look at a perf test to see how large of a numa node we need to benefit from using a parallel_for to initialize vs a memset?
Should be sufficient to wrap creating a view with initialization in a timer
Yes, I'll take a look at it.