nalu-wind icon indicating copy to clipboard operation
nalu-wind copied to clipboard

Trilinos GPU reg tests issues with Kokkos::parallel_reduce on Summit

Open PaulMullowney opened this issue 2 years ago • 7 comments

airfoilRANSEdgeNGPTrilinos.rst produces the following compute-sanitizer errors (gcc 9.3.0 and cuda 11.0.3) though the test runs to completion. The Hypre version of this test does not produce these warnings. --tool racecheck --racecheck-report all --print-limit 100

Its possible we're seeing issues similar to those resolve with: #942

========= WARNING: Race reported between Write access at 0x4c60 in __nv_static_82__69_tmpxft_001d866f_00000000_7_Tpetra_Details_FixedHashTable_Cuda_cpp1_ii_b028c491__ZN6Kokkos4Impl33cuda_parallel_launch_local_memoryINS0_21ParallelScanWithTotalIN6Tpetra7Details93_GLOBAL__N__69_tmpxft_001d866f_00000000_7_Tpetra_Details_FixedHashTable_Cuda_cpp1_ii_b028c49124ComputeOffsetsFromCountsIiijEENS_11RangePolicyIJNS_4CudaEjEEEiS9_EEEEvT_
=========     and Read access at 0x4f70 in __nv_static_82__69_tmpxft_001d866f_00000000_7_Tpetra_Details_FixedHashTable_Cuda_cpp1_ii_b028c491__ZN6Kokkos4Impl33cuda_parallel_launch_local_memoryINS0_21ParallelScanWithTotalIN6Tpetra7Details93_GLOBAL__N__69_tmpxft_001d866f_00000000_7_Tpetra_Details_FixedHashTable_Cuda_cpp1_ii_b028c49124ComputeOffsetsFromCountsIiijEENS_11RangePolicyIJNS_4CudaEjEEEiS9_EEEEvT_ [4 hazards]
=========
========= WARNING: (Warp Level Programming) Potential RAW hazard detected at __shared__ 0x80 in block (0,0,0) :
=========     Write Thread (0,32,0) at 0x1ca0 in __nv_static_82__69_tmpxft_001d866f_00000000_7_Tpetra_Details_FixedHashTable_Cuda_cpp1_ii_b028c491__ZN6Kokkos4Impl33cuda_parallel_launch_local_memoryINS0_21ParallelScanWithTotalIN6Tpetra7Details93_GLOBAL__N__69_tmpxft_001d866f_00000000_7_Tpetra_Details_FixedHashTable_Cuda_cpp1_ii_b028c49124ComputeOffsetsFromCountsIiijEENS_11RangePolicyIJNS_4CudaEjEEEiS9_EEEEvT_
=========     Read Thread (0,33,0) at 0x2210 in __nv_static_82__69_tmpxft_001d866f_00000000_7_Tpetra_Details_FixedHashTable_Cuda_cpp1_ii_b028c491__ZN6Kokkos4Impl33cuda_parallel_launch_local_memoryINS0_21ParallelScanWithTotalIN6Tpetra7Details93_GLOBAL__N__69_tmpxft_001d866f_00000000_7_Tpetra_Details_FixedHashTable_Cuda_cpp1_ii_b028c49124ComputeOffsetsFromCountsIiijEENS_11RangePolicyIJNS_4CudaEjEEEiS9_EEEEvT_
=========     Current Value : 0
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2000511f080c]
=========                in /lib64/libcuda.so.1

PaulMullowney avatar Mar 04 '22 21:03 PaulMullowney

This doesn't look like a nalu-wind issue. Is there a reason it's filed as a nalu-wind issue rather than a kokkos and/or tpetra issue?

alanw0 avatar Mar 07 '22 14:03 alanw0

It's not clear to me where the issue resides. These warnings are very similar to what I saw in #929 and subsequently fixed in #942. I agree that this may be a Kokkos/Trilinos problem, however one needs to run this Nalu regression test to see it.

PaulMullowney avatar Mar 07 '22 17:03 PaulMullowney

I was able to reproduce this and the segfault from #929 on ascicgpu, but it's unfortunate that this only happens on a release build. I'll see if I can trace the problem or make a simpler reproducer.

ldh4 avatar Mar 08 '22 04:03 ldh4

It is possible for memcheck to emit RAW type errors using just Kokkos core unit testing: cuda-memcheck --tool racecheck --racecheck-report all --print-limit 1000 ./KokkosCore_UnitTest_Cuda2.exe --gtest_filter=cuda.team_vector will print out lots of errors very quickly if compiled RELEASE. I can not tell if this is the same parallel_reduce warning that is seen on Summit. The function signature printed has "functor_team_reduce" in a template argument so it might be. There are also lots of warnings from other tests but it is possible that all of them are false positives.

hostname: ascicgpu17 which mpicxx: /projects/sierra/linux_rh7/SDK/mpi/openmpi/4.0.5-nvidia-11.2.0-RHEL7/bin/mpicxx Currently Loaded Modulefiles:

  1. sparc-tools/python/3.7.9
  2. sparc-tools/exodus/2021.11.26
  3. sparc-tools/aerotools/3
  4. sparc-tools/taos/2020.09.23
  5. sparc-cmake/3.19.4
  6. sparc-git/2.19.1
  7. sparc-dev/cuda-11.2.0_gcc-8.3.0_openmpi-4.0.5

overfelt avatar Mar 08 '22 23:03 overfelt

I found more examples of problems in a parallel reduce. This one occurs for a RelWithDebInfo build (gcc 10.2, cuda 11.4.2) at: https://github.com/Exawind/nalu-wind/blob/master/src/HypreUVWLinearSystem.C#L527 Notice that in each case, the read/write threads are on opposite warp boundaries.

========= WARNING: (Warp Level Programming) Potential RAW hazard detected at __shared__ 0x108 in block (40,0,0) :
=========     Write Thread (0,33,0) at 0x540 in void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelReduce<Kokkos::Impl::CudaFunctorAdapter<sierra::nalu::HypreUVWLinearSystem::copy_hypre_to_stk(stk::mesh::FieldBase *, std::vector<double, std::allocator<double>> &)::[lambda(int, double &) (instance 1)], Kokkos::RangePolicy<Kokkos::Cuda>, double, void>, Kokkos::RangePolicy<Kokkos::Cuda>, Kokkos::InvalidType, Kokkos::Cuda>>(T1)
=========     Read Thread (0,32,0) at 0x5d0 in void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelReduce<Kokkos::Impl::CudaFunctorAdapter<sierra::nalu::HypreUVWLinearSystem::copy_hypre_to_stk(stk::mesh::FieldBase *, std::vector<double, std::allocator<double>> &)::[lambda(int, double &) (instance 1)], Kokkos::RangePolicy<Kokkos::Cuda>, double, void>, Kokkos::RangePolicy<Kokkos::Cuda>, Kokkos::InvalidType, Kokkos::Cuda>>(T1)
=========     Current Value : 0
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x42e69c]
=========                in /sw/summit/cuda/11.4.2/lib64/libcuda.so.1
=========
========= WARNING: (Warp Level Programming) Potential RAW hazard detected at __shared__ 0x408 in block (2,0,0) :
=========     Write Thread (0,129,0) at 0x540 in void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelReduce<Kokkos::Impl::CudaFunctorAdapter<sierra::nalu::HypreUVWLinearSystem::copy_hypre_to_stk(stk::mesh::FieldBase *, std::vector<double, std::allocator<double>> &)::[lambda(int, double &) (instance 1)], Kokkos::RangePolicy<Kokkos::Cuda>, double, void>, Kokkos::RangePolicy<Kokkos::Cuda>, Kokkos::InvalidType, Kokkos::Cuda>>(T1)
=========     Read Thread (0,128,0) at 0x5d0 in void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelReduce<Kokkos::Impl::CudaFunctorAdapter<sierra::nalu::HypreUVWLinearSystem::copy_hypre_to_stk(stk::mesh::FieldBase *, std::vector<double, std::allocator<double>> &)::[lambda(int, double &) (instance 1)], Kokkos::RangePolicy<Kokkos::Cuda>, double, void>, Kokkos::RangePolicy<Kokkos::Cuda>, Kokkos::InvalidType, Kokkos::Cuda>>(T1)
=========     Current Value : 0
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x42e69c]
=========                in /sw/summit/cuda/11.4.2/lib64/libcuda.so.1

PaulMullowney avatar Mar 08 '22 23:03 PaulMullowney

@PaulMullowney Was this ever resolved?

jhux2 avatar Apr 29 '22 21:04 jhux2

This was resolved in Nalu-Wind by replacing the parallel_reduce with a parallel_for and returning the integer in a View. The warnings from cuda-memcheck have been fixed in a future version of Kokkos that will include [https://github.com/kokkos/kokkos/issues/4970] It is in the Kokkos develop branch which will eventually make it to Kokkos release -> Trilinos develop -> Trilinos release when we should be able to go back to the parallel_reduce if we want.

overfelt avatar Apr 30 '22 20:04 overfelt

@PaulMullowney is it okay to close this? Sounds like all has been resolved.

psakievich avatar Sep 09 '22 21:09 psakievich

Yes

PaulMullowney avatar Sep 09 '22 21:09 PaulMullowney