Daniel Arndt
> @masterleinad any thoughts on what could trigger test timeouts/hangs from #7080 in Cuda+Serial builds? It's probably hanging when trying to lock `SerialInternal::m_instance_mutex` on the same thread. Hard to diagnose...
> @masterleinad I can run the kernel-logger with one of the tests, would that be sufficient for some type of stacktrace? Probably not. If you could run `Phalanx_tViewOfViews_MPI_1` manually (assuming...
I'm having a hard time reproducing the problem ``` Start 18: Phalanx_tKokkosViewOfViews 18: Test command: /app/trilinos/build/packages/phalanx/test/Kokkos/Phalanx_tKokkosViewOfViews.exe 18: Environment variables: 18: CTEST_KOKKOS_DEVICE_TYPE=gpus 18: Test timeout computed to be: 1500 18: Teuchos::GlobalMPISession::GlobalMPISession():...
> @masterleinad just to confirm, that was a build of Trilinos with kokkos@develop? Yes, ``` cmake -DCMAKE_BUILD_TYPE=Debug -DCMAKE_CXX_COMPILER=/app/kokkos/bin/nvcc_wrapper -DKokkos_SOURCE_DIR_OVERRIDE:STRING=kokkos -DKokkos_ENABLE_CUDA=on -DKokkos_ENABLE_SERIAL=ON -DTrilinos_ENABLE_Phalanx=ON -DPhalanx_ENABLE_TESTS=ON .. ``` with ``` commit aeb9e943715be2fb1fd0062161265787e89bfd67 (HEAD,...
> @masterleinad compiling with mpi enabled may introduce different code paths, can you try a build OpenMPI? I'm seeing the hangs in jobs using openmpi/4.1.1 and 4.0.5 Doesn't seem to...
The backtrace for `TpetraCore_BlockCrsMatrix_MPI_4` looks somewhat like ``` #0 0x00007ffff4f6d66a in ?? () from /usr/lib/x86_64-linux-gnu/libopen-pal.so.40 #1 0x00007ffff4f6f78f in ?? () from /usr/lib/x86_64-linux-gnu/libopen-pal.so.40 #2 0x00007ffff4f1f764 in opal_progress () from /usr/lib/x86_64-linux-gnu/libopen-pal.so.40 #3...
The more relevant backtrace for `TpetraCore_BlockCrsMatrix_MPI_4` is ``` #0 futex_wait (private=0, expected=2, futex_word=0x55555bb46040) at ../sysdeps/nptl/futex-internal.h:146 #1 __GI___lll_lock_wait (futex=futex@entry=0x55555bb46040, private=0) at ./nptl/lowlevellock.c:49 #2 0x00007ffff53d9002 in lll_mutex_lock_optimized (mutex=0x55555bb46040) at ./nptl/pthread_mutex_lock.c:48 #3 ___pthread_mutex_lock...
https://github.com/trilinos/Trilinos/blob/77005adad6d625dbf62009620ffdc4ffa06b9fac/packages/tpetra/core/src/Tpetra_CrsGraph_decl.hpp#L2188 calls the two-argument `deep_copy` inside a kernel. That overload fences, which causes the deadlock.
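To illustrate the failure mode, here is a minimal toy model in plain C++. It is not Kokkos code: `ToySerial`, its `parallel_for`, its `fence`, and `count_blocked_fences` are hypothetical names, and the atomic flag only stands in for a per-instance lock such as `SerialInternal::m_instance_mutex`. The assumption is that the instance stays locked for the duration of a kernel, so a fence issued from inside the kernel body would wait on a lock the calling thread already holds. The sketch records that case instead of actually hanging.

```cpp
#include <atomic>
#include <cassert>
#include <functional>

// Hypothetical stand-in for a serial execution space instance (not Kokkos).
class ToySerial {
 public:
  // Runs body(i) for i in [0, n) while the instance is marked as locked.
  void parallel_for(int n, const std::function<void(int)>& body) {
    const bool was_held = in_kernel_.exchange(true);
    assert(!was_held && "nested kernel launch on the same instance");
    (void)was_held;  // silence unused-variable warnings when NDEBUG is set
    for (int i = 0; i < n; ++i) body(i);
    in_kernel_.store(false);
  }

  // Returns true if the fence can complete. Returns false if it would have
  // to wait on the lock held by the enclosing kernel on the same thread,
  // which is the point where a real mutex would deadlock.
  bool fence() {
    if (in_kernel_.load()) {
      ++blocked_fences_;
      return false;
    }
    return true;
  }

  int blocked_fences() const { return blocked_fences_; }

 private:
  std::atomic<bool> in_kernel_{false};
  int blocked_fences_ = 0;
};

// Models a fencing copy issued from inside every iteration of a kernel.
int count_blocked_fences(int n) {
  ToySerial space;
  space.parallel_for(n, [&](int) { space.fence(); });
  return space.blocked_fences();
}
```

In this model, `count_blocked_fences(3)` reports one blocked fence per iteration, while a `fence()` issued outside any kernel succeeds. With a real non-recursive mutex the first blocked fence would never return, which would match the `futex_wait` frame in the backtrace.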
Ultimately, https://github.com/trilinos/Trilinos/blob/master/packages/tpetra/core/src/Tpetra_CrsGraph_def.hpp#L5866 looks suspicious since it calls a function that is not marked `KOKKOS_FUNCTION` from inside a kernel.
I expect that these tests are already failing with the `HPX` backend.