Trilinos icon indicating copy to clipboard operation
Trilinos copied to clipboard

Stokhos: Unit test failures, cuda/10.1.105, UVM disabled

Open ndellingwood opened this issue 2 years ago • 9 comments

Bug Report

@trilinos/stokhos

Description

In builds of Trilinos with cuda/10.1.05 and Kokkos_ENABLE_CUDA_UVM=OFF the following Stokhos unit tests failed:

  160 - Stokhos_TpetraCrsMatrixUQPCEUnitTest_Cuda_MPI_4 (Failed)
  172 - Stokhos_TpetraCrsMatrixMPVectorUnitTest_Cuda_MPI_4 (Failed)
  177 - Stokhos_Linear2D_Diffusion_PCE_Example_MPI_2 (Failed)
  178 - Stokhos_Linear2D_Diffusion_PCE_Interlaced_Example_MPI_2 (Failed)
  180 - Stokhos_Linear2D_Diffusion_PCE_NOX_Example_MPI_2 (Failed)
  181 - Stokhos_Linear2D_Diffusion_GMRES_Mean_Based_MPI_2 (Failed)
  182 - Stokhos_Linear2D_Diffusion_GMRES_AGS_MPI_2 (Failed)
  183 - Stokhos_Linear2D_Diffusion_CG_AGS_MPI_2 (Failed)
  184 - Stokhos_Linear2D_Diffusion_GMRES_GS_MPI_2 (Failed)
  185 - Stokhos_Linear2D_Diffusion_GMRES_AJ_MPI_2 (Failed)
  186 - Stokhos_Linear2D_Diffusion_GMRES_KP_MPI_2 (Failed)
  187 - Stokhos_Linear2D_Diffusion_GS_MPI_2 (Failed)
  188 - Stokhos_Linear2D_Diffusion_JA_MPI_2 (Failed)
  189 - Stokhos_Linear2D_Diffusion_LN_MPI_2 (Failed)
  190 - Stokhos_Linear2D_Diffusion_GSLN_MPI_2 (Failed)
  191 - Stokhos_Linear2D_Diffusion_GMRES_FA_MPI_2 (Failed)
  192 - Stokhos_Linear2D_Diffusion_GMRES_KL_MPI_2 (Failed)
  193 - Stokhos_Linear2D_Diffusion_GMRES_KLR_MPI_2 (Failed)

In the aStokhos_TpetraCrsMatrixUQPCEUnitTest_Cuda_MPI_4 test, the following error output was emitted:

...
7. Tpetra_CrsMatrix_PCE_DS_default_local_ordinal_type_default_global_ordinal_type_CudaWrapperNode_Flatten_UnitTest ... Kokkos::View ERROR: attempt to access inaccessible memory space
Backtrace:
Kokkos::View ERROR: attempt to access inaccessible memory space
...

Steps to Reproduce

  1. SHA1: e061ffc182ade4040e0743cd9ff988d9f87ded5b
  2. Configure script: Weaver testbed rhel7W queue
export ATDM_CONFIG_REGISTER_CUSTOM_CONFIG_DIR=${TRILINOS_DIR}/cmake/std/atdm/contributed/weaver
source ${TRILINOS_DIR}/cmake/std/atdm/load-env.sh weaver-cuda-10.1-opt
export ATDM_CONFIG_USE_NINJA=OFF
unset CUDA_LAUNCH_BLOCKING

# Configure
cmake \
 -G"Unix Makefiles" \
 -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
 -DCMAKE_INSTALL_PREFIX="${PWD}/install" \
 -DCMAKE_CXX_STANDARD="14" \
 -DCMAKE_CXX_FLAGS="-pedantic -Wall" \
 -DTrilinos_ENABLE_TESTS=OFF \
 -DTrilinos_ENABLE_ALL_PACKAGES=OFF \
 -DKokkos_ENABLE_CUDA_UVM=OFF \
 -DTrilinos_ENABLE_MueLu=ON \
 -DTrilinos_ENABLE_Stokhos=ON \
 -DStokhos_ENABLE_TESTS=ON \
$TRILINOS_DIR

ndellingwood avatar Aug 15 '22 20:08 ndellingwood

Unfortunately, I don't think there has been any attempt to make Stokhos work with UVM disabled. It is unclear how much work would be involved in making it work without UVM.

etphipp avatar Aug 15 '22 21:08 etphipp

I think at the minimum we would just have to disable all of the PCE stuff. It works similarly to DFad in Sacado, which is pretty much impossible to make work without UVM.

etphipp avatar Aug 15 '22 21:08 etphipp

Thanks for the update @etphipp ! I'll drop the bug label and disable these tests when configuring with UVM off

ndellingwood avatar Aug 15 '22 21:08 ndellingwood

When I get a chance I will try and figure out a more fine-grained set of tests that need to be disabled or refactored.

etphipp avatar Aug 16 '22 15:08 etphipp

Just in case this may be helpful, just wanted to note that @romintomasetti and I regularly compile Trilinos with Stokhos in cuda with uvm, cuda without uvm, and rocm builds.

On our side:

  • we disable the pce scalar type by setting the cmake variable Stokhos_ENABLE_PCE_Scalar_Type to OFF.
  • we disable the test Kokkos_SG_SpMv, CrsMatrixFree in Stokhos_KokkosArrayKernelsUnitTestDecl.hpp.
  • we disable the specialisations in stokhos/src/kokkos/Cuda/Stokhos_Cuda_CrsMatrix.hpp because they call functionalities that are no longer available in cudasparse. We use cuda 11.6.
  • we add a specialisation of Kokkos::impl::promote for the Sacado::MP::Vector to make the compilation with the hip compiler succeed. This specialisation is still needed even after PR #10779.
  • we add fmin and fmax to the functions that are defined for the Sacado::MP::Vector. This was needed when min and max weren't found correctly in the overload resolution. But this issue with min/max was solved in PR #10779.

When we last compiled with the develop branch from Trilinos a few days ago, all the builds went through and all stokhos tests (up to the one mentioned above and not considering the pce ones) passed.

We'd be happy to provide more details and to collaborate on it :)

maartenarnst avatar Sep 02 '22 06:09 maartenarnst

@trilinos/framework I am working on resolving these failures related to UVM. However, some of these tests appear to already get disabled on at least some architectures and not others and I am trying to understand why. Is there a list of tests somewhere in Tribits that might be disabling these tests?

etphipp avatar Oct 10 '22 21:10 etphipp

@etphipp If you're using the GenConfig tool, the tests that get disabled are in packages/framework/ini-files/config-specs.ini.

jhux2 avatar Oct 11 '22 00:10 jhux2

@etphipp If you're using the GenConfig tool, the tests that get disabled are in packages/framework/ini-files/config-specs.ini.

I'm not using GenConfig. I added a print statement right before the call to TRIBITS_ADD_EXECUTABLE_AND_TEST, so I know CMake is getting to that point, yet the test does appear when I run ctest. My guess is it is getting disabled somewhere else, but I can't figure out where.

etphipp avatar Oct 11 '22 15:10 etphipp

@trilinos/tpetra I am trying to get these tests to work without UVM, but am running into issues with Tpetra::CrsGraph::getLocalRowView() throwing exceptions that a device view is still alive. There are no device views active in the calling code, and I don't see anyway to sync the graph to the host first. Can someone explain how I am supposed to get column indices for a given row out of a graph in a safe manner?

etphipp avatar Oct 11 '22 17:10 etphipp

@maartenarnst Can you send me the diffs or create a PR for the Kokkos::impl::promote specializations you needed for compiling on HIP? PR #11165 should fix everything else.

etphipp avatar Oct 19 '22 22:10 etphipp

Hi @etphipp. That's great that you made the fixes. @romintomasetti and I will integrate your PR in our builds and we'll be sure to report back.

This is the diff for the Kokkos::impl::promote specialisation

From d512fa9ae5ab7ddd9f69b30e1eec4045be1c9e06 Mon Sep 17 00:00:00 2001
From: Maarten Arnst <[email protected]>
Date: Tue, 30 Aug 2022 21:30:28 +0200
Subject: [PATCH] Add kokkos promote specialisation for mp vector in rocm
 builds.

---
 .../src/sacado/kokkos/vector/Sacado_MP_Vector.hpp     | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/packages/stokhos/src/sacado/kokkos/vector/Sacado_MP_Vector.hpp b/packages/stokhos/src/sacado/kokkos/vector/Sacado_MP_Vector.hpp
index adafac26aa6..15ad2a0f1b2 100644
--- a/packages/stokhos/src/sacado/kokkos/vector/Sacado_MP_Vector.hpp
+++ b/packages/stokhos/src/sacado/kokkos/vector/Sacado_MP_Vector.hpp
@@ -2085,6 +2085,17 @@ struct reduction_identity< Sacado::MP::Vector<Storage> > {
 
 }
 
+#ifdef __HIPCC__
+namespace Kokkos {
+    namespace Impl {
+        template <typename Storage>
+        struct promote<Sacado::MP::Vector<Storage>,false> {
+            using type = typename Sacado::MP::Vector<Storage>;
+        };
+    }
+}
+#endif //  __HIPCC__
+
 #endif // HAVE_STOKHOS_SACADO
 
 #endif // SACADO_MP_VECTOR_HPP
-- 
2.32.0 (Apple Git-132)

BTW, the following diff may also be useful to integrate for compiling on HIP. It's related to Stokhos' raise_error function, which uses ifdefs to protect a throw in cuda device code, and the diff basically adds the corresponding HIP case:

From e1f0e7f47dc4036b4f58093a6f211fe803b442e8 Mon Sep 17 00:00:00 2001
From: Maarten Arnst <[email protected]>
Date: Tue, 6 Sep 2022 19:13:53 +0200
Subject: [PATCH] Generalise (to rocm) raise_error function in stokhos.

---
 packages/stokhos/src/sacado/kokkos/Kokkos_View_Utils.hpp | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/packages/stokhos/src/sacado/kokkos/Kokkos_View_Utils.hpp b/packages/stokhos/src/sacado/kokkos/Kokkos_View_Utils.hpp
index 9db40e0bd6b..80a64967ad3 100644
--- a/packages/stokhos/src/sacado/kokkos/Kokkos_View_Utils.hpp
+++ b/packages/stokhos/src/sacado/kokkos/Kokkos_View_Utils.hpp
@@ -53,11 +53,9 @@ namespace Impl {
 KOKKOS_INLINE_FUNCTION
 void raise_error(const char *msg)
 {
-#if defined(__CUDACC__) && defined(__CUDA_ARCH__)
-  Kokkos::abort(msg);
-#else
-  throw std::runtime_error(msg);
-#endif
+    KOKKOS_IF_ON_HOST(throw std::runtime_error(msg);)
+
+    KOKKOS_IF_ON_DEVICE(Kokkos::abort(msg);)
 }
 
 template< class T , class Device > struct RebindStokhosStorageDevice ;
-- 
2.25.1

maartenarnst avatar Oct 21 '22 07:10 maartenarnst

Thanks @maartenarnst! These changes are in PR #11198

etphipp avatar Oct 26 '22 22:10 etphipp

I believe this is resolved now so closing. If you still run into trouble, please re-open.

etphipp avatar Nov 28 '22 21:11 etphipp