Trilinos
Trilinos copied to clipboard
Stokhos: Unit test failures, cuda/10.1.105, UVM disabled
Bug Report
@trilinos/stokhos
Description
In builds of Trilinos with cuda/10.1.05 and Kokkos_ENABLE_CUDA_UVM=OFF the following Stokhos unit tests failed:
160 - Stokhos_TpetraCrsMatrixUQPCEUnitTest_Cuda_MPI_4 (Failed)
172 - Stokhos_TpetraCrsMatrixMPVectorUnitTest_Cuda_MPI_4 (Failed)
177 - Stokhos_Linear2D_Diffusion_PCE_Example_MPI_2 (Failed)
178 - Stokhos_Linear2D_Diffusion_PCE_Interlaced_Example_MPI_2 (Failed)
180 - Stokhos_Linear2D_Diffusion_PCE_NOX_Example_MPI_2 (Failed)
181 - Stokhos_Linear2D_Diffusion_GMRES_Mean_Based_MPI_2 (Failed)
182 - Stokhos_Linear2D_Diffusion_GMRES_AGS_MPI_2 (Failed)
183 - Stokhos_Linear2D_Diffusion_CG_AGS_MPI_2 (Failed)
184 - Stokhos_Linear2D_Diffusion_GMRES_GS_MPI_2 (Failed)
185 - Stokhos_Linear2D_Diffusion_GMRES_AJ_MPI_2 (Failed)
186 - Stokhos_Linear2D_Diffusion_GMRES_KP_MPI_2 (Failed)
187 - Stokhos_Linear2D_Diffusion_GS_MPI_2 (Failed)
188 - Stokhos_Linear2D_Diffusion_JA_MPI_2 (Failed)
189 - Stokhos_Linear2D_Diffusion_LN_MPI_2 (Failed)
190 - Stokhos_Linear2D_Diffusion_GSLN_MPI_2 (Failed)
191 - Stokhos_Linear2D_Diffusion_GMRES_FA_MPI_2 (Failed)
192 - Stokhos_Linear2D_Diffusion_GMRES_KL_MPI_2 (Failed)
193 - Stokhos_Linear2D_Diffusion_GMRES_KLR_MPI_2 (Failed)
In the aStokhos_TpetraCrsMatrixUQPCEUnitTest_Cuda_MPI_4
test, the following error output was emitted:
...
7. Tpetra_CrsMatrix_PCE_DS_default_local_ordinal_type_default_global_ordinal_type_CudaWrapperNode_Flatten_UnitTest ... Kokkos::View ERROR: attempt to access inaccessible memory space
Backtrace:
Kokkos::View ERROR: attempt to access inaccessible memory space
...
Steps to Reproduce
- SHA1: e061ffc182ade4040e0743cd9ff988d9f87ded5b
- Configure script: Weaver testbed rhel7W queue
export ATDM_CONFIG_REGISTER_CUSTOM_CONFIG_DIR=${TRILINOS_DIR}/cmake/std/atdm/contributed/weaver
source ${TRILINOS_DIR}/cmake/std/atdm/load-env.sh weaver-cuda-10.1-opt
export ATDM_CONFIG_USE_NINJA=OFF
unset CUDA_LAUNCH_BLOCKING
# Configure
cmake \
-G"Unix Makefiles" \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DCMAKE_INSTALL_PREFIX="${PWD}/install" \
-DCMAKE_CXX_STANDARD="14" \
-DCMAKE_CXX_FLAGS="-pedantic -Wall" \
-DTrilinos_ENABLE_TESTS=OFF \
-DTrilinos_ENABLE_ALL_PACKAGES=OFF \
-DKokkos_ENABLE_CUDA_UVM=OFF \
-DTrilinos_ENABLE_MueLu=ON \
-DTrilinos_ENABLE_Stokhos=ON \
-DStokhos_ENABLE_TESTS=ON \
$TRILINOS_DIR
Unfortunately, I don't think there has been any attempt to make Stokhos work with UVM disabled. It is unclear how much work would be involved in making it work without UVM.
I think at the minimum we would just have to disable all of the PCE stuff. It works similarly to DFad in Sacado, which is pretty much impossible to make work without UVM.
Thanks for the update @etphipp ! I'll drop the bug label and disable these tests when configuring with UVM off
When I get a chance I will try and figure out a more fine-grained set of tests that need to be disabled or refactored.
Just in case this may be helpful, just wanted to note that @romintomasetti and I regularly compile Trilinos with Stokhos in cuda with uvm, cuda without uvm, and rocm builds.
On our side:
- we disable the pce scalar type by setting the cmake variable
Stokhos_ENABLE_PCE_Scalar_Type
toOFF
. - we disable the test
Kokkos_SG_SpMv, CrsMatrixFree
inStokhos_KokkosArrayKernelsUnitTestDecl.hpp
. - we disable the specialisations in
stokhos/src/kokkos/Cuda/Stokhos_Cuda_CrsMatrix.hpp
because they call functionalities that are no longer available in cudasparse. We use cuda 11.6. - we add a specialisation of
Kokkos::impl::promote
for theSacado::MP::Vector
to make the compilation with the hip compiler succeed. This specialisation is still needed even after PR #10779. - we add
fmin
andfmax
to the functions that are defined for theSacado::MP::Vector
. This was needed whenmin
andmax
weren't found correctly in the overload resolution. But this issue withmin
/max
was solved in PR #10779.
When we last compiled with the develop branch from Trilinos a few days ago, all the builds went through and all stokhos tests (up to the one mentioned above and not considering the pce ones) passed.
We'd be happy to provide more details and to collaborate on it :)
@trilinos/framework I am working on resolving these failures related to UVM. However, some of these tests appear to already get disabled on at least some architectures and not others and I am trying to understand why. Is there a list of tests somewhere in Tribits that might be disabling these tests?
@etphipp If you're using the GenConfig tool, the tests that get disabled are in packages/framework/ini-files/config-specs.ini
.
@etphipp If you're using the GenConfig tool, the tests that get disabled are in
packages/framework/ini-files/config-specs.ini
.
I'm not using GenConfig. I added a print statement right before the call to TRIBITS_ADD_EXECUTABLE_AND_TEST
, so I know CMake is getting to that point, yet the test does appear when I run ctest. My guess is it is getting disabled somewhere else, but I can't figure out where.
@trilinos/tpetra I am trying to get these tests to work without UVM, but am running into issues with Tpetra::CrsGraph::getLocalRowView()
throwing exceptions that a device view is still alive. There are no device views active in the calling code, and I don't see anyway to sync the graph to the host first. Can someone explain how I am supposed to get column indices for a given row out of a graph in a safe manner?
@maartenarnst Can you send me the diffs or create a PR for the Kokkos::impl::promote
specializations you needed for compiling on HIP? PR #11165 should fix everything else.
Hi @etphipp. That's great that you made the fixes. @romintomasetti and I will integrate your PR in our builds and we'll be sure to report back.
This is the diff for the Kokkos::impl::promote
specialisation
From d512fa9ae5ab7ddd9f69b30e1eec4045be1c9e06 Mon Sep 17 00:00:00 2001
From: Maarten Arnst <[email protected]>
Date: Tue, 30 Aug 2022 21:30:28 +0200
Subject: [PATCH] Add kokkos promote specialisation for mp vector in rocm
builds.
---
.../src/sacado/kokkos/vector/Sacado_MP_Vector.hpp | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/packages/stokhos/src/sacado/kokkos/vector/Sacado_MP_Vector.hpp b/packages/stokhos/src/sacado/kokkos/vector/Sacado_MP_Vector.hpp
index adafac26aa6..15ad2a0f1b2 100644
--- a/packages/stokhos/src/sacado/kokkos/vector/Sacado_MP_Vector.hpp
+++ b/packages/stokhos/src/sacado/kokkos/vector/Sacado_MP_Vector.hpp
@@ -2085,6 +2085,17 @@ struct reduction_identity< Sacado::MP::Vector<Storage> > {
}
+#ifdef __HIPCC__
+namespace Kokkos {
+ namespace Impl {
+ template <typename Storage>
+ struct promote<Sacado::MP::Vector<Storage>,false> {
+ using type = typename Sacado::MP::Vector<Storage>;
+ };
+ }
+}
+#endif // __HIPCC__
+
#endif // HAVE_STOKHOS_SACADO
#endif // SACADO_MP_VECTOR_HPP
--
2.32.0 (Apple Git-132)
BTW, the following diff may also be useful to integrate for compiling on HIP. It's related to Stokhos' raise_error function, which uses ifdefs to protect a throw in cuda device code, and the diff basically adds the corresponding HIP case:
From e1f0e7f47dc4036b4f58093a6f211fe803b442e8 Mon Sep 17 00:00:00 2001
From: Maarten Arnst <[email protected]>
Date: Tue, 6 Sep 2022 19:13:53 +0200
Subject: [PATCH] Generalise (to rocm) raise_error function in stokhos.
---
packages/stokhos/src/sacado/kokkos/Kokkos_View_Utils.hpp | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/packages/stokhos/src/sacado/kokkos/Kokkos_View_Utils.hpp b/packages/stokhos/src/sacado/kokkos/Kokkos_View_Utils.hpp
index 9db40e0bd6b..80a64967ad3 100644
--- a/packages/stokhos/src/sacado/kokkos/Kokkos_View_Utils.hpp
+++ b/packages/stokhos/src/sacado/kokkos/Kokkos_View_Utils.hpp
@@ -53,11 +53,9 @@ namespace Impl {
KOKKOS_INLINE_FUNCTION
void raise_error(const char *msg)
{
-#if defined(__CUDACC__) && defined(__CUDA_ARCH__)
- Kokkos::abort(msg);
-#else
- throw std::runtime_error(msg);
-#endif
+ KOKKOS_IF_ON_HOST(throw std::runtime_error(msg);)
+
+ KOKKOS_IF_ON_DEVICE(Kokkos::abort(msg);)
}
template< class T , class Device > struct RebindStokhosStorageDevice ;
--
2.25.1
Thanks @maartenarnst! These changes are in PR #11198
I believe this is resolved now so closing. If you still run into trouble, please re-open.