MueLu: Unit test failures, cuda/10.1.105, UVM disabled / Re-enable no-UVM nightlies

Open ndellingwood opened this issue 2 years ago • 5 comments

Bug Report

@trilinos/muelu

Description

In builds of Trilinos with cuda/10.1.105 and Kokkos_ENABLE_CUDA_UVM=OFF the following MueLu unit tests failed:

   26 - MueLu_GeneralBlockSmoothing_MPI_4 (Failed)
   32 - MueLu_DriverDiagonalModifications_MPI_1 (Failed)
   39 - MueLu_CalcRotations_MPI_1 (Failed)
   40 - MueLu_CalcRotations_MPI_4 (Failed)

In the MueLu_GeneralBlockSmoothing_MPI_4 test, this type of error output was emitted:

...
Clearing old data (if any)
Using default factory (SmootherFactory[47] ) for building 'Smoother'.
Level 0
Setup Smoother (MueLu::Ifpack2Smoother{type = SCHWARZ})

p=1: *** Caught standard std::exception of type 'std::runtime_error' :

 Tpetra::Details::WrappedDualView (name = MV::DualView; host use_count = 3; device use_count = 2): Cannot access data on device while a host view is alive

p=2: *** Caught standard std::exception of type 'std::runtime_error' :

 Tpetra::Details::WrappedDualView (name = MV::DualView; host use_count = 3; device use_count = 2): Cannot access data on device while a host view is alive

p=0: *** Caught standard std::exception of type 'std::runtime_error' :

 Tpetra::Details::WrappedDualView (name = MV::DualView; host use_count = 3; device use_count = 2): Cannot access data on device while a host view is alive

p=3: *** Caught standard std::exception of type 'std::runtime_error' :
...
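
The use counts in the message suggest a reference-count guard: device access is refused while more handles to the host copy are alive than to the device copy. The following is only a toy sketch of that idea, assuming such a guard. ToyWrappedDualView, getHostView, getDeviceView and deviceAccessThrows are hypothetical names, and std::shared_ptr stands in for Kokkos views. It is not the Tpetra implementation.

```cpp
#include <memory>
#include <stdexcept>
#include <vector>

// Hypothetical toy stand-in, NOT Tpetra::Details::WrappedDualView.
// Two reference-counted buffers mimic the host and device copies.
struct ToyWrappedDualView {
  std::shared_ptr<std::vector<double>> host =
      std::make_shared<std::vector<double>>(4, 0.0);
  std::shared_ptr<std::vector<double>> device =
      std::make_shared<std::vector<double>>(4, 0.0);

  // Handing out a host handle raises the host use count while it is held.
  std::shared_ptr<std::vector<double>> getHostView() { return host; }

  // Assumed guard: refuse device access while extra host handles are alive.
  std::shared_ptr<std::vector<double>> getDeviceView() {
    if (host.use_count() > device.use_count()) {
      throw std::runtime_error(
          "Cannot access data on device while a host view is alive");
    }
    return device;
  }
};

// Returns true if device access throws, with or without a live host handle.
inline bool deviceAccessThrows(bool keepHostViewAlive) {
  ToyWrappedDualView v;
  auto h = v.getHostView();
  if (!keepHostViewAlive) {
    h.reset();  // drop the host handle before touching the device copy
  }
  try {
    auto d = v.getDeviceView();
    (void)d;
    return false;
  } catch (const std::runtime_error&) {
    return true;
  }
}
```

In this toy model, deviceAccessThrows(true) reproduces the failure pattern, while releasing the host handle first lets the device access through.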

Steps to Reproduce

  1. SHA1: e061ffc182ade4040e0743cd9ff988d9f87ded5b
  2. Configure script: Weaver testbed rhel7W queue
export ATDM_CONFIG_REGISTER_CUSTOM_CONFIG_DIR=${TRILINOS_DIR}/cmake/std/atdm/contributed/weaver
source ${TRILINOS_DIR}/cmake/std/atdm/load-env.sh weaver-cuda-10.1-opt
export ATDM_CONFIG_USE_NINJA=OFF
unset CUDA_LAUNCH_BLOCKING

# Configure
cmake \
 -G"Unix Makefiles" \
 -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
 -DCMAKE_INSTALL_PREFIX="${PWD}/install" \
 -DCMAKE_CXX_STANDARD="14" \
 -DCMAKE_CXX_FLAGS="-pedantic -Wall" \
 -DTrilinos_ENABLE_TESTS=OFF \
 -DTrilinos_ENABLE_ALL_PACKAGES=OFF \
 -DKokkos_ENABLE_CUDA_UVM=OFF \
 -DTrilinos_ENABLE_MueLu=ON \
 -DMueLu_ENABLE_TESTS=ON \
 -DTrilinos_ENABLE_Stokhos=ON \
$TRILINOS_DIR

ndellingwood avatar Aug 15 '22 20:08 ndellingwood

@ndellingwood At the moment, the PR tester isn't testing basically anything in the Tpetra stack with UVM disabled. So stuff like this can sneak through unimpeded.

Once we can get the current AT issues sorted out we might be able to fix this.

csiefer2 avatar Aug 15 '22 21:08 csiefer2

Adding Tpetra label so we can remember to start enabling Tpetra stuff on the non-UVM tests

csiefer2 avatar Aug 27 '22 00:08 csiefer2

First try at the autotester submitted...

csiefer2 avatar Sep 19 '22 18:09 csiefer2

@jhux2 I'm not sure if you guys are still tracking MueLu_DriverDiagonalModifications_MPI_1 down, but it can be fixed by commenting out MueLu_FilteredAFactory_def.hpp:240. The const cast of the host view (getLocalRowView on line 294) doesn't trigger the host modify flag in the dual view.
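
A minimal sketch of why a write through a const cast can go unnoticed, using a toy stand-in rather than the real Kokkos or Tpetra classes. ToyDualView, hostReadOnly, hostReadWrite and syncDevice are hypothetical names, and the sketch assumes the sync step copies data only when the host-modified flag is set.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical toy stand-in for a dual view: a host copy, a device copy,
// and a flag recording whether the host copy was modified.
struct ToyDualView {
  std::vector<double> host;
  std::vector<double> device;
  bool hostModified = false;

  explicit ToyDualView(std::size_t n) : host(n, 0.0), device(n, 0.0) {}

  // Read-only host access: does not set the host-modified flag.
  const std::vector<double>& hostReadOnly() const { return host; }

  // Read-write host access: sets the host-modified flag.
  std::vector<double>& hostReadWrite() {
    hostModified = true;
    return host;
  }

  // Copies host to device only if the flag says the host copy changed.
  void syncDevice() {
    if (hostModified) {
      device = host;
      hostModified = false;
    }
  }
};

// Writes through a const_cast of the read-only accessor, so the flag is
// never set and the sync is a no-op. Returns the (stale) device value.
inline double writeThroughConstCast() {
  ToyDualView v(1);
  const_cast<std::vector<double>&>(v.hostReadOnly())[0] = 42.0;
  v.syncDevice();
  return v.device[0];
}

// Writes through the read-write accessor, so the sync copies the new value.
inline double writeThroughReadWrite() {
  ToyDualView v(1);
  v.hostReadWrite()[0] = 42.0;
  v.syncDevice();
  return v.device[0];
}
```

In the toy model the const-cast path leaves the device copy at its old value, while the read-write path propagates the update.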

seanofthemillers avatar Sep 23 '22 21:09 seanofthemillers

@seanofthemillers Thank you, that does in fact fix it! I really appreciate the help.

jhux2 avatar Sep 23 '22 21:09 jhux2

Can this be closed?

cgcgcg avatar Dec 03 '22 21:12 cgcgcg

Can this be closed?

@ndellingwood ?

jhux2 avatar Dec 05 '22 19:12 jhux2

Yeah, the updated PR testing isn't flagging tests and the cuda/10.1.105 version is no longer relevant

ndellingwood avatar Dec 05 '22 19:12 ndellingwood