Broken tests on Weaver
A bunch of the nightly tests on Weaver started failing:
https://sems-cdash-son.sandia.gov/cdash/test/2685897
[weaver3:23006:0:23006] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2000bf9cfa80)
[weaver3:23004:0:23004] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x200093dcfc80)
[weaver3:23005:0:23005] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2000bfbcfc80)
==== backtrace (tid: 23006) ====
0 0x0000000000049bb4 uct_ep_am_short() /ascldap/users/sdhammo/git/ucx-github-repo/src/uct/api/uct.h:2009
1 0x0000000000049bb4 ucp_tag_send_inline() /ascldap/users/sdhammo/git/ucx-github-repo/src/ucp/tag/tag_send.c:155
2 0x0000000000049bb4 ucp_tag_send_nbr() /ascldap/users/sdhammo/git/ucx-github-repo/src/ucp/tag/tag_send.c:229
3 0x0000000000006808 mca_pml_ucx_send() ???:0
4 0x00000000000c81a8 MPI_Send() ???:0
5 0x00000000162f4978 Teuchos::(anonymous namespace)::sendImpl<double>() ???:0
6 0x00000000162d4218 Teuchos::send<int, double>() ???:0
7 0x0000000015b4799c Tpetra::Details::DistributorActor::doPosts<Kokkos::View<double const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<0u> >, Kokkos::View<double*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> >() ???:0
8 0x0000000015b49a24 Tpetra::DistObject<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::doPosts() ???:0
9 0x0000000015b4c0a4 Tpetra::DistObject<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::beginTransfer() ???:0
10 0x0000000015b4f470 Tpetra::DistObject<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::beginExport() ???:0
11 0x0000000015b4f6dc Tpetra::DistObject<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::doExport() ???:0
12 0x0000000015820910 Tpetra::FEMultiVector<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::doOwnedPlusSharedToOwned() ???:0
13 0x0000000015820d7c Tpetra::FEMultiVector<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::endFill() ???:0
14 0x000000001582109c Tpetra::FEMultiVector<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::endAssembly() ???:0
15 0x0000000010f2a4b8 Albany::ThyraCrsMatrixFactory::fillComplete() ???:0
16 0x0000000010e76214 Albany::STKDiscretization::computeGraphs() ???:0
17 0x0000000010e89994 Albany::STKDiscretization::updateMesh() ???:0
18 0x0000000010b04ef4 Albany::DiscretizationFactory::completeDiscSetup() ???:0
19 0x0000000010b05218 Albany::DiscretizationFactory::createDiscretization() ???:0
==== backtrace (tid: 23004) ====
0 0x0000000000049bb4 uct_ep_am_short() /ascldap/users/sdhammo/git/ucx-github-repo/src/uct/api/uct.h:2009
1 0x0000000000049bb4 ucp_tag_send_inline() /ascldap/users/sdhammo/git/ucx-github-repo/src/ucp/tag/tag_send.c:155
2 0x0000000000049bb4 ucp_tag_send_nbr() /ascldap/users/sdhammo/git/ucx-github-repo/src/ucp/tag/tag_send.c:229
3 0x0000000000006808 mca_pml_ucx_send() ???:0
4 0x00000000000c81a8 MPI_Send() ???:0
5 0x00000000162f4978 Teuchos::(anonymous namespace)::sendImpl<double>() ???:0
6 0x00000000162d4218 Teuchos::send<int, double>() ???:0
7 0x0000000015b4799c Tpetra::Details::DistributorActor::doPosts<Kokkos::View<double const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<0u> >, Kokkos::View<double*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> >() ???:0
8 0x0000000015b49a24 Tpetra::DistObject<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::doPosts() ???:0
9 0x0000000015b4c0a4 Tpetra::DistObject<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::beginTransfer() ???:0
10 0x0000000015b4f470 Tpetra::DistObject<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::beginExport() ???:0
11 0x0000000015b4f6dc Tpetra::DistObject<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::doExport() ???:0
12 0x0000000015820910 Tpetra::FEMultiVector<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::doOwnedPlusSharedToOwned() ???:0
13 0x0000000015820d7c Tpetra::FEMultiVector<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::endFill() ???:0
14 0x000000001582109c Tpetra::FEMultiVector<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::endAssembly() ???:0
15 0x0000000010f2a4b8 Albany::ThyraCrsMatrixFactory::fillComplete() ???:0
16 0x0000000010e76214 Albany::STKDiscretization::computeGraphs() ???:0
17 0x0000000010e89994 Albany::STKDiscretization::updateMesh() ???:0
18 0x0000000010b04ef4 Albany::DiscretizationFactory::completeDiscSetup() ???:0
19 0x0000000010b05218 Albany::DiscretizationFactory::createDiscretization() ???:0
20 0x00000000106db73c Albany::Application::createDiscretization() ???:0
21 0x00000000106eb7d0 Albany::Application::Application() ???:0
22 0x000000001032f534 Albany::SolverFactory::createApplication() ???:0
23 0x000000001014f2c0 main() ???:0
24 0x0000000000025100 generic_start_main.isra.0() libc-start.c:0
25 0x00000000000252f4 __libc_start_main() ???:0
20 0x00000000106db73c Albany::Application::createDiscretization() ???:0
21 0x00000000106eb7d0 Albany::Application::Application() ???:0
22 0x000000001032f534 Albany::SolverFactory::createApplication() ???:0
23 0x000000001014f2c0 main() ???:0
24 0x0000000000025100 generic_start_main.isra.0() libc-start.c:0
25 0x00000000000252f4 __libc_start_main() ???:0
=================================
=================================
@jewatkins, @mcarlson801: could one of you please have a look?
I'll take a look at this when I have a chance and see what I can find.
Excellent, thanks, @mcarlson801 !
I've confirmed that this issue comes from the recent Trilinos changes on July 18th related to TriBITS. The failing tests seem to be fixed after updating cmake to version >= 3.23. This is failing on Weaver since the newest version of cmake available through modules is 3.21 (and I think currently the module file loads 3.19).
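To double-check what is actually available on a Weaver login node, the module system can be queried directly; a quick sketch using standard environment-modules commands (nothing here is copied from the Weaver configuration):

```
# List the CMake versions the module system offers and show what is loaded.
module avail cmake
module list

# Confirm which cmake ends up on the PATH; the recent TriBITS changes need >= 3.23.
which cmake
cmake --version
```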
Could you send weaver-help an email letting them know we need a newer version of cmake? We can then start loading the newer version.
Yep, just sent it out.
Weaver now has cmake 3.23 and I've updated the weaver_cuda_modules.sh file in docs/dashboards to load it. I assume it needs to be updated somewhere else as well, since the change isn't yet reflected in the nightly tests. Can anyone point me to the modules file that still needs to be updated?
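For reference, the change to doc/dashboards/weaver.sandia.gov/weaver_cuda_modules.sh amounts to swapping which CMake module gets loaded; roughly like this (the version strings below are illustrative, not copied from the actual module file):

```
# Drop whatever CMake module was loaded before and load the newly installed 3.23.
module unload cmake
module load cmake/3.23.1
```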
Thanks @mcarlson801 ! I just updated the modules used in the nightlies on the machine. Will see what happens with them tomorrow.
@mcarlson801 : were the changes in modules supposed to fix the broken tests? It doesn't look like they did.
Hmmm, that fixed it for my build. I'll take another look to see what I missed.
The nightlies occur here: /home/projects/albany/nightlyCDashWeaver. You should be able to check the log files there.
Also, this is the cronjob that runs on weaver: https://github.com/sandialabs/Albany/blob/master/doc/dashboards/weaver.sandia.gov/cronjob .
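For anyone unfamiliar with that setup, a cron entry driving a nightly script generally looks like the line below; this is purely illustrative, and the real schedule and script name are in the linked cronjob file:

```
# minute hour day-of-month month day-of-week  command   (illustrative only)
0 0 * * * cd /home/projects/albany/nightlyCDashWeaver && ./nightly_cron_script.sh > cron.log 2>&1
```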
I followed the procedure outlined in https://github.com/trilinos/Trilinos/issues/10774#issuecomment-1189644050 for the Weaver build with CMake 3.23 and got the following error. @bartlettroscoe Any suggestions for fixing this?
Target "Albany" has LINK_LIBRARIES_ONLY_TARGETS enabled, but it links to:
mpi_usempif08
which is not a target. Possible reasons include:
* There is a typo in the target name.
* A find_package call is missing for an IMPORTED target.
* An ALIAS target is missing.
CMake Error at src/CMakeLists.txt:779 (target_link_libraries):
Target "AlbanyAnalysis" has LINK_LIBRARIES_ONLY_TARGETS enabled, but it
links to:
mpi_usempif08
which is not a target. Possible reasons include:
* There is a typo in the target name.
* A find_package call is missing for an IMPORTED target.
* An ALIAS target is missing.
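The check being tripped here is CMake's LINK_LIBRARIES_ONLY_TARGETS target property (new in CMake 3.23), which rejects bare library names such as mpi_usempif08 in target_link_libraries(). As a diagnostic only, and assuming the property is being seeded through the standard CMAKE_LINK_LIBRARIES_ONLY_TARGETS variable, the check can be switched off at configure time; a rough sketch:

```
# Diagnostic workaround only, not the proper fix: disable the targets-only
# link check when configuring Albany (all other configure options omitted).
cmake \
  -D CMAKE_LINK_LIBRARIES_ONLY_TARGETS=OFF \
  /path/to/Albany

# The longer-term fix is for the offending target_link_libraries() call to use
# an imported target such as MPI::MPI_Fortran (or to rely on the MPI compiler
# wrappers) instead of the bare library name mpi_usempif08.
```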
@mcarlson801, can you please attach the Trilinos configure script, the cmake STDOUT & STDERR, and the CMakeCache.txt file and I will take a look? Otherwise, where is mpi_usempif08 coming from?
@bartlettroscoe Here's everything for Trilinos and Albany:
Albany_CMakeCache.txt
albany_configure_output.txt
do-cmake-albany.txt
do-cmake-weaver-trilinos.txt
loaded_modules_weaver.txt
Trilinos_CMakeCache.txt
trilinos_configure_output.txt
@bartlettroscoe Is Trilinos policy now to use "raw" compilers (e.g., g++ or icpc) and link against MPI libraries (properly wrapped in imported/interface targets)? If so, that might hint at us still using mpicxx and the like, rather than g++/icpc. Perhaps we need to switch to g++/icpc and link against MPI (which might be superfluous, since Trilinos already links against it, so we should get that PUBLIC library automagically)?
@bartlettroscoe Is Trilinos policy now to use "raw" compilers (e.g., g++, or icpc) and link against MPI libraries (properly wrapped in imported/interface targets)?
@bartgol, absolutely not. The Trilinos policy is to, by default, use the MPI compiler wrappers provided by the MPI installation as per section a. Configuring build using MPI compiler wrappers at:
- https://docs.trilinos.org/files/TrilinosBuildReference.html#configuring-with-mpi-support
However, it is possible to use the raw compilers but then you are on your own to figure out the various compile and link flags to make this work (see the section b. Configuring to build using raw compilers and flags/libraries).
Unless you have a really good reason, use the MPI compiler wrappers since that is how they design these systems to be used.
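For completeness, the wrapper-based route from the linked build reference boils down to pointing TriBITS at the MPI installation and letting it find mpicc/mpicxx/mpifort itself; a minimal sketch (paths are placeholders and all other configure options are omitted):

```
# Configure Trilinos using the MPI compiler wrappers from the MPI installation.
cmake \
  -D TPL_ENABLE_MPI=ON \
  -D MPI_BASE_DIR=/path/to/mpi/install \
  /path/to/Trilinos
```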
Ok, thanks!
How did this get fixed?
Quite honestly, I have no idea...
Okay, as shown in this query the test started passing on testing day 2022-08-13. There was a TriBITS-related Trilinos merge of PR trilinos/Trilinos#10813 on 2022-08-12 which also merged the PR trilinos/Trilinos#10791 (which shows as closed but really those changes were merged to 'develop'). It is very possible that one of those changes impacted this CUDA build, but I would have thought that it would have fixed a build failure, not a test failure (see trilinos/Trilinos#10791).
So it seems it was likely a fix to TriBITS merged to Trilinos 'develop' on 2022-08-12 with PRs trilinos/Trilinos#10813 and trilinos/Trilinos#10791 that caused this test to pass.