Trilinos
Framework: Trilinos PR ats2 build using cuda-10.1.243 on 'vortex' is irrelevant, customers now using cuda-11.2.152
Bug Report
@trilinos/framework, @rppawlo, @ccober6, @e10harvey
Internal issues
Description
The current Trilinos PR ats2 CUDA build, which uses cuda-10.1.243 on 'vortex', is irrelevant because customers (e.g. SPARC) have been using cuda-11.2.152 on 'vortex' since April 26, 2022.
Given that, as shown in these two SPARC CDash queries, SPARC switched to using cuda-11.2.152 on 2022-04-26 ...
Given the move to C++17 and to requiring CUDA 11, and the long-known problems on this system killing PR builds:
- #6861
- #7122
...
Given that a Trilinos PR CUDA 11 build has been running for some time on 'ascicgpu' machines ...
Can we ask that this problematic Trilinos PR build be removed from Trilinos PR testing? If you want to keep it for develop-to-master merge testing (and then perhaps just run as an informational build), then fine. But please remove this from blocking the merges of PRs.
CUDA 10 will be removed as part of the C++17 transition. CUDA 10 and CUDA 11 currently result in different sets of package enables. In order to preserve code coverage for the CUDA toolchain, I suggest waiting until the C++17 transition is complete before disabling CUDA 10. That being said, we can certainly disable CUDA 10 now if the product leaders want that.
What important Trilinos customer currently needs CUDA 10 support from Trilinos?
What important Trilinos customer currently needs CUDA 10 support from Trilinos?
@ccober6 ?
From my perspective, we don't want to lose code coverage with the CUDA toolchain by disabling CUDA 10 before the C++17 transition.
Vortex never should have been used to run PR builds blocking people's PRs from merging. That ATS2 system has too many known problems to subject PR testing to this system.
Plus, it is very difficult for regular developers to reproduce full builds and test runs on that system.
Since there are a bunch of packages disabled in the cuda 11 build (STK, NOX, Panzer at least), we can't drop the cuda 10 build until they are covered. Currently EMPIRE is still using cuda 10 for all of its builds and testing. They are working on changing to cuda 11 now.
https://github.com/trilinos/Trilinos/pull/11087
Currently EMPIRE is still using cuda 10 for all of its builds and testing. They are working on changing to cuda 11 now.
@rppawlo, what is going to happen to EMPIRE when Trilinos upgrades to the Kokkos version that requires C++17 and Trilinos switches to require C++17? Will EMPIRE no longer be able to use Trilinos 'develop'?
Does the Kokkos version that requires C++17 need to be integrated before Trilinos 14.0 is branched?
I am almost 100% certain that the mass test failures on vortex blocking the merge of my PRs #11093 and #11099 have nothing to do with the changes in those PRs (other than triggering the enable of all Trilinos packages).
@bartlettroscoe - David is working on updating the EMPIRE builds to C++17 and the new TPL stacks as we speak. The plan is to have it done near the end of October (hopefully earlier). We will continue to use the develop branch, but we may have a small gap in daily updating depending on the timing.
Trilinos 14.0 will include the Kokkos 4.0 release and the C++17 requirement. They were timed to happen together.
Vortex never should have been used to run PR builds blocking people's PRs from merging. That ATS2 system has too many known problems to subject PR testing to this system.
Plus, it is very difficult for regular developers to reproduce full builds and test runs on that system.
Is there another PR testing machine that is a good proxy for ATS2?
Is there another PR testing machine that is a good proxy for ATS2?
@jhux2, not that I know of.
But to be clear, I am not saying don't run automated builds on ATS2 'vortex'. I am saying don't block PR merges based on results from this flaky system; use it for nightly testing instead. The stability of ATS2 'vortex' is okay for nightly testing in my experience. (For example, the ATDM nightly build monitoring scripts filter out the failing tests for the mass failures described in #6861 and #7122 so they don't hamper the ability to effectively monitor those builds and detect new test failures.) But that does not work for PR testing: you can't just ignore 3K tests that crash due to these system issues in a PR build, because that means you have no feedback from those 3K tests. As long as this does not happen too frequently in the nightly builds, though, it is quite workable.
@jhux2 The current use of vortex was a stopgap measure to recover PR testing. When weaver burned out, vortex was the only option at the time. It was intended to be temporary until the new asc machines were set up. I believe the framework team is working on moving the builds over, but the GenConfig builds on the new asc platforms are currently disabling too many packages. STK, Panzer and a number of other packages are disabled. Once we can reenable all packages in the new cuda 11 builds, we should be able to drop vortex from PR testing. We all know it is not the right platform for PR testing - it was a desperation move because PR testing was down for almost 2 weeks at that time and that was the only other working cuda autotester build. The transition to cuda 11 should come at the end of this month, so hopefully we will be in a better place soon...
@bartlettroscoe @rppawlo Thanks, that makes a lot of sense.
I would also add that there are a bunch of individual tests disabled on CUDA 11 that really need to get resolved before we can say with confidence that Trilinos really works with CUDA 11. In addition to half the packages being disabled as @rppawlo noted :)
Why can't we set up CUDA 10.1.243 builds on the 'ascicgpu' machines? There are ATDM Trilinos builds running just fine on 'ascicgpu' machines as shown here:
- https://testing.sandia.gov/cdash/index.php?project=Trilinos&date=2022-10-07&filtercount=2&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm&field2=buildname&compare2=63&value2=cuda
@rppawlo, note that these ATDM RHEL7 CUDA 10.1.243 builds are all passing the configure and build when using find_package(CUDAToolkit), as you can see, for example, in the build Trilinos-atdm-sems-rhel7-cuda-10.1-Volta70-complex-shared-release-debug, which shows:
Processing enabled external package/TPL: CUDA (enabled explicitly, disable with -DTPL_ENABLE_CUDA=OFF)
-- Found CUDAToolkit: /projects/sems/install/rhel7-x86_64/sems/compiler/cuda/10.1/base/include (found version "10.1.243")
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
and in the build Trilinos-atdm-cee-rhel7_cuda-10.1.243_gnu-7.2.0_openmpi-4.0.3_shared_dbg, which shows:
Processing enabled external package/TPL: CUDA (enabled explicitly, disable with -DTPL_ENABLE_CUDA=OFF)
-- Found CUDAToolkit: /projects/sierra/linux_rh7/SDK/compilers/nvidia/cuda_10.1.243/include (found version "10.1.243")
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
So that is two completely different stacks of CUDA 10.1.243 working with updated Trilinos 'develop'.
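For reference, here is a minimal standalone CMake sketch (my own illustration, not Trilinos code) of the find_package(CUDAToolkit) probe that produces the "Found CUDAToolkit" lines quoted above; the project name, source file, and imported-target usage are assumptions for the example:

# FindCUDAToolkit ships with CMake 3.17 and newer.
cmake_minimum_required(VERSION 3.17)
project(cudatoolkit_probe LANGUAGES CXX)

# Locate the toolkit (via CUDAToolkit_ROOT, the CUDA compiler, or nvcc on PATH)
# and report the version, as in the "found version 10.1.243" output above.
find_package(CUDAToolkit 10.1 REQUIRED)
message(STATUS "CUDAToolkit version:  ${CUDAToolkit_VERSION}")
message(STATUS "CUDAToolkit includes: ${CUDAToolkit_INCLUDE_DIRS}")

# The same module provides imported targets such as CUDA::cudart.
add_executable(probe probe.cpp)
target_link_libraries(probe PRIVATE CUDA::cudart)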
I would also add that there are a bunch of individual tests disabled on CUDA 11 that really need to get resolved before we can say with confidence that Trilinos really works with [CUDA] 11. In addition to half the packages being disabled as @rppawlo noted :)
Then someone needs to fix that ASAP.
@bartlettroscoe @csiefer2 - I've got the cuda 11 errors figured out. The cuda 11 build is linking against a blas/lapack that is threaded and built for openmp. The new build uses this library for both blas and lapack:
libopenblas.so.0 => /projects/sems/install/rhel7-x86_64/sems/v2/tpl/openblas/0.3.10/gcc/10.1.0/base/ceyagsr/lib/libopenblas.so.0 (0x00007f35d7b4c000)
while the old cuda 10 build points to the machine installed blas:
liblapack.so.3 => /lib64/liblapack.so.3 (0x00007f954a753000)
libblas.so.3 => /lib64/libblas.so.3 (0x00007f954a4fa000)
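As a quick configure-time cross-check (a hypothetical addition on my part, not something done in this thread), the resolved TPL variables could be printed from a CMake fragment to confirm which BLAS/LAPACK a given build actually picked up:

# Hypothetical diagnostic: echo the BLAS/LAPACK libraries the configure
# resolved, to confirm whether the threaded OpenBLAS is being pulled in.
message(STATUS "TPL_BLAS_LIBRARIES   = ${TPL_BLAS_LIBRARIES}")
message(STATUS "TPL_LAPACK_LIBRARIES = ${TPL_LAPACK_LIBRARIES}")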
In genconfig ini files, we have this (note the required openmp library):
set(TPL_BLAS_LIBRARY_DIRS $ENV{CBLAS_ROOT}/lib CACHE STRING "from .ini configuration" FORCE)
set(TPL_BLAS_LIBRARIES $ENV{CBLAS_ROOT}/lib/libblas.a;-L$ENV{CBLAS_ROOT}/lib;-lgfortran;-lgomp;-lm CACHE STRING "from .ini configuration" FORCE)
set(TPL_LAPACK_LIBRARY_DIRS $ENV{LAPACK_ROOT}/lib CACHE STRING "from .ini configuration" FORCE)
set(TPL_LAPACK_LIBRARIES -L$ENV{LAPACK_ROOT}/lib;-lgfortran;-lgomp;$ENV{LAPACK_ROOT}/lib/liblapack.a CACHE STRING "from .ini configuration" FORCE)
If I comment out the lines above and use the two lines below, then all NOX tests are passing with the new cuda 11 genconfig build. Looks like we can probably reenable everything.
set(TPL_BLAS_LIBRARIES /lib64/libblas.so.3 CACHE STRING "from .ini configuration" FORCE)
set(TPL_LAPACK_LIBRARIES /lib64/liblapack.so.3 CACHE STRING "from .ini configuration" FORCE)
This explains why NOX tests that don't use anything but lapack (no kokkos, no tpetra, no epetra) were failing. The lapack linear solves are threaded and generate race conditions in our build. I got a clue when valgrind slowed the tests down and all of a sudden they started passing :) The nox examples were generating garbage directions from the lapack linear solve.
@trilinos/framework @e10harvey @jwillenbring - you either need to find a normal non-OpenMP module for blas and lapack or point to the default machine-installed ones in RHEL as I did above.
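A further stopgap one could imagine (purely my assumption, not something tried in this thread) is to pin the threaded OpenBLAS to a single thread in the affected tests' environment until the TPL change lands, for example:

# Hypothetical sketch: force single-threaded BLAS/LAPACK for one CTest test.
# OpenBLAS honors OPENBLAS_NUM_THREADS, and OpenMP-backed builds honor
# OMP_NUM_THREADS. The test name "NOX_example" is illustrative only.
set_tests_properties(NOX_example PROPERTIES
  ENVIRONMENT "OMP_NUM_THREADS=1;OPENBLAS_NUM_THREADS=1")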
That might explain the Ifpack (not Ifpack2) fails as well.
@rppawlo I'd recommend just submitting a PR for changing the ini files
@rppawlo I'd recommend just submitting a PR for changing the ini files
@rppawlo, let me know if you want any help with that. It might even be good to pair program a quick session with this.
Now that I have rephrased this Issue, I will use this as the pinned issue to alert Trilinos developers to this problem (to try to help avoid periodic inquiries like https://github.com/trilinos/Trilinos/pull/11175#issuecomment-1292439242).
Looks like the 'ats2' builds on 'vortex' are gone after the PR/MM builds upgrade that completed on the 29th. For example, you can see 'ats2' builds on 'vortex' nodes up through Nov 28 in this query, but there are no 'ats2' builds in the last three days as shown here, so this problematic build is now gone!
Closing as complete!