
Update CI platforms and compilers

Open · rrsettgast opened this issue 1 year ago • 12 comments

We need to update CI platforms and compilers.

  • On LLNL/Quartz we have:

    • gcc12
    • clang14
  • On LLNL/Lassen we have:

    • gcc8
    • clang14
    • cuda10
    • cuda11
    • cuda12
  • On TotalEnergies/Pangea3 we have:

    • gcc8
    • gcc9
    • cuda10
    • cuda11
  • On Frontier we have:

    • clang??
    • rocm??
  • On ElCap we will have:

    • clang16??
    • rocm ??

Proposed Permutations:

  • ubuntu22

    • gcc11
    • gcc11 + cuda11
    • clang14
    • clang14 + cuda11
  • TOSS4 (built on RHEL 8.8)

    • Which Linux distribution? ubi8.8?
    • gcc12
    • gcc12 + cuda12
    • clang15
    • clang15 + cuda12
  • TotalEnergies/Cypress

    • gcc8
    • gcc10
    • gcc12
    • cuda10
    • cuda11
    • cuda12

rrsettgast · Dec 21 '23 18:12

Chevron is currently using

  • GCC 11.2 and 11.4 (well tested and broadly used)

  • GCC 13.2 (not as extensively tested but no build or run failures so far)

  • OpenMPI HPC-X (v14.1 in broad use and some v17.1)

For GPU GEOS we have been using

  • CUDA 11.2, 11.4: GEOS stopped building with GCC 11.x past commit 95aea4cb2 (we'll be testing with GCC 12.x)
  • HPC-X v14.1 (mostly)

drmichaeltcvx · Jan 11 '24 15:01

For GPU GEOS we have been using

  • CUDA 11.2, 11.4: GEOS stopped building with GCC 11.x past commit 95aea4c (we'll be testing with GCC 12.x)

I bet there's an issue there

  GEOS_HOST_DEVICE
  virtual real64 getShearModulus( localIndex const k ) const override final
  {
    return std::max( std::max( m_c44[k], m_c55[k] ), m_c66[k] );
  }

You can't call a std function on device, so it's normal that it fails. This should not have been merged; you should use LvArray::math::max instead, as sketched below. See e.g. https://github.com/GEOS-DEV/GEOS/pull/2927. @CusiniM, maybe we should be stricter in our review process? I also do not understand how it got through the CI. Maybe some over-relaxed compilation parameters?
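
A minimal sketch of the intended fix, mirroring the snippet above (it assumes LvArray::math::max is in scope via the usual GEOS headers; the include is omitted here, as in the original excerpt):

  GEOS_HOST_DEVICE
  virtual real64 getShearModulus( localIndex const k ) const override final
  {
    // LvArray::math::max is callable from both host and device code,
    // unlike std::max, which has no __device__ overload under nvcc.
    return LvArray::math::max( LvArray::math::max( m_c44[k], m_c55[k] ), m_c66[k] );
  }

As for how this passed CI: one plausible explanation is that std::max has been constexpr since C++14, and the build lines later in this thread pass --expt-relaxed-constexpr, which lets device code call constexpr host functions, so nvcc can accept the call without a diagnostic.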

TotoGaz · Jan 11 '24 18:01

For GPU GEOS we have been using

  • CUDA 11.2, 11.4: GEOS stopped building with GCC 11.x past commit 95aea4c (we'll be testing with GCC 12.x)

I bet there's an issue there

  GEOS_HOST_DEVICE
  virtual real64 getShearModulus( localIndex const k ) const override final
  {
    return std::max( std::max( m_c44[k], m_c55[k] ), m_c66[k] );
  }

You can't call a std function on device, so it's normal that it fails. This should not have been merged; you should use LvArray::math::max instead. See e.g. #2927. @CusiniM, maybe we should be stricter in our review process? I also do not understand how it got through the CI. Maybe some over-relaxed compilation parameters?

How it passed the CI beats me...

CusiniM · Jan 11 '24 19:01

For GPU GEOS we have been using

  • CUDA 11.2, 11.4: GEOS stopped building with GCC 11.x past commit 95aea4c (we'll be testing with GCC 12.x)

I bet there's an issue there

  GEOS_HOST_DEVICE
  virtual real64 getShearModulus( localIndex const k ) const override final
  {
    return std::max( std::max( m_c44[k], m_c55[k] ), m_c66[k] );
  }

You can't call a std function on device, so it's normal that it fails. This should not have been merged; you should use LvArray::math::max instead. See e.g. #2927. @CusiniM, maybe we should be stricter in our review process? I also do not understand how it got through the CI. Maybe some over-relaxed compilation parameters?

Thanks, Thomas, but those specific calls seem to have already been fixed in https://github.com/GEOS-DEV/GEOS/pull/2812; the build still fails for Michael, though.

paveltomin · Jan 11 '24 20:01

Here is where the build process (host compilers GCC 11.x and 12.x) fails, at the link stage:

Consolidate compiler generated dependencies of target testToolchain
make[3]: Leaving directory `/dev/shm/mtml/src/GEOS/GEOS/build-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo'
make  -f coreComponents/unitTests/toolchain/CMakeFiles/testToolchain.dir/build.make coreComponents/unitTests/toolchain/CMakeFiles/testToolchain.dir/build
make[3]: Entering directory `/dev/shm/mtml/src/GEOS/GEOS/build-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo'
[100%] Linking CUDA device code CMakeFiles/testToolchain.dir/cmake_device_link.o
cd /dev/shm/mtml/src/GEOS/GEOS/build-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/coreComponents/unitTests/toolchain && /data/saet/mtml/software/x86_64/cmake-3.24.1-linux-x86_64/bin/cmake -E cmake_link_script CMakeFiles/testToolchain.dir/dlink.txt --verbose=1
/vend/nvidia/cuda/v12.2/bin/nvcc -forward-unknown-to-host-compiler -ccbin=/data/saet/mtml/software/x86_64/RHEL7/hpcx-v2.17-gcc-mlnx_ofed-redhat7-cuda12-x86_64/ompi/bin/mpic++ -restrict -arch sm_80 --expt-extended-lambda --expt-relaxed-constexpr -Werror cross-execution-space-call,reorder,deprecated-declarations  -g -lineinfo  -restrict -arch sm_80 --expt-extended-lambda --expt-relaxed-constexpr -Werror cross-execution-space-call,reorder,deprecated-declarations  -O3 -DNDEBUG -Xcompiler -DNDEBUG -Xcompiler -Ofast   --generate-code=arch=compute_80,code=[compute_80,sm_80] -Xcompiler=-fopenmp -Xcompiler=-L/vend/nvidia/cuda/v12.2/lib64 -Xlinker=-rpath -Xlinker=/data/saet/mtml/software/x86_64/RHEL7/hpcx-v2.17-gcc-mlnx_ofed-redhat7-cuda12-x86_64/ompi/lib -Xlinker=--enable-new-dtags -Xcompiler=-pthread -Xcompiler=-fPIC -Wno-deprecated-gpu-targets -shared -dlink CMakeFiles/testToolchain.dir/testToolchain.cpp.o -o CMakeFiles/testToolchain.dir/cmake_device_link.o   -L/vend/nvidia/cuda/v12.2/targets/x86_64-linux/lib/stubs  -L/vend/nvidia/cuda/v12.2/targets/x86_64-linux/lib  ../../../lib/libgtest_main.a ../../../lib/libgtest.a -lpthread ../../../lib/libphysicsSolvers.a ../../../lib/libdiscretizationMethods.a ../../../lib/libfieldSpecification.a ../../../lib/liblinearAlgebra.a ../../../lib/libdataRepository.a ../../../lib/libevents.a ../../../lib/libfileIO.a ../../../lib/libfiniteVolume.a  /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/hypre/lib/libHYPRE.a ../../../lib/libconstitutive.a ../../../lib/libmesh.a ../../../lib/libhdf5_interface.a /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/silo/lib/libsiloh5.a ../../../lib/libfunctions.a /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/mathpresso/lib/libmathpresso.a ../../../lib/libdenseLinearAlgebra.a ../../../lib/libPVTPackage.a /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/parmetis/lib/libparmetis.a /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/metis/lib/libmetis.a ../../../lib/libschema.a ../../../lib/libfiniteElement.a ../../../lib/libcodingUtilities.a ../../../lib/libcommon.a ../../../lib/liblvarray.a /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/pugixml/lib64/libpugixml.a /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/chai/lib/libchai.a /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/chai/lib/libumpire.a /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/raja/lib/libRAJA.a /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/raja/lib/libcamp.a /vend/nvidia/cuda/v12.2/lib64/libcudart_static.a /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/conduit/lib/libconduit_relay.a -lrt -lm /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/conduit/lib/libconduit_blueprint.a 
/data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/conduit/lib/libconduit.a /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/fmt/lib64/libfmt.a /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/adiak/lib/libadiak.a -ldl /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/scotch/lib/libptscotch.a /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/scotch/lib/libptscotcherr.a /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/scotch/lib/libscotch.a /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/scotch/lib/libscotcherr.a -lcudadevrt -lcudart_static -lmpi 
nvlink error   : Size doesn't match for '_ZN4geos13finiteElement18ImplicitKernelBaseINS_20CellElementSubRegionENS_12constitutive11PorousSolidINS3_16ElasticIsotropicEEENS0_25H1_Wedge_Lagrange1_Gauss6ELi3ELi3EE14StackVariablesC1Ev$567' in '../../../lib/libphysicsSolvers.a:PoromechanicsEFEMKernels_CellElementSubRegion_PorousSolid-ElasticIsotropic-_H1_Wedge_Lagrange1_Gauss6.cpp.o', first specified in '../../../lib/libphysicsSolvers.a:SolidMechanicsFixedStressThermoPoroElasticKernels_CellElementSubRegion_PorousSolid-ElasticIsotropic-_H1_Wedge_Lagrange1_Gauss6.cpp.o' (target: sm_80)
nvlink fatal   : merge_elf failed (target: sm_80)
make[3]: *** [coreComponents/unitTests/toolchain/CMakeFiles/testToolchain.dir/cmake_device_link.o] Error 1
make[3]: Leaving directory `/dev/shm/mtml/src/GEOS/GEOS/build-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo'
make[2]: *** [coreComponents/unitTests/toolchain/CMakeFiles/testToolchain.dir/all] Error 2
make[2]: Leaving directory `/dev/shm/mtml/src/GEOS/GEOS/build-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo'
make[1]: *** [coreComponents/unitTests/toolchain/CMakeFiles/testToolchain.dir/rule] Error 2
make[1]: Leaving directory `/dev/shm/mtml/src/GEOS/GEOS/build-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo'
make: *** [coreComponents/unitTests/toolchain/CMakeFiles/testToolchain.dir/rule] Error 2
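
For context on the nvlink message above: it reports that the same mangled device symbol (a StackVariables constructor) has different sizes in two object files of libphysicsSolvers.a, i.e. the device-link analogue of an ODR violation. A minimal, purely hypothetical illustration of how such a size mismatch can arise (not necessarily the actual GEOS root cause):

  // stack.hpp -- a shared header compiled into two translation units
  // with different preprocessor settings: the struct keeps one mangled
  // name but changes size, so the device linker sees two conflicting
  // definitions of the same symbol.
  struct StackVariables
  {
    double displacement[ 24 ];
  #if defined( EXTRA_DOFS )   // hypothetical flag set in only one TU
    double pressure[ 6 ];
  #endif
  };
  // a.cpp built with -DEXTRA_DOFS:  sizeof( StackVariables ) == 240
  // b.cpp built without:            sizeof( StackVariables ) == 192
  // nvlink then fails with: "Size doesn't match for '...StackVariables...'"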

drmichaeltcvx · Jan 11 '24 20:01

CUDA 11.7, 11.8 and 12.2 all have the same issue.

drmichaeltcvx · Jan 11 '24 20:01

I have already opened an issue on this: https://github.com/GEOS-DEV/GEOS/issues/2856.

Can we get the investigation started? We cannot build GPU GEOS anymore.

drmichaeltcvx · Jan 16 '24 20:01

Anyone volunteer to do this work?

rrsettgast · Feb 02 '24 23:02

Anyone volunteer to do this work?

I can take care of upgrading our CI Ubuntu builds. Let's decide exactly what we want, though.

For CPU builds we currently have:

  • Ubuntu (20.04, gcc 9.3.0, open-mpi 4.0.3)
  • Ubuntu debug (20.04, gcc 10.3.0, open-mpi 4.0.3) - github codespaces
  • Ubuntu (20.04, gcc 10.3.0, open-mpi 4.0.3) - github codespaces
  • Ubuntu (22.04, gcc 11.2.0, open-mpi 4.1.2)
  • Ubuntu (22.04, gcc 12.3.0, open-mpi 4.1.2)
  • Pecan CPU (centos 7.7, gcc 8.2.0, open-mpi 4.0.1, mkl 2019.5)
  • Pangea 2 (centos 7.6, gcc 8.3.0, open-mpi 2.1.5, mkl 2019.3)
  • Sherlock CPU (centos 7.9.2009, gcc 10.1.0, open-mpi 4.1.2, openblas 0.3.10)

Shall we remove Ubuntu 20 and keep only gcc 11 and newer, or do we want to keep an older version?

For GPU builds:

  • Ubuntu CUDA debug (20.04, clang 10.0.0 + gcc 9.4.0, open-mpi 4.0.3, cuda-11.8.89)
  • Ubuntu CUDA (20.04, clang 10.0.0 + gcc 9.4.0, open-mpi 4.0.3, cuda-11.8.89)
  • Centos (7.7, gcc 8.3.1, open-mpi 1.10.7, cuda 11.8.89)
  • Pecan GPU (centos 7.7, gcc 8.2.0, open-mpi 4.0.1, mkl 2019.5, cuda 11.5.119)

Do we want to fully move to cuda12? I can check what images are available, but we can probably bump up the OS version and the compiler.

CusiniM · Feb 02 '24 23:02

For information, Pangea 2 should be retired within the next few weeks. 🤞 @sframba @jeannepellerin, what are our gcc requirements on P3/P4?

Do we want to fully move to cuda12?

I'd be surprised if this were possible given all the cluster constraints. @jeannepellerin @sframba @matteofrigo5 @drmichaeltcvx ?

TotoGaz · Feb 03 '24 00:02

We would like to add our CVX configurations for GPU builds to the CI environment. We are using NVIDIA A100 GPU hardware and we are on RHEL 7.9.

Can we get an introductory walkthrough of your CI environment?

drmichaeltcvx · Feb 06 '24 22:02

What are our gcc requirements on P3/P4?

On P3 we are using gcc 8.4.1 (I know, it's old), and on P4 we use gcc 12.1.

I'd be surprised this is something possible w.r.t. all the cluster constraints.

We will have to check compatibility with the IBM drivers on P3. Let me know if you want to pursue CUDA 12 on P3 in the short term; I can ask whether IBM support would be available to help.

sframba · Apr 09 '24 19:04