
Update CI platforms and compilers

Open · rrsettgast opened this issue 1 year ago • 12 comments

We need to update CI platforms and compilers.

  • On LLNL/Quartz we have:

    • gcc12
    • clang14
  • On LLNL/Lassen we have:

    • gcc8
    • clang14
    • cuda10
    • cuda11
    • cuda12
  • On TotalEnergies/Pangea3 we have:

    • gcc8
    • gcc9
    • cuda10
    • cuda11
  • On Frontier we have:

    • clang??
    • rocm??
  • On ElCap we will have:

    • clang16??
    • rocm ??

Proposed Permutations:

  • ubuntu22

    • gcc11
    • gcc11 + cuda11
    • clang14
    • clang14 + cuda11
  • TOSS4 (built on RHEL 8.8)

    • Which Linux distribution? ubi8.8?
    • gcc12
    • gcc12 + cuda12
    • clang15
    • clang15 + cuda12
  • TotalEnergies/Cypress

    • gcc8
    • gcc10
    • gcc12
    • cuda10
    • cuda11
    • cuda12

rrsettgast · Dec 21 '23 18:12

Chevron is currently using

  • GCC 11.2 and 11.4 (well tested and broadly used)

  • GCC 13.2 (not as extensively tested but no build or run failures so far)

  • OpenMPI HPC-X (v14.1 in broad use and some v17.1)

For GPU GEOS we have been using

  • CUDA 11.2, 11.4: GEOS stopped building with GCC 11.x past commit 95aea4cb2 (we'll be testing with GCC 12.x)
  • HPC-X v14.1 (mostly)

drmichaeltcvx · Jan 11 '24 15:01

For GPU GEOS we have been using

  • CUDA 11.2, 11.4: GEOS stopped building with GCC 11.x past commit 95aea4c (we'll be testing with GCC 12.x)

I bet there's an issue there

  GEOS_HOST_DEVICE
  virtual real64 getShearModulus( localIndex const k ) const override final
  {
    return std::max( std::max( m_c44[k], m_c55[k] ), m_c66[k] );
  }

You can't call a std function on device, so it's normal that it fails. This should not have been merged; you should use LvArray::math::max instead, as sketched below. See e.g. https://github.com/GEOS-DEV/GEOS/pull/2927. @CusiniM, maybe we should be stricter in our review process? I also do not understand how it got through the CI. Maybe some over-relaxed compilation parameters?
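
A minimal sketch of the intended fix, mirroring the snippet above (it assumes LvArray::math::max is in scope via the usual GEOS headers; the include is omitted here, as in the original excerpt):

  GEOS_HOST_DEVICE
  virtual real64 getShearModulus( localIndex const k ) const override final
  {
    // LvArray::math::max is callable from both host and device code,
    // unlike std::max, which has no __device__ overload under nvcc.
    return LvArray::math::max( LvArray::math::max( m_c44[k], m_c55[k] ), m_c66[k] );
  }

As for how this passed CI: one plausible explanation is that std::max has been constexpr since C++14, and the build lines later in this thread pass --expt-relaxed-constexpr, which lets device code call constexpr host functions, so nvcc can accept the call without a diagnostic.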

TotoGaz · Jan 11 '24 18:01

For GPU GEOS we have been using

  • CUDA 11.2, 11.4: GEOS stopped building with GCC 11.x past commit 95aea4c (we'll be testing with GCC 12.x)

I bet there's an issue there

  GEOS_HOST_DEVICE
  virtual real64 getShearModulus( localIndex const k ) const override final
  {
    return std::max( std::max( m_c44[k], m_c55[k] ), m_c66[k] );
  }

You can't call a std function on device, so it's normal that it fails. This should not have been merged; you should use LvArray::math::max instead. See e.g. #2927. @CusiniM, maybe we should be stricter in our review process? I also do not understand how it got through the CI. Maybe some over-relaxed compilation parameters?

How it passed the CI beats me...

CusiniM · Jan 11 '24 19:01

For GPU GEOS we have been using

  • CUDA 11.2, 11.4: GEOS stopped building with GCC 11.x past commit 95aea4c (we'll be testing with GCC 12.x)

I bet there's an issue there

  GEOS_HOST_DEVICE
  virtual real64 getShearModulus( localIndex const k ) const override final
  {
    return std::max( std::max( m_c44[k], m_c55[k] ), m_c66[k] );
  }

You can't call a std function on device, so it's normal that it fails. This should not have been merged; you should use LvArray::math::max instead. See e.g. #2927. @CusiniM, maybe we should be stricter in our review process? I also do not understand how it got through the CI. Maybe some over-relaxed compilation parameters?

Thanks, Thomas, but those specific calls seem to have already been fixed in https://github.com/GEOS-DEV/GEOS/pull/2812; the build still fails for Michael, though.

paveltomin · Jan 11 '24 20:01

Here is where the build process (host compilers GCC 11.x and 12.x) fails, at the link stage:

Consolidate compiler generated dependencies of target testToolchain
make[3]: Leaving directory `/dev/shm/mtml/src/GEOS/GEOS/build-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo'
make  -f coreComponents/unitTests/toolchain/CMakeFiles/testToolchain.dir/build.make coreComponents/unitTests/toolchain/CMakeFiles/testToolchain.dir/build
make[3]: Entering directory `/dev/shm/mtml/src/GEOS/GEOS/build-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo'
[100%] Linking CUDA device code CMakeFiles/testToolchain.dir/cmake_device_link.o
cd /dev/shm/mtml/src/GEOS/GEOS/build-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/coreComponents/unitTests/toolchain && /data/saet/mtml/software/x86_64/cmake-3.24.1-linux-x86_64/bin/cmake -E cmake_link_script CMakeFiles/testToolchain.dir/dlink.txt --verbose=1
/vend/nvidia/cuda/v12.2/bin/nvcc -forward-unknown-to-host-compiler -ccbin=/data/saet/mtml/software/x86_64/RHEL7/hpcx-v2.17-gcc-mlnx_ofed-redhat7-cuda12-x86_64/ompi/bin/mpic++ -restrict -arch sm_80 --expt-extended-lambda --expt-relaxed-constexpr -Werror cross-execution-space-call,reorder,deprecated-declarations  -g -lineinfo  -restrict -arch sm_80 --expt-extended-lambda --expt-relaxed-constexpr -Werror cross-execution-space-call,reorder,deprecated-declarations  -O3 -DNDEBUG -Xcompiler -DNDEBUG -Xcompiler -Ofast   --generate-code=arch=compute_80,code=[compute_80,sm_80] -Xcompiler=-fopenmp -Xcompiler=-L/vend/nvidia/cuda/v12.2/lib64 -Xlinker=-rpath -Xlinker=/data/saet/mtml/software/x86_64/RHEL7/hpcx-v2.17-gcc-mlnx_ofed-redhat7-cuda12-x86_64/ompi/lib -Xlinker=--enable-new-dtags -Xcompiler=-pthread -Xcompiler=-fPIC -Wno-deprecated-gpu-targets -shared -dlink CMakeFiles/testToolchain.dir/testToolchain.cpp.o -o CMakeFiles/testToolchain.dir/cmake_device_link.o   -L/vend/nvidia/cuda/v12.2/targets/x86_64-linux/lib/stubs  -L/vend/nvidia/cuda/v12.2/targets/x86_64-linux/lib  ../../../lib/libgtest_main.a ../../../lib/libgtest.a -lpthread ../../../lib/libphysicsSolvers.a ../../../lib/libdiscretizationMethods.a ../../../lib/libfieldSpecification.a ../../../lib/liblinearAlgebra.a ../../../lib/libdataRepository.a ../../../lib/libevents.a ../../../lib/libfileIO.a ../../../lib/libfiniteVolume.a  /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/hypre/lib/libHYPRE.a ../../../lib/libconstitutive.a ../../../lib/libmesh.a ../../../lib/libhdf5_interface.a /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/silo/lib/libsiloh5.a ../../../lib/libfunctions.a /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/mathpresso/lib/libmathpresso.a ../../../lib/libdenseLinearAlgebra.a ../../../lib/libPVTPackage.a /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/parmetis/lib/libparmetis.a /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/metis/lib/libmetis.a ../../../lib/libschema.a ../../../lib/libfiniteElement.a ../../../lib/libcodingUtilities.a ../../../lib/libcommon.a ../../../lib/liblvarray.a /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/pugixml/lib64/libpugixml.a /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/chai/lib/libchai.a /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/chai/lib/libumpire.a /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/raja/lib/libRAJA.a /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/raja/lib/libcamp.a /vend/nvidia/cuda/v12.2/lib64/libcudart_static.a /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/conduit/lib/libconduit_relay.a -lrt -lm /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/conduit/lib/libconduit_blueprint.a 
/data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/conduit/lib/libconduit.a /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/fmt/lib64/libfmt.a /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/adiak/lib/libadiak.a -ldl /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/scotch/lib/libptscotch.a /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/scotch/lib/libptscotcherr.a /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/scotch/lib/libscotch.a /data/saet/mtml/software/x86_64/RHEL7/GEOSTPL/0.2.0/install-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo/scotch/lib/libscotcherr.a -lcudadevrt -lcudart_static -lmpi 
nvlink error   : Size doesn't match for '_ZN4geos13finiteElement18ImplicitKernelBaseINS_20CellElementSubRegionENS_12constitutive11PorousSolidINS3_16ElasticIsotropicEEENS0_25H1_Wedge_Lagrange1_Gauss6ELi3ELi3EE14StackVariablesC1Ev$567' in '../../../lib/libphysicsSolvers.a:PoromechanicsEFEMKernels_CellElementSubRegion_PorousSolid-ElasticIsotropic-_H1_Wedge_Lagrange1_Gauss6.cpp.o', first specified in '../../../lib/libphysicsSolvers.a:SolidMechanicsFixedStressThermoPoroElasticKernels_CellElementSubRegion_PorousSolid-ElasticIsotropic-_H1_Wedge_Lagrange1_Gauss6.cpp.o' (target: sm_80)
nvlink fatal   : merge_elf failed (target: sm_80)
make[3]: *** [coreComponents/unitTests/toolchain/CMakeFiles/testToolchain.dir/cmake_device_link.o] Error 1
make[3]: Leaving directory `/dev/shm/mtml/src/GEOS/GEOS/build-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo'
make[2]: *** [coreComponents/unitTests/toolchain/CMakeFiles/testToolchain.dir/all] Error 2
make[2]: Leaving directory `/dev/shm/mtml/src/GEOS/GEOS/build-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo'
make[1]: *** [coreComponents/unitTests/toolchain/CMakeFiles/testToolchain.dir/rule] Error 2
make[1]: Leaving directory `/dev/shm/mtml/src/GEOS/GEOS/build-GPU-Hypre-GCC-CUDA_12.2-ompi_hpcx-OMP-relwithdebinfo'
make: *** [coreComponents/unitTests/toolchain/CMakeFiles/testToolchain.dir/rule] Error 2
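
For context on the nvlink message above: it reports that the same mangled device symbol (a StackVariables constructor) has different sizes in two object files of libphysicsSolvers.a, i.e. the device-link analogue of an ODR violation. A minimal, purely hypothetical illustration of how such a size mismatch can arise (not necessarily the actual GEOS root cause):

  // stack.hpp -- a shared header compiled into two translation units
  // with different preprocessor settings: the struct keeps one mangled
  // name but changes size, so the device linker sees two conflicting
  // definitions of the same symbol.
  struct StackVariables
  {
    double displacement[ 24 ];
  #if defined( EXTRA_DOFS )   // hypothetical flag set in only one TU
    double pressure[ 6 ];
  #endif
  };
  // a.cpp built with -DEXTRA_DOFS:  sizeof( StackVariables ) == 240
  // b.cpp built without:            sizeof( StackVariables ) == 192
  // nvlink then fails with: "Size doesn't match for '...StackVariables...'"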

drmichaeltcvx · Jan 11 '24 20:01

CUDA 11.7, 11.8 and 12.2 all have the same issue.

drmichaeltcvx · Jan 11 '24 20:01

I have already opened an issue on this: https://github.com/GEOS-DEV/GEOS/issues/2856.

Can we get the investigation started? We cannot build GPU GEOS anymore.

drmichaeltcvx · Jan 16 '24 20:01

Anyone volunteer to do this work?

rrsettgast · Feb 02 '24 23:02

Anyone volunteer to do this work?

I can take care of upgrading our CI Ubuntu builds. Let's decide exactly what we want, though.

For CPU builds we currently have:

  • Ubuntu (20.04, gcc 9.3.0, open-mpi 4.0.3)
  • Ubuntu debug (20.04, gcc 10.3.0, open-mpi 4.0.3) - github codespaces
  • Ubuntu (20.04, gcc 10.3.0, open-mpi 4.0.3) - github codespaces
  • Ubuntu (22.04, gcc 11.2.0, open-mpi 4.1.2)
  • Ubuntu (22.04, gcc 12.3.0, open-mpi 4.1.2)
  • Pecan CPU (centos 7.7, gcc 8.2.0, open-mpi 4.0.1, mkl 2019.5)
  • Pangea 2 (centos 7.6, gcc 8.3.0, open-mpi 2.1.5, mkl 2019.3)
  • Sherlock CPU (centos 7.9.2009, gcc 10.1.0, open-mpi 4.1.2, openblas 0.3.10)

Shall we remove Ubuntu 20 and keep only gcc 11 and newer, or do we want to keep an older version?

For GPU builds:

  • Ubuntu CUDA debug (20.04, clang 10.0.0 + gcc 9.4.0, open-mpi 4.0.3, cuda-11.8.89)
  • Ubuntu CUDA (20.04, clang 10.0.0 + gcc 9.4.0, open-mpi 4.0.3, cuda-11.8.89)
  • Centos (7.7, gcc 8.3.1, open-mpi 1.10.7, cuda 11.8.89)
  • Pecan GPU (centos 7.7, gcc 8.2.0, open-mpi 4.0.1, mkl 2019.5, cuda 11.5.119)

Do we want to fully move to cuda12? I can check what images are available, but we can probably bump up the OS version and the compiler.

CusiniM · Feb 02 '24 23:02

For information, Pangea 2 should be retired within the next few weeks. 🤞 @sframba @jeannepellerin, what are our gcc requirements on P3/P4?

Do we want to fully move to cuda12?

I'd be surprised if this were possible given all the cluster constraints. @jeannepellerin @sframba @matteofrigo5 @drmichaeltcvx ?

TotoGaz · Feb 03 '24 00:02

We would like to add our CVX configurations for GPU builds to the CI environment. We are using NVIDIA A100 GPU hardware and we are on RHEL 7.9.

Can we get an introductory walkthrough of your CI environment?

drmichaeltcvx · Feb 06 '24 22:02

What are our gcc requirements on P3/P4?

On P3 we are using gcc 8.4.1 (I know, it's old), and on P4 we use gcc 12.1.

I'd be surprised this is something possible w.r.t. all the cluster constraints.

We will have to check compatibility with the IBM drivers on P3. Let me know if you want to pursue CUDA 12 on P3 in the short term; I can ask whether IBM support would be available to help.

sframba · Apr 09 '24 19:04