
Bug: Possible bugs in the toolchain of cp2k version 2023.2

Open IronCub3 opened this issue 1 year ago • 10 comments

Hi everyone, as suggested in issue #2968, and at the request of one of my customers, I tried to compile cp2k version 2023.2, and I think I have found some bugs in the toolchain here as well. Before describing the bugs I ran into, here are the specs of my machine and the software used to run the toolchain.

Hw specifics

CPU: Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
Core(s): 20
Socket(s): 2
Core(s) per socket: 10
Thread(s) per core: 1 (Hyperthreading is disabled)
GPU: 4 x Nvidia Tesla V100-sxm2-16gb

Sw specifics

OS: Rocky Linux release 8.8 (Green Obsidian)
GCC Compiler: gcc-8.5.0 --> OS package
CMAKE: cmake-3.20.2 --> Installed by the toolchain
OpenMPI: ompi-4.1.4 --> Compiled with: gcc-8.5.0 + cuda-12.0.0 + hwloc-2.9.0 + pmix-4.2.2 + gdrcopy-2.3.1 + ucx-1.14.1 + ucc-1.2.0
FFTW: fftw-3.3.10 --> Compiled with: gcc-8.5.0 (single precision)
Plumed: plumed-2.9.0 --> Compiled with: gcc-8.5.0 + ompi-4.1.4 (the version above) + libtorch2.0.1noabi
CUDA: cuda-12.0.0 --> Installed from the official Nvidia website
Nvidia driver version: 525.60.13

Command to build the Toolchain

./install_cp2k_toolchain.sh -j 20 --no-check-certificate --mpi-mode=openmpi --math-mode=openblas --gpu-ver=V100 --libint-lmax=5 --log-lines=200 --enable-cuda=yes --enable-hip=no --enable-opencl=no --with-gcc=system --with-cmake=install --with-openmpi=system --with-libxc=install --with-libint=install --with-fftw=system --with-openblas=install --with-scalapack=install --with-libxsmm=install --with-elpa=install --with-cusolvermp=install --with-ptscotch=install --with-superlu=install --with-pexsi=install --with-quip=install --with-plumed=system --with-sirius=install --with-gsl=install --with-libvdwxc=install --with-spglib=install --with-hdf5=install --with-spfft=install --with-spla=install --with-cosma=install --with-libvori=install --with-libtorch=system | tee install-cp2k-2023.2-out.txt

Bugs

  1. Stage 5 - ELPA: The build only succeeds if I specify where the CUDA libs are (even though they are correctly loaded by the environment module) with these environment variables:
export CUDA_PATH=/my-path/libs/nvidia/cuda-12.0.0
export CUDA_HOME=/my-path/libs/nvidia/cuda-12.0.0

Otherwise I get this error:

configure: error: Could not link cublas; try to set the cuda-path or disable Nvidia GPU support
  2. Stage 5 - SuperLU: While the toolchain was compiling SuperLU with the make command, the linking phase returned this error:
[ 90%] Linking CXX executable pzdrive2_ABglobal
[ 91%] Linking CXX executable pzdrive_ABglobal
[ 92%] Linking CXX executable pzdrive3_ABglobal
../SRC/libsuperlu_dist.a(TreeInterface.cpp.o): In function `MPI::Op::Init(void (*)(void const*, void*, int, MPI::Datatype const&), bool)':
/my-path/ompi-4.1.4_nccl_pbs_ucx14_gdr/include/openmpi/ompi/mpi/cxx/op_inln.h:121: undefined reference to `ompi_mpi_cxx_op_intercept'
../SRC/libsuperlu_dist.a(TreeInterface.cpp.o): In function `MPI::Intracomm::Clone() const':
/my-path/ompi-4.1.4_nccl_pbs_ucx14_gdr/include/openmpi/ompi/mpi/cxx/intracomm_inln.h:23: undefined reference to `MPI::Comm::Comm()'
../SRC/libsuperlu_dist.a(TreeInterface.cpp.o): In function `MPI::Graphcomm::Clone() const':
/my-path/ompi-4.1.4_nccl_pbs_ucx14_gdr/include/openmpi/ompi/mpi/cxx/intracomm.h:25: undefined reference to `MPI::Comm::Comm()'
../SRC/libsuperlu_dist.a(TreeInterface.cpp.o): In function `MPI::Cartcomm::Sub(bool const*) const':
/my-path/ompi-4.1.4_nccl_pbs_ucx14_gdr/include/openmpi/ompi/mpi/cxx/intracomm.h:25: undefined reference to `MPI::Comm::Comm()'
../SRC/libsuperlu_dist.a(TreeInterface.cpp.o): In function `MPI::Intracomm::Create_graph(int, int const*, int const*, bool) const':
/my-path/ompi-4.1.4_nccl_pbs_ucx14_gdr/include/openmpi/ompi/mpi/cxx/intracomm.h:25: undefined reference to `MPI::Comm::Comm()'
../SRC/libsuperlu_dist.a(TreeInterface.cpp.o): In function `MPI::Cartcomm::Clone() const':
/my-path/ompi-4.1.4_nccl_pbs_ucx14_gdr/include/openmpi/ompi/mpi/cxx/intracomm.h:25: undefined reference to `MPI::Comm::Comm()'
../SRC/libsuperlu_dist.a(TreeInterface.cpp.o):/my-path/ompi-4.1.4_nccl_pbs_ucx14_gdr/include/openmpi/ompi/mpi/cxx/intracomm_inln.h:23: more undefined references to `MPI::Comm::Comm()' follow
../SRC/libsuperlu_dist.a(TreeInterface.cpp.o):(.data.rel.ro._ZTVN3MPI8DatatypeE[_ZTVN3MPI8DatatypeE]+0x78): undefined reference to `MPI::Datatype::Free()'
../SRC/libsuperlu_dist.a(TreeInterface.cpp.o):(.data.rel.ro._ZTVN3MPI3WinE[_ZTVN3MPI3WinE]+0x48): undefined reference to `MPI::Win::Free()'
collect2: error: ld returned 1 exit status

(This is the first linking error, but the log continues; let me know if you need the full log.)

The error doesn't go away if I use these variables (which fixed this problem in version 9.1):

Variables used in the old version:

export OMPI_CC=gcc
export OMPI_CXX=g++ 
export OMPI_FC=gfortran
export CC=mpicc
export CXX=mpic++
export FC=mpif90
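
For the ELPA item (bug 1) above, a small pre-flight check can catch a missing CUDA root before the toolchain reaches Stage 5. The following is only a sketch: `check_cuda_env` is a hypothetical helper, not part of the cp2k toolchain, and it assumes that `CUDA_PATH` and `CUDA_HOME` are the only two variables that matter, as in the workaround described above.

```shell
# Sketch: verify that the CUDA root variables are exported and point to a directory.
# check_cuda_env is a hypothetical helper, not part of the cp2k toolchain.
check_cuda_env() {
  status=0
  for var in CUDA_PATH CUDA_HOME; do
    # Read the variable whose name is stored in $var (POSIX-compatible indirection).
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "$var is not set"
      status=1
    elif [ ! -d "$value" ]; then
      echo "$var=$value is not a directory"
      status=1
    fi
  done
  return "$status"
}
```

It could be called as `check_cuda_env || exit 1` right before `./install_cp2k_toolchain.sh`, so that the problem shows up immediately instead of as a cublas link error in the ELPA configure step.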

Extra:

How can I tell the installer to install cp2k in a custom path rather than in the same directory, like --prefix=/my/custom/path for cmake? (I tried using the variable CURRENT_DIR, but it didn't work.)

IronCub3 avatar Sep 12 '23 11:09 IronCub3

Does the set of MPI_LIBS (or OPENMPI_LIBS) include -lmpi_cxx? You may check tools/toolchain/install/toolchain.env for this. If I am not mistaken, there is no test using the OpenMPI installation from the host system.

mkrack avatar Sep 13 '23 08:09 mkrack

Yes, it does. In the file toolchain.env both variables are declared with the value:

declare -x MPI_LIBS=" -lmpi_cxx -lmpi"
declare -x OPENMPI_LIBS=" -lmpi_cxx -lmpi"
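
This check can also be scripted. The sketch below assumes the `declare -x NAME="..."` line format quoted above; `has_mpi_cxx` is a hypothetical helper name, not something the toolchain provides.

```shell
# Sketch: succeed only if MPI_LIBS or OPENMPI_LIBS in the given file lists -lmpi_cxx.
# has_mpi_cxx is a hypothetical helper name; the "declare -x" format is the one quoted above.
has_mpi_cxx() {
  grep -E '^declare -x (MPI_LIBS|OPENMPI_LIBS)=' "$1" | grep -q -e '-lmpi_cxx'
}
```

For example, `has_mpi_cxx tools/toolchain/install/toolchain.env && echo "-lmpi_cxx is listed"`. Note that this only shows what the toolchain recorded; it does not prove that a package's own CMake build puts the flag on its link line.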

IronCub3 avatar Sep 13 '23 09:09 IronCub3

This is the script used to install superlu. Namely, the cmake call is:

https://github.com/cp2k/cp2k/blob/04733487439f7707bd3aa034d082d3f14da973d7/tools/toolchain/scripts/stage5/install_superlu.sh#L46

The assumption here is that the env variables are injected to get the proper OpenMPI, i.e.

export CC=mpicc
export CXX=mpic++
export FC=mpif90

Could you inspect the cmake.log file and check if it is taking the right OpenMPI wrappers and flags?

alazzaro avatar Sep 13 '23 09:09 alazzaro

Hi @alazzaro, thanks for the reply (and thank you too, @mkrack). As you suggested, I tried compiling with the variables you mentioned:

export CC=mpicc
export CXX=mpic++
export FC=mpif90

But the error is the same. I also checked that my OpenMPI libs contain the definitions the linker complains about. For example, the first symbol it doesn't find is ompi_mpi_cxx_op_intercept, but by doing this:

[bashuser]$ nm libmpi_cxx.so | grep ompi_mpi_cxx_op_intercept
0000000000013870 T ompi_mpi_cxx_op_intercept

The command above shows that the symbol is defined in this library, and the same holds for all the other missing definitions.
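
Checking every missing symbol by hand gets tedious, so here is a rough sketch of how it could be automated. The helper names `undefined_symbols` and `check_symbols` are hypothetical, and `check_symbols` assumes GNU `nm` from binutils is available.

```shell
# Sketch: list the unique symbols named in "undefined reference to `sym'" linker messages.
# undefined_symbols and check_symbols are hypothetical helper names.
undefined_symbols() {
  sed -n "s/.*undefined reference to \`\(.*\)'\$/\1/p" "$1" | sort -u
}

# For each such symbol, report whether the shared library defines it.
# Assumes GNU nm (binutils); -D reads the dynamic symbol table, -C demangles C++ names.
check_symbols() {
  log=$1
  lib=$2
  undefined_symbols "$log" | while IFS= read -r sym; do
    if nm -D -C --defined-only "$lib" | grep -q -F -e "$sym"; then
      echo "found:   $sym"
    else
      echo "missing: $sym"
    fi
  done
}
```

With the linker output saved to a file, `check_symbols make.log libmpi_cxx.so` would print one line per undefined symbol (the log file name here is just an example). If every symbol is reported as found, the library itself is fine and the problem is more likely that it is missing from the link command.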

The complete output of the cmake.log is this:

-- The C compiler identification is GNU 8.5.0
-- The CXX compiler identification is GNU 8.5.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/gcc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/g++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done

Process XSDK defaults ...
USE_XSDK_DEFAULTS = 'FALSE'
-- SuperLU_DIST will be built as a static library.
-- The Fortran compiler identification is GNU 8.5.0
-- Detecting Fortran compiler ABI info
-- Detecting Fortran compiler ABI info - done
-- Check for working Fortran compiler: /usr/bin/gfortran - skipped
-- Found MPI_C: /usr/lib64/libxpmem.so (found version "3.1")
-- Found MPI_CXX: /usr/lib64/libxpmem.so (found version "3.1")
-- Found MPI_Fortran: /usr/lib64/libxpmem.so (found version "3.1")
-- Found MPI: TRUE (found version "3.1")
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP_Fortran: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- OpenMP_EXE_LINKER_FLAGS=''
-- CMAKE_EXE_LINKER_FLAGS=' -Wl,-rpath -Wl,/my-path/ompi-4.1.4_nccl_pbs_ucx14_gdr/lib -Wl,-rpath -Wl,/my-path/hwloc-2.9.0/lib -Wl,-rpath -Wl,/my-path/ucx-1.14.1_gdr/lib -Wl,-rpath -Wl,/my-path/pmix-4.2.2/lib -Wl,-rpath -Wl,/my-path/pbs/lib -Wl,-rpath -Wl,/my-path/mellanox/hcoll/lib -Wl,-rpath -Wl,/my-path/ucc-1.2.0_ucx-gdr/lib -Wl,--enable-new-dtags -L/my-path/hwloc-2.9.0/lib -L/my-path/ucx-1.14.1_gdr/lib -L/my-path/pmix-4.2.2/lib -L/my-path/pbs/lib -L/my-path/mellanox/hcoll/lib -L/my-path/ucc-1.2.0_ucx-gdr/lib -pthread -fopenmp'
-- Looking for Fortran sgemm
-- Looking for Fortran sgemm - not found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Looking for Fortran sgemm
-- Looking for Fortran sgemm - found
-- Found BLAS: /my-path/cp2k-2023.2/tools/toolchain/install/openblas-0.3.23/lib/libopenblas.so
-- Using TPL_BLAS_LIBRARIES='/my-path/cp2k-2023.2/tools/toolchain/install/openblas-0.3.23/lib/libopenblas.so'
-- Looking for Fortran cheev
-- Looking for Fortran cheev - found
-- Found LAPACK: /my-path/cp2k-2023.2/tools/toolchain/install/openblas-0.3.23/lib/libopenblas.so;-lm;-ldl
-- Using TPL_LAPACK_LIBRARIES='/my-path/cp2k-2023.2/tools/toolchain/install/openblas-0.3.23/lib/libopenblas.so;-lm;-ldl'
-- Will not link with ParMETIS.
-- Will not link with CombBLAS.
-- Configuring done (96.4s)
-- Generating done (2.1s)
-- Build files have been written to: /my-path/cp2k-2023.2/tools/toolchain/build/superlu_dist-6.1.0/build

My only doubt about this log is the variable OpenMP_EXE_LINKER_FLAGS='', which isn't set; maybe it is needed for linking the libs, so I can give it a try. Also, if you notice something I've missed in the log, let me know.

IronCub3 avatar Sep 13 '23 11:09 IronCub3

Thanks for the logs. I wonder if by setting CMAKE_EXE_LINKER_FLAGS you overwrite the other MPI linker flags. The OpenMP variable is unrelated in this context. Another test would be to change the installation script and add VERBOSE=1 to the make line. Then we should see which command is giving the error. Finally, you can compile the library by yourself (without the toolchain) and pass it to the toolchain...

alazzaro avatar Sep 13 '23 11:09 alazzaro

Here I am with more info. I was able to override CMAKE_EXE_LINKER_FLAGS to add the OpenMPI lib path using the string -Wl,-rpath -Wl,/my-path/ompi-4.1.4_nccl_pbs_ucx14_gdr/lib, but I was unable to override the variable OpenMP_EXE_LINKER_FLAGS (I don't know why it doesn't read that variable). Even with these changes the error persists, and in the verbose output of the make command the first command to fail is this one:

/usr/bin/g++ -fopenmp -fexceptions -pthread -std=c++11 -O2 -fPIC -fno-omit-frame-pointer -fopenmp -g -march=native -mtune=native  -Wl,-rpath -Wl,/my-path/ompi-4.1.4_nccl_pbs_ucx14_gdr/lib -Wl,-rpath -Wl,/my-path/hwloc-2.9.0/lib -Wl,-rpath -Wl,/my-path/ucx-1.14.1_gdr/lib -Wl,-rpath -Wl,/my-path/pmix-4.2.2/lib -Wl,-rpath -Wl,/my-path/pbs/lib -Wl,-rpath -Wl,/my-path/mellanox/hcoll/lib -Wl,-rpath -Wl,/my-path/ucc-1.2.0_ucx-gdr/lib -Wl,--enable-new-dtags -L/my-paht/hwloc-2.9.0/lib -L/my-path/ucx-1.14.1_gdr/lib -L/my-path/pmix-4.2.2/lib -L/my-path/pbs/lib -L/my-path/mellanox/hcoll/lib -L/my-path/ucc-1.2.0_ucx-gdr/lib -pthread -fopenmp -rdynamic CMakeFiles/pzdrive3_ABglobal.dir/pzdrive3_ABglobal.c.o -o pzdrive3_ABglobal  ../SRC/libsuperlu_dist.a -lopenblas -lm -lxpmem -lmpi -lopenblas -lm -ldl -lm
../SRC/libsuperlu_dist.a(TreeInterface.cpp.o): In function `MPI::Op::Init(void (*)(void const*, void*, int, MPI::Datatype const&), bool)':
/my-path/ompi-4.1.4_nccl_pbs_ucx14_gdr/include/openmpi/ompi/mpi/cxx/op_inln.h:121: undefined reference to `ompi_mpi_cxx_op_intercept'
../SRC/libsuperlu_dist.a(TreeInterface.cpp.o): In function `MPI::Intracomm::Clone() const':

## ..... the undefined reference errors continue

I was leaving the manual compilation as a last resort because I thought it would be more interesting, and better, to fix the toolchain. But as you suggested, and until a solution comes up, I think I will try to compile SuperLU myself. If I find a solution I will let you know here, and if anyone else has other ideas, they are welcome. Thanks again!

IronCub3 avatar Sep 13 '23 14:09 IronCub3

OK, so it uses g++ instead of mpic++, but I think this is fine. Then you are linking:

-lopenblas -lm -lxpmem -lmpi -lopenblas -lm -ldl -lm

so it is missing the C++ MPI library. I assume cmake doesn't include it, so you have to force it. Please note that this C++ binding (which is now deprecated) is requested by OpenMPI itself. I really suggest compiling the library outside the toolchain.
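
The same check can be done mechanically on any link command taken from the verbose make output. This is only a sketch, and `links_library` is a hypothetical helper name; it matches the flag as a whole space-separated word, so `-lmpi` does not count as `-lmpi_cxx`.

```shell
# Sketch: succeed if a link command contains the exact flag -l<name>.
# links_library is a hypothetical helper name.
links_library() {
  name=$1
  command_line=$2
  # Pad with spaces so the flag is matched as a whole word.
  case " $command_line " in
    *" -l$name "*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For the library list above, `links_library mpi "-lopenblas -lm -lxpmem -lmpi -lopenblas -lm -ldl -lm"` succeeds, while the same call with `mpi_cxx` fails, which matches the undefined references to the `MPI::` symbols.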

alazzaro avatar Sep 13 '23 15:09 alazzaro

Here I am with some news! After working through several bugs in the toolchain I managed to get the arch files I needed, but now when I run the command: make -j 20 VERBOSE=1 ARCH=local_cuda VERSION="ssmp sdbg psmp pdbg" 2>&1 | tee make.log it fails after a few steps with this error:

/usr/bin/gcc -c -fno-omit-frame-pointer -fopenmp -g -mtune=native  -O1     -I'/my-path/cp2k-2023.2/tools/toolchain/install/openblas-0.3.23/include' -I'/my-path/fftw-3.3.10/include' -I'/my-path/cp2k-2023.2/tools/toolchain/install/libint-v2.6.0-cp2k-lmax-5/include' -I'/my-path/cp2k-2023.2/tools/toolchain/install/libxc-6.2.2/include' -I'/my-path/cp2k-2023.2/tools/toolchain/install/libxsmm-1.17/include' -I'/my-path/cp2k-2023.2/tools/toolchain/install/COSMA-2.6.6/include'     -I'/my-path/cp2k-2023.2/tools/toolchain/install/quip-0.9.10/include' -I'/my-path/cp2k-2023.2/tools/toolchain/install/gsl-2.7/include' -I/my-path/cp2k-2023.2/tools/toolchain/install/hdf5-1.12.0/include  -I/my-path/cp2k-2023.2/tools/toolchain/install/spglib-1.16.2/include -I'/my-path/cp2k-2023.2/tools/toolchain/install/SpFFT-1.0.6/include' -I'/my-path/cp2k-2023.2/tools/toolchain/install/SpLA-1.5.5/include/spla'  -std=c11 -Wall -Wextra -Werror -Wno-vla-parameter -Wno-deprecated-declarations -D__OFFLOAD_CUDA -D__DBCSR_ACC -D__OFFLOAD_PROFILING -D__CUSOLVERMP -D__LIBXSMM   -D__FFTW3  -D__LIBINT -D__LIBXC     -D__QUIP     -D__SPGLIB -D__LIBVORI -D__LIBTORCH   -D__OFFLOAD_GEMM    -D__HAS_IEEE_EXCEPTIONS -D__CHECK_DIAG  -I/my-path/nvidia/cuda-12.0.0/include /my-path/cp2k-2023.2/src/fm/cp_fm_cusolver.c
/my-path/cp2k-2023.2/src/fm/cp_fm_cusolver.c:12:10: fatal error: cal.h: No such file or directory
 #include <cal.h>
          ^~~~~~~
compilation terminated.
make[3]: *** [/my-path/cp2k-2023.2/Makefile:518: cp_fm_cusolver.o] Error 1
make[2]: *** [/my-path/cp2k-2023.2/Makefile:146: all] Error 2
make[1]: *** [/my-path/cp2k-2023.2/Makefile:128: ssmp] Error 2
make[1]: *** Waiting for unfinished jobs....
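
A quick way to narrow down this kind of failure is to check whether the header exists in any of the directories passed with -I. The sketch below does that for a list of directories; `find_header` is a hypothetical helper name and the directories have to be supplied by hand.

```shell
# Sketch: print the first include directory that contains the given header.
# find_header is a hypothetical helper name.
find_header() {
  header=$1
  shift
  for dir in "$@"; do
    if [ -f "$dir/$header" ]; then
      echo "$dir/$header"
      return 0
    fi
  done
  echo "$header not found in: $*" >&2
  return 1
}
```

For example, `find_header cal.h /my-path/nvidia/cuda-12.0.0/include` checks the CUDA include directory from the compile command above. If nothing is printed on standard output, the header is simply not installed in the searched locations.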

IronCub3 avatar Sep 25 '23 08:09 IronCub3

Note that cuSOLVERMp is still pretty new...

If you don't need it then simply remove --with-cusolvermp=install from your toolchain command line.

If you do want to try it then have a look at this script, which I've been using for installing cuSOLVERMp and its dependencies.

oschuett avatar Sep 25 '23 10:09 oschuett

Yep @oschuett, I found that this was the problem, so for now I have removed it. I also saw that version 2023.2 has no script that installs it, so that was the main issue. Thanks also for the script, I will try it for sure! Today I will also update this issue with the main problems I've faced in the toolchain, to make it useful for other users and maybe also to improve the code ;)

IronCub3 avatar Sep 25 '23 11:09 IronCub3