pair_allegro
LAMMPS-Allegro compile failed with pytorch 1.11.0 I build...
Hi,
On our cluster, the prebuilt libtorch 1.11.0 doesn't work properly with OpenMPI. I built LAMMPS-Allegro against the prebuilt libtorch 1.11.0, but when I submit a job with multiple GPUs, nothing is written to the output folder even though Slurm reports that the simulation is running.
So I built PyTorch 1.11.0 from source with CMake inside a virtual environment, using the following settings:
cmake \
-D BUILD_SHARED_LIBS:BOOL=ON -D CMAKE_BUILD_TYPE:STRING=Release -D BUILD_PYTHON:BOOL=OFF \
-D CMAKE_INSTALL_PREFIX=/home/Sourcecode_Pytorch1110 \
-D CMAKE_MPI_CXX_COMPILER=/cm/shared/userapps/scicomp/external/milan-a100/openmpi/4.1.1-gcc11.2.0-v2/bin/mpicxx \
-D CMAKE_MPI_C_COMPILER=/cm/shared/userapps/scicomp/external/milan-a100/openmpi/4.1.1-gcc11.2.0-v2/bin/mpicc \
-D PYTHON_LIBRARY='' -D USE_CUDA=ON -D BUILD_SHARED_LIBS=ON -D USE_DISTRIBUTED=ON ../ 2>&1| tee configure.log
Then I tried to configure LAMMPS-Allegro (with Kokkos and OpenMP) against the PyTorch I had compiled, from the same virtual environment. These are the CMake settings I used for LAMMPS-Allegro with Kokkos and OpenMP:
cmake \
-D CMAKE_BUILD_TYPE=Release \
-D CMAKE_INSTALL_PREFIX=$(pwd) \
-D PKG_OPENMP=ON \
-D PKG_KOKKOS=ON \
-D Kokkos_ENABLE_CUDA=ON \
-D Kokkos_ARCH_ZEN=ON \
-D CMAKE_PREFIX_PATH=/home/Sourcecode_Pytorch1110/build \
-D LD_LIBRARY_PATH=/home/Sourcecode_Pytorch1110/build/lib \
-D MKL_INCLUDE_DIR=`python -c "import sysconfig;from pathlib import Path;print(Path(sysconfig.get_paths()[\"include\"]).parent)"` \
../cmake 2>&1| tee configure.log
However, I get the following error messages when I try to configure LAMMPS-Allegro with OpenMP and Kokkos:
CMake Error at /home/Sourcecode_Pytorch1110/build/Caffe2Config.cmake:14 (include):
include could not find requested file:
/home/Sourcecode_Pytorch1110/build/public/utils.cmake
Call Stack (most recent call first):
/home/Sourcecode_Pytorch1110/build/TorchConfig.cmake:68 (find_package)
CMakeLists.txt:1082 (find_package)
CMake Error at /home/Sourcecode_Pytorch1110/build/Caffe2Config.cmake:17 (include):
include could not find requested file:
/home/Sourcecode_Pytorch1110/build/public/threads.cmake
Call Stack (most recent call first):
/home/Sourcecode_Pytorch1110/build/TorchConfig.cmake:68 (find_package)
CMakeLists.txt:1082 (find_package)
CMake Error at /home/Sourcecode_Pytorch1110/build/Caffe2Config.cmake:88 (include):
include could not find requested file:
/home/Sourcecode_Pytorch1110/build/public/cuda.cmake
Call Stack (most recent call first):
/home/Sourcecode_Pytorch1110/build/TorchConfig.cmake:68 (find_package)
CMakeLists.txt:1082 (find_package)
CMake Error at /home/Sourcecode_Pytorch1110/build/Caffe2Config.cmake:109 (include):
include could not find requested file:
/home/Sourcecode_Pytorch1110/build/public/mkl.cmake
Call Stack (most recent call first):
/home/Sourcecode_Pytorch1110/build/TorchConfig.cmake:68 (find_package)
CMakeLists.txt:1082 (find_package)
CMake Error at /home/Sourcecode_Pytorch1110/build/Caffe2Config.cmake:112 (include):
include could not find requested file:
/home/Sourcecode_Pytorch1110/build/public/mkldnn.cmake
Call Stack (most recent call first):
/home/Sourcecode_Pytorch1110/build/TorchConfig.cmake:68 (find_package)
CMakeLists.txt:1082 (find_package)
CMake Error at /home/Sourcecode_Pytorch1110/build/Caffe2Config.cmake:116 (include):
include could not find requested file:
/home/Sourcecode_Pytorch1110/build/Caffe2Targets.cmake
Call Stack (most recent call first):
/home/Sourcecode_Pytorch1110/build/TorchConfig.cmake:68 (find_package)
CMakeLists.txt:1082 (find_package)
CMake Error at /home/Sourcecode_Pytorch1110/build/TorchConfig.cmake:186 (set_target_properties):
set_target_properties Can not find target to add properties to: torch
Call Stack (most recent call first):
CMakeLists.txt:1082 (find_package)
CMake Error at /home/Sourcecode_Pytorch1110/build/TorchConfig.cmake:191 (set_property):
set_property could not find TARGET torch. Perhaps it has not yet been
created.
Call Stack (most recent call first):
CMakeLists.txt:1082 (find_package)
-- Found Torch: /home/Sourcecode_Pytorch1110/build/lib/libtorch.so
-- Configuring incomplete, errors occurred!
See also "/home/Sourcecode_LAMMPS_Allegro_cuda113_custompytorch1110_zeusgpu_20240725/build01/CMakeFiles/CMakeOutput.log".
I don't know what these errors mean. Do they mean my PyTorch 1.11.0 build is wrong?
The modules I loaded to compile PyTorch 1.11.0 and LAMMPS-Allegro in this virtual environment are:
module load gcc/8.5.0-gcc-milan-a100 cuda11.3 openmpi/4.1.1-gcc-milan-a100 cudnn/8.1.1.33-11.2-gcc-milan-a100 git cmake python39
I didn't specify any CXX, C, MPI_CXX, or MPI_C compiler in the CMake settings for LAMMPS-Allegro, only for PyTorch, and the PyTorch build didn't use the MPI C++ and C compilers I set... Could this be related to the error I see?
Thanks.
Hi,
For running with LAMMPS, PyTorch should not interact with or need to know anything about MPI, and PyTorch can safely be built with -DUSE_DISTRIBUTED=OFF. If your simulation is hanging, you may want to try with Kokkos - this can sometimes make device assignment more reliable. We've also seen esoteric hang-ups related to modules on certain clusters.
As for your self-built PyTorch, you may need to specify an install prefix and run make install, then point -DCMAKE_PREFIX_PATH to that install folder, which will have the correct/expected directory structure, when configuring LAMMPS. But since you have CUDA 11.3 available, the prebuilt PyTorch 1.11 with the CXX11 ABI should work (link).
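To check whether a folder is a usable install prefix before pointing LAMMPS at it, something like the sketch below may help. It assumes the layout that the prebuilt libtorch archives use, with the CMake package files under share/cmake/Torch and share/cmake/Caffe2; `check_torch_prefix` is a hypothetical helper name, not part of PyTorch or LAMMPS, and the path in the usage comment is a placeholder.

```shell
#!/usr/bin/env bash
# Sketch: verify that a directory looks like an *installed* libtorch tree
# (assumed layout: share/cmake/Torch and share/cmake/Caffe2 under the prefix).
# check_torch_prefix is a hypothetical helper, not a PyTorch or LAMMPS tool.
check_torch_prefix() {
    local prefix="$1"
    local missing=0
    local f
    for f in \
        share/cmake/Torch/TorchConfig.cmake \
        share/cmake/Caffe2/Caffe2Config.cmake \
        share/cmake/Caffe2/Caffe2Targets.cmake \
        share/cmake/Caffe2/public/utils.cmake
    do
        if [ ! -f "$prefix/$f" ]; then
            # Report each missing file on stderr and remember the failure.
            echo "missing: $prefix/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example usage (placeholder path):
#   check_torch_prefix /path/to/torch-install \
#     && echo "OK to pass as -D CMAKE_PREFIX_PATH=/path/to/torch-install"
```

If the check fails on a build folder but passes on the folder given as CMAKE_INSTALL_PREFIX after the install step, that would be consistent with the "could not find requested file" errors above, which all come from CMake looking for public/*.cmake and Caffe2Targets.cmake directly inside the build directory.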
Hmm, I think I built LAMMPS-Allegro against the prebuilt libtorch with Kokkos, but maybe I messed something up. Let me try both suggestions again from scratch; I will post the results after I build test executables. Thanks.
Remember to also add the appropriate run-time command line flags. For two nodes with 4 GPUs each, it should be
mpirun/srun/etc /path/to/lmp -sf kk -k on g 4 -pk kokkos newton on neigh full -in in.script
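As a rough sketch of how that line scales with the allocation, the snippet below assembles the command for a given node count and GPUs per node. It assumes one MPI rank per GPU and an `mpirun` launcher that takes `-np`; `build_lmp_cmd` and its arguments are illustrative names, and with srun or another launcher the rank count would be passed differently.

```shell
#!/usr/bin/env bash
# Sketch: print a launch line with one MPI rank per GPU (an assumption;
# adjust for your launcher and scheduler). build_lmp_cmd is a hypothetical helper.
build_lmp_cmd() {
    local nodes="$1"
    local gpus_per_node="$2"
    local lmp="$3"
    local input="$4"
    # Total ranks = nodes * GPUs per node; "-k on g N" takes the per-node GPU count.
    local ranks=$(( nodes * gpus_per_node ))
    echo "mpirun -np ${ranks} ${lmp} -sf kk -k on g ${gpus_per_node} -pk kokkos newton on neigh full -in ${input}"
}

# Two nodes with 4 GPUs each, as in the example above:
build_lmp_cmd 2 4 /path/to/lmp in.script
```

The function only prints the command, so it can be inspected before being pasted into a job script.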