Building and running with GPUs
Hi,
I'm trying to get AMUSE up and running with GPUs, but haven't had any success. Specifically, I want petar and fastkick to run on GPUs. I've been using this script to build AMUSE:
```bash
#! /bin/bash
module purge
module load foss/2022a
module load CUDA/11.7.0
module load GSL/2.7-GCC-11.3.0
module load Miniconda3/23.1.0-1
#
# Change path below to where you want this installed:
#
export INST_DIR=$HOME/soft/amuse-gpu
#
#
export ENV_DIR=$INST_DIR/amuse-env
echo "This script will install AMUSE in the directory: $INST_DIR"
read -r -p "Are you sure? [y/N]" -n 1
echo
if [[ "$REPLY" =~ ^[Yy]$ ]]; then
    mkdir -p $INST_DIR
    cd $INST_DIR
    conda create -y --prefix $ENV_DIR --copy python=3.10
    conda init bash
    conda activate $ENV_DIR
    conda install -y mpi4py docutils numpy pytest h5py matplotlib scipy astropy pandas seaborn
    # edit from here to instead install amuse from source so we can configure it with GPU eventually
    cd $INST_DIR
    git clone -b feature/galaxy-cluster https://github.com/fredt00/amuse.git
    cd amuse
    pip install --upgrade pip
    pip install -e . --no-cache-dir
    ./configure --enable-cuda
    make framework
    make petar.code
    make bhtree.code
    make fastkick.code
    make halogen.code
    make hop.code
    make fi.code
fi
```
But it always fails at `./configure`, complaining: `configure: error: cannot find cuda runtime libraries in /apps/system/easybuild/software/CUDA/11.7.0/lib /apps/system/easybuild/software/CUDA/11.7.0/lib64`.

This slightly convoluted installation seems to be the only way to get MPI working correctly: my non-GPU installation only works at runtime if I use Miniconda as above.
I tried running just `./configure` and then manually editing config.mk to:

```make
CUDA_ENABLED=yes
NVCC=/apps/system/easybuild/software/CUDA/11.7.0/bin/nvcc
NVCC_FLAGS=
CUDA_TK=/apps/system/easybuild/software/CUDA/11.7.0
CUDA_LIBS=-L/apps/system/easybuild/software/CUDA/11.7.0/targets/x86_64-linux/lib/stubs -lcuda -L/apps/system/easybuild/software/CUDA/11.7.0/lib64 -lcudart
```
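(For context: on this system libcudart isn't under `lib` or `lib64` at all, it's under `targets/x86_64-linux/lib`, which is presumably why configure's search comes up empty. The mismatch can be sketched with a throwaway fake toolkit tree, since the real paths are site-specific:)

```shell
# Recent CUDA toolkits keep the runtime under targets/<arch>-linux/lib
# rather than lib/ or lib64/. Mimic that layout in a temporary directory:
CUDA_TK=$(mktemp -d)
mkdir -p "$CUDA_TK/targets/x86_64-linux/lib"
touch "$CUDA_TK/targets/x86_64-linux/lib/libcudart.so"

# The locations configure searches are empty on such a layout:
ls "$CUDA_TK/lib" "$CUDA_TK/lib64" 2>/dev/null || echo "nothing in lib/ or lib64/"

# Searching the whole toolkit root finds the runtime:
find "$CUDA_TK" -name 'libcudart*'
```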
And the GPU versions of the codes built successfully. However, running them with this script:
```bash
#!/bin/bash -l
#SBATCH -J galaxy-cluster
#SBATCH -o galaxy-cluster.%J.out
#SBATCH -e galaxy-cluster.%J.err
#SBATCH --partition=devel
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=28
#SBATCH --gpus=4
#SBATCH --mem-per-cpu=4000
#SBATCH --time=00:10:00

export OMPI_MCA_rmaps_base_oversubscribe=yes
export OMPI_MCA_mpi_warn_on_fork=0
export OMP_STACKSIZE=128M
export OMP_NUM_THREADS=2
ulimit -s unlimited

module purge
module load foss/2022a
module load GSL/2.7-GCC-11.3.0
module load Miniconda3/23.1.0-1
conda activate /home/oxfd1327/soft/amuse-gpu/amuse-env

nvidia-smi
mpirun python -u $@
```
And I get the error:
```
/home/oxfd1327/soft/amuse-gpu/amuse/src/amuse/rfi/core.py:964: UserWarning: MPI (unexpectedly?) not available, falling back to sockets channel
  warnings.warn("MPI (unexpectedly?) not available, falling back to sockets channel")
**********************************************************
mpiexec does not support recursive calls
**********************************************************
Traceback (most recent call last):
  File "/home/oxfd1327/soft/amuse-gpu/amuse/src/amuse/rfi/channel.py", line 1778, in accept_worker_connection
    return server_socket.accept()
  File "/home/oxfd1327/soft/amuse-gpu/amuse-env/lib/python3.10/socket.py", line 293, in accept
    fd, addr = self._accept()
TimeoutError: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/oxfd1327/soft/amuse-gpu/amuse/examples/fred/galaxy_cluster_master.py", line 349, in <module>
    main(**o.__dict__)
  File "/home/oxfd1327/soft/amuse-gpu/amuse/examples/fred/galaxy_cluster_master.py", line 207, in main
    cluster = star_cluster(code=petar,code_converter=converter_petar, W0=W0, r_tidal=r_tidal,r_half=r_half, n_particles=N_cluster, M_cluster=M_cluster, field_code=FastKick,field_code_number_of_workers=1,code_number_of_workers=2)
  File "/home/oxfd1327/soft/amuse-gpu/amuse/src/amuse/ext/derived_grav_systems.py", line 94, in __init__
    self.bound=code(self.converter, mode='gpu',number_of_workers=code_number_of_workers)
  File "/home/oxfd1327/soft/amuse-gpu/amuse/src/amuse/community/petar/interface.py", line 409, in __init__
    petarInterface(**keyword_arguments),
  File "/home/oxfd1327/soft/amuse-gpu/amuse/src/amuse/community/petar/interface.py", line 38, in __init__
    CodeInterface.__init__(
  File "/home/oxfd1327/soft/amuse-gpu/amuse/src/amuse/rfi/core.py", line 748, in __init__
    self._start(name_of_the_worker = name_of_the_worker, **options)
  File "/home/oxfd1327/soft/amuse-gpu/amuse/src/amuse/rfi/core.py", line 776, in _start
    self.channel.start()
  File "/home/oxfd1327/soft/amuse-gpu/amuse/src/amuse/rfi/channel.py", line 1962, in start
    self.socket, address = self.accept_worker_connection(server_socket, self.process)
  File "/home/oxfd1327/soft/amuse-gpu/amuse/src/amuse/rfi/channel.py", line 1782, in accept_worker_connection
    raise exceptions.CodeException('could not connect to worker, worker process terminated')
amuse.support.exceptions.CodeException: could not connect to worker, worker process terminated
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[27418,1],0]
  Exit code:    1
--------------------------------------------------------------------------
```
Is there anything obviously wrong with this process? Any help would be greatly appreciated!
Cheers, Fred
Hi Fred,
There are a few things that jump out to me in your scripts:
- neither of them load an MPI module,
- the compile script loads the CUDA module but the run script doesn't,
- you're starting Python using mpirun.
If MPI is available on the machine without a module and that's the one you want to use, then point 1 should be okay.

For point 2, this could mean you compile against a different CUDA (the one from the module) than you run with (some other version on the system), and that tends to cause problems. It's important to run in the same environment that you compiled in.

I think point 3 is the cause of the `mpiexec does not support recursive calls` message. AMUSE uses MPI differently from most applications: instead of many copies of your script running in parallel, there is only one copy, which dynamically creates parallel community-code instances within your allocation as needed. So you should start your script without mpirun; AMUSE will call MPI itself if needed (it also has other ways of starting workers).
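Concretely, that means the last line of your job script becomes a plain Python invocation; a sketch of just that tail (everything else in the script stays as it is):

```shell
# before: mpirun python -u "$@"
# after:  start the AMUSE script directly; AMUSE spawns its MPI worker
#         processes itself, inside the Slurm allocation
python -u "$@"
```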
Oh, and about `./configure` failing to detect the CUDA libraries: that's an interesting one. I'm currently working on the build system and have rewritten the CUDA detection logic, because CUDA has changed over time and it could use an update. I'm going to check that the new system works with this directory layout and, if it doesn't, fix it.

Thanks for reporting this even though you've already worked around it; it's much better to fix things like this on the AMUSE side, where we can fix it for everyone else too.
Thanks for the advice! In terms of your suggestions:
- So foss/2022a is actually a bunch of modules; sorry, I should have provided the full list:

```
 1) GCCcore/11.3.0                     13) libfabric/1.15.1-GCCcore-11.3.0
 2) zlib/1.2.12-GCCcore-11.3.0         14) PMIx/4.1.2-GCCcore-11.3.0
 3) binutils/2.38-GCCcore-11.3.0       15) UCC/1.0.0-GCCcore-11.3.0
 4) GCC/11.3.0                         16) OpenMPI/4.1.4-GCC-11.3.0
 5) numactl/2.0.14-GCCcore-11.3.0      17) OpenBLAS/0.3.20-GCC-11.3.0
 6) XZ/5.2.5-GCCcore-11.3.0            18) FlexiBLAS/3.2.0-GCC-11.3.0
 7) libxml2/2.9.13-GCCcore-11.3.0      19) FFTW/3.3.10-GCC-11.3.0
 8) libpciaccess/0.16-GCCcore-11.3.0   20) gompi/2022a
 9) hwloc/2.7.1-GCCcore-11.3.0         21) FFTW.MPI/3.3.10-gompi-2022a
10) OpenSSL/1.1                        22) ScaLAPACK/2.2.0-gompi-2022a-fb
11) libevent/2.1.12-GCCcore-11.3.0     23) foss/2022a
12) UCX/1.12.1-GCCcore-11.3.0
```
Points 2 and 3 are both good ones. I've removed mpirun and loaded CUDA, and now I just get the warning:
```
/home/oxfd1327/soft/amuse-gpu/amuse/src/amuse/rfi/core.py:964: UserWarning: MPI (unexpectedly?) not available, falling back to sockets channel
  warnings.warn("MPI (unexpectedly?) not available, falling back to sockets channel")
```
And my code runs, although I don't see any speed-up compared to when I configured without GPUs, so I'm wondering whether it is configured correctly. Do you know of a way to confirm the GPU utilisation? Running nvidia-smi before the python call shows that I am being allocated the requested GPUs, but I can't see any information about their usage with seff, for example.
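One thing I thought of is polling nvidia-smi in the background from the job script while the model runs, along these lines (untested sketch; the interval and filename are arbitrary):

```shell
# log utilisation and memory for each allocated GPU every 10 seconds
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 10 > gpu_usage.csv &
MONITOR_PID=$!

python -u galaxy_cluster_master.py    # the actual run

kill "$MONITOR_PID"
```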
In my script petar is called with:

```python
self.bound = code(self.converter, mode='gpu', number_of_workers=code_number_of_workers)
```

Is this the correct way to get petar to use GPUs? I can't see any mention of GPUs in the petar interface files.
PeTar in AMUSE currently doesn't use the GPU; enabling it would require at least manually modifying the Makefile, and probably more modifications than that.
Ah OK, that makes sense. I've been trying to see if FastKick will run on the GPUs, but strangely I get this error every few bridge timesteps:
```
Traceback (most recent call last):
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/examples/fred/galaxy_cluster_master.py", line 353, in <module>
    main(**o.__dict__)
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/examples/fred/galaxy_cluster_master.py", line 268, in main
    integrator.evolve_model(time)
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/couple/bridge.py", line 598, in evolve_model
    return self.evolve_joined_leapfrog(tend, timestep)
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/couple/bridge.py", line 624, in evolve_joined_leapfrog
    self.kick_codes(timestep / 2.0)
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/couple/bridge.py", line 756, in kick_codes
    de += x.kick(dt)
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/couple/bridge.py", line 478, in kick
    self.kick_with_field_code(
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/couple/bridge.py", line 516, in kick_with_field_code
    ax,ay,az=field_code.get_gravity_at_point(
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/couple/bridge.py", line 146, in get_gravity_at_point
    return code.get_gravity_at_point(radius, x, y, z)
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/support/methods.py", line 168, in __call__
    result = self.method(*list_arguments, **keyword_arguments)
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/support/methods.py", line 166, in __call__
    object = self.precall()
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/support/methods.py", line 215, in precall
    return self.definition.precall(self)
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/support/interface.py", line 373, in precall
    transition.do()
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/support/state.py", line 123, in do
    self.method.new_method()()
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/support/methods.py", line 168, in __call__
    result = self.method(*list_arguments, **keyword_arguments)
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/support/methods.py", line 168, in __call__
    result = self.method(*list_arguments, **keyword_arguments)
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/support/methods.py", line 168, in __call__
    result = self.method(*list_arguments, **keyword_arguments)
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/support/methods.py", line 170, in __call__
    result = self.convert_result(result)
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/support/methods.py", line 209, in convert_result
    return self.definition.convert_result(self, result)
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/support/interface.py", line 682, in convert_result
    return self.handle_return_value(method, result)
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/support/interface.py", line 614, in handle_as_unit
    unit.append_result_value(method, self, value, result)
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/support/interface.py", line 70, in append_result_value
    self.convert_result_value(method, definition, value)
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/support/interface.py", line 80, in convert_result_value
    definition.handle_errorcode(errorcode)
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/support/interface.py", line 586, in handle_errorcode
    raise exceptions.AmuseException(
amuse.support.exceptions.AmuseException: Error when calling 'commit_particles' of a '<class 'amuse.community.fastkick.interface.FastKick'>', errorcode is -3
```
It seems to happen randomly, but usually after the third bridge timestep. Any idea what could be causing this? Is it a GPU configuration problem? It doesn't seem to happen with mode='cpu'.