Template Multinode GPU Error
Description
In the GreatLakes template (and potentially others), multi-node GPU submissions incorrectly set --ntasks-per-node to the total number of tasks, disregarding the size of an individual node.
To Reproduce
Request a multi-node GPU submission with --pretend and view the output. Here is an example:
#SBATCH --job-name="TempProject/42b7b4f2921788ea14dac5566e6f06d0/foo/13ee8c7cb17a11b218fe41a3e31afab3"
#SBATCH --partition=gpu
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --gpus=8
Given a request with nranks=4 and ngpu=8, this should be --ntasks-per-node=2, since there are 2 GPUs per node on the GreatLakes GPU partition.
This problem may exist in other environments and was propagated to #561, so we should check other templates for this logical error.
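For reference, the node-aware calculation the template presumably needs looks roughly like the sketch below. This is a hypothetical helper, not signac-flow's actual API; it assumes one MPI task per GPU and uses gpus_per_node=2 to match the GreatLakes GPU partition:

import math

def gpu_directives(ngpu, gpus_per_node=2):
    """Hypothetical sketch: derive node-aware SBATCH values for a GPU job."""
    # The node count follows from how many GPUs fit on a single node.
    nodes = math.ceil(ngpu / gpus_per_node)
    # Distribute one task per GPU across those nodes, rather than writing
    # the total task count into --ntasks-per-node.
    ntasks_per_node = math.ceil(ngpu / nodes)
    return {"--nodes": nodes, "--ntasks-per-node": ntasks_per_node, "--gpus": ngpu}

# 8 GPUs at 2 GPUs per node -> --nodes=4, --ntasks-per-node=2, --gpus=8
print(gpu_directives(ngpu=8))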
While #722 produces a proper resource request for sbatch, the resulting job still fails to run correctly. This request
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --gpus=4
...
# hard_disk_nvt_gpu(agg-88ab2b6305a8b249743764fc16e5c22e)
/home/joaander/miniforge3/bin/python /gpfs/accounts/sglotzer_root/sglotzer0/joaander/hoomd-validation/hoomd_validation/project.py run -o hard_disk_nvt_gpu -j agg-88ab2b6305a8b249743764fc16e5c22e
# Eligible to run:
# mpirun -n 4 /home/joaander/miniforge3/bin/python /gpfs/accounts/sglotzer_root/sglotzer0/joaander/hoomd-validation/hoomd_validation/project.py exec hard_disk_nvt_gpu agg-88ab2b6305a8b249743764fc16e5c22e
produces:
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
When I use this instead:
#SBATCH --ntasks=4
#SBATCH --gpus-per-task=1
#SBATCH --mem-per-cpu="30g"
mpirun is able to launch hoomd, but somehow SLURM_LOCALID is 0 on all ranks. I will troubleshoot that further when testing the solution in #777. In the meantime, multi-GPU jobs are still broken on Great Lakes.
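A minimal way to confirm what each rank actually sees (assuming mpi4py is installed in the same environment) is to print SLURM_LOCALID from every rank; this is a hypothetical diagnostic script, not part of the project:

import os
from mpi4py import MPI

comm = MPI.COMM_WORLD
# Each rank reports its MPI rank and the SLURM_LOCALID it inherited.
print(f"rank {comm.Get_rank()}: SLURM_LOCALID={os.environ.get('SLURM_LOCALID')}")

Running it inside the job with, e.g., mpirun -n 4 python check_localid.py should show distinct local IDs within a node (0 and 1 with two tasks per node) rather than 0 on every rank.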