Template Multinode GPU Error
Description
In the GreatLakes template (and potentially others), multi-node GPU submissions incorrectly set --ntasks-per-node to the total number of tasks, disregarding the size of an individual node.
To Reproduce
Request a multi-node GPU submission with --pretend and view the output. Here is an example:
#SBATCH --job-name="TempProject/42b7b4f2921788ea14dac5566e6f06d0/foo/13ee8c7cb17a11b218fe41a3e31afab3"
#SBATCH --partition=gpu
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --gpus=8
Given a request with nranks=4 and ngpu=8, this should be --ntasks-per-node=2, since there are 2 GPUs per node on the GreatLakes GPU partition.
This problem may exist in other environments and was propagated to #561, so we should check other templates for this logical error.
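For reference, the node-aware calculation the template presumably needs looks roughly like the sketch below. This is a hypothetical helper, not signac-flow's actual API; it assumes one MPI task per GPU and uses gpus_per_node=2 to match the GreatLakes GPU partition:

import math

def gpu_directives(ngpu, gpus_per_node=2):
    """Hypothetical sketch: derive node-aware SBATCH values for a GPU job."""
    # The node count follows from how many GPUs fit on a single node.
    nodes = math.ceil(ngpu / gpus_per_node)
    # Distribute one task per GPU across those nodes, rather than writing
    # the total task count into --ntasks-per-node.
    ntasks_per_node = math.ceil(ngpu / nodes)
    return {"--nodes": nodes, "--ntasks-per-node": ntasks_per_node, "--gpus": ngpu}

# 8 GPUs at 2 GPUs per node -> --nodes=4, --ntasks-per-node=2, --gpus=8
print(gpu_directives(ngpu=8))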
While #722 produces a proper resource request for sbatch, the resulting job still fails to run correctly. This request
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --gpus=4
...
# hard_disk_nvt_gpu(agg-88ab2b6305a8b249743764fc16e5c22e)
/home/joaander/miniforge3/bin/python /gpfs/accounts/sglotzer_root/sglotzer0/joaander/hoomd-validation/hoomd_validation/project.py run -o hard_disk_nvt_gpu -j agg-88ab2b6305a8b249743764fc16e5c22e
# Eligible to run:
# mpirun -n 4 /home/joaander/miniforge3/bin/python /gpfs/accounts/sglotzer_root/sglotzer0/joaander/hoomd-validation/hoomd_validation/project.py exec hard_disk_nvt_gpu agg-88ab2b6305a8b249743764fc16e5c22e
produces:
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
When I use this instead:
#SBATCH --ntasks=4
#SBATCH --gpus-per-task=1
#SBATCH --mem-per-cpu="30g"
mpirun is able to launch hoomd, but somehow SLURM_LOCALID is 0 on all ranks. I will troubleshoot that further when testing the solution in #777. In the meantime, multi-GPU jobs are still broken on Great Lakes.
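A minimal way to confirm what each rank actually sees (assuming mpi4py is installed in the same environment) is to print SLURM_LOCALID from every rank; this is a hypothetical diagnostic script, not part of the project:

import os
from mpi4py import MPI

comm = MPI.COMM_WORLD
# Each rank reports its MPI rank and the SLURM_LOCALID it inherited.
print(f"rank {comm.Get_rank()}: SLURM_LOCALID={os.environ.get('SLURM_LOCALID')}")

Running it inside the job with, e.g., mpirun -n 4 python check_localid.py should show distinct local IDs within a node (0 and 1 with two tasks per node) rather than 0 on every rank.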