HugeCTR
[BUG] Unable to run multi-node
Describe the bug
Followed the instructions provided in https://github.com/NVIDIA-Merlin/HugeCTR/tree/master/tutorial/multinode-training and set up the environment exactly as suggested, including building HugeCTR separately with MULTI_NODE_ENABLED. However, when trying to run it using run_multinode.sh, I receive the following error:
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
To Reproduce
Steps to reproduce the behavior:
- Build the Docker container using the instructions provided here: https://nvidia-merlin.github.io/HugeCTR/master/hugectr_contributor_guide.html#how-to-start-your-development
- Configure the build directory in run_multinode.sh
- Run bash run_multinode.sh

Expected behavior
Successful execution of the script.
Environment (please complete the following information):
- OS: Ubuntu 18.04
- Graphics card: NVIDIA P100
- CUDA version: CUDA 11.2
- Docker image: built from the Dockerfile provided here: https://github.com/NVIDIA-Merlin/Merlin/blob/main/docker/training/dockerfile.ctr
Hi @iidsample Thanks for trying out HugeCTR! About the multinode-training tutorial, unfortunately it's currently out of date and will be removed in the next release. For now, we provide a Docker image on Merlin NGC that already supports multi-node training for HugeCTR. You can use a cluster job scheduler such as srun to launch the job across multiple nodes. Thanks!
Hi @shijieliu,
Thanks for your reply. Is there some way to launch without Slurm, i.e. just on a set of nodes directly? It would be a great help if you could provide some direction or steps to do so. Thank you.
The key idea for launching multi-node training in HugeCTR is to use MPI, as https://github.com/NVIDIA-Merlin/HugeCTR/blob/master/tutorial/multinode-training/run_multinode.sh#L110 suggests. So the steps are:
- install and configure MPI on each of the nodes
- use the Docker image from Merlin NGC to launch a container on each node, then use MPI inside the containers to launch training (see the sketch below)
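For illustration, a minimal sketch of what the per-node training script could look like is shown below; the batch sizes, GPU indices, and script name are placeholders, and the data reader, optimizer, and model definition are omitted. Under these assumptions it would be launched from one node with something like mpirun -np 2 --hostfile hosts python train_multinode.py, where hosts lists the participating nodes.

```python
# Minimal multi-node HugeCTR sketch (placeholder values, model omitted).
# vvgpu must contain one GPU list per node / MPI rank.
import hugectr
from mpi4py import MPI  # initializes MPI when the script is launched via mpirun

solver = hugectr.CreateSolver(
    max_eval_batches=300,
    batchsize_eval=16384,
    batchsize=16384,
    lr=0.001,
    vvgpu=[[0, 1, 2, 3], [0, 1, 2, 3]],  # 2 nodes x 4 GPUs each
    repeat_dataset=True,
)
# ... reader, optimizer, and model definition follow as in the single-node samples.
```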
Hi @shijieliu,
Thanks for your quick reply. Unfortunately, I have been having a lot of trouble setting up MPI in the container to launch training, essentially running mpirun from within the container. By any chance, are you aware of a resource or guide about running MPI from within a container?
Thank you so much for your help.
Hi,
I have been trying to run HugeCTR in distributed mode. When I try to run mpirun with dcn_2node_8gpu.py, I get the following error: Runtime error: Error: the MPI total rank doesn't match the node count
I have made sure that the number of GPUs passed in the vvgpu parameter is correct.
It would be great if you could help me with this.
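As a quick sanity check (assuming mpi4py is available in the container), a snippet like the following prints the rank layout that mpirun actually creates; the reported size must equal len(vvgpu) in the solver configuration, i.e. one MPI rank per node entry.

```python
# Hypothetical sanity check: compare the MPI world size with len(vvgpu).
# HugeCTR expects exactly one MPI rank per node entry in vvgpu.
from mpi4py import MPI

comm = MPI.COMM_WORLD
print(f"rank {comm.Get_rank()} of {comm.Get_size()} on {MPI.Get_processor_name()}")
```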
Hi @iidsample
Could you provide more detailed logs and scripts? Thanks!
Hi @iidsample, we are wondering if you have solved the problem. Thanks!
Hi @iidsample, the multinode tutorial (https://github.com/NVIDIA-Merlin/HugeCTR/tree/master/tutorial/multinode-training) has now been updated; you can use the script in the tutorial to submit a multi-node task with MPI. Please check if this update works for you.
Hi @iidsample, because this issue has been open for a long time, we will close it now. If you have another question, you can reopen this issue and comment.