HugeCTR [BUG] Unable to run multi-node

Describe the bug

I followed the instructions provided in https://github.com/NVIDIA-Merlin/HugeCTR/tree/master/tutorial/multinode-training and set up the environment exactly as suggested, including building HugeCTR separately with MULTI_NODE_ENABLED. However, when I try to run it using run_multinode.sh, I receive the following error:

ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
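Two of the causes listed in this message (binaries missing from PATH and an unwritable temporary directory) can be checked on each node, inside the container, before launching. The following is only a minimal sketch: the function name `local_orte_precheck` is hypothetical, and it assumes Open MPI uses the default temporary directory reported by Python. It inspects the local node only and does not test SSH or network connectivity between nodes.

```python
import os
import shutil
import tempfile


def local_orte_precheck():
    """Report local settings related to two causes named in the ORTE message."""
    # Assumption: Open MPI writes its startup files to the default temp dir.
    tmpdir = tempfile.gettempdir()
    return {
        # Absolute path of the Open MPI daemon if it is on PATH, else None.
        "orted_path": shutil.which("orted"),
        # Absolute path of the launcher if it is on PATH, else None.
        "mpirun_path": shutil.which("mpirun"),
        # Library search path seen by this process (empty string if unset).
        "ld_library_path": os.environ.get("LD_LIBRARY_PATH", ""),
        "tmpdir": tmpdir,
        # True if the current user may write into the temp directory.
        "tmpdir_writable": os.access(tmpdir, os.W_OK),
    }


if __name__ == "__main__":
    for key, value in local_orte_precheck().items():
        print(f"{key}: {value}")
```

If `orted_path` is None on any node, the first cause in the message is the likely one for that node.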

To Reproduce

Steps to reproduce the behavior:

1. Build the Docker container using the instructions provided here: https://nvidia-merlin.github.io/HugeCTR/master/hugectr_contributor_guide.html#how-to-start-your-development
2. Configure the build directory in run_multinode.sh
3. Run bash run_multinode.sh

Expected behavior

Successful execution of the script.

Environment (please complete the following information):

OS: Ubuntu 18.04
Graphics card: NVIDIA P100
CUDA version: CUDA 11.2
Docker image: built from the Dockerfile provided here https://github.com/NVIDIA-Merlin/Merlin/blob/main/docker/training/dockerfile.ctr

Apr 08 '22 02:04 iidsample

Hi @iidsample, thanks for trying out HugeCTR! Unfortunately, the multinode-training tutorial is currently out of date and will be removed in the next release. For now, we provide a Docker image in Merlin NGC that already supports multi-node training for HugeCTR. You can use a cluster job scheduler such as Slurm (via srun) to launch the job on multiple nodes. Thanks!

Apr 08 '22 08:04 shijieliu

Hi @shijieliu,

Thanks for your reply. Is there some way to launch without Slurm, for example directly on a set of nodes? It would be a great help if you could provide some direction or steps to do so. Thank you.

Apr 08 '22 10:04 iidsample

The key idea for launching multi-node training in HugeCTR is to use MPI, as https://github.com/NVIDIA-Merlin/HugeCTR/blob/master/tutorial/multinode-training/run_multinode.sh#L110 suggests. So the steps can be:

1. Install and configure MPI on all of the nodes.
2. Use the Docker image in Merlin NGC to launch a container on each node. Use MPI in the container to launch training.
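The steps above can be sketched as a small helper that assembles the launch command. This is a minimal sketch, not the tutorial's script: it assumes Open MPI's `mpirun`, passwordless SSH between the containers, and one MPI rank per node. The helper `build_mpirun_command` and the host names `node1` and `node2` are hypothetical; `dcn_2node_8gpu.py` is the sample script mentioned later in this thread.

```python
import shlex


def build_mpirun_command(hosts, script, extra_env=(), allow_root=False):
    """Return an Open MPI mpirun argument list that starts one rank per host."""
    if not hosts:
        raise ValueError("at least one host is required")
    # -np sets the total rank count; --host lists the nodes to run on.
    cmd = ["mpirun", "-np", str(len(hosts)), "--host", ",".join(hosts)]
    if allow_root:
        # Open MPI refuses to run as root unless this flag is given.
        cmd.append("--allow-run-as-root")
    for name in extra_env:
        # -x forwards an environment variable to the remote ranks.
        cmd += ["-x", name]
    cmd += ["python3", script]
    return cmd


if __name__ == "__main__":
    # Hypothetical host names; replace them with the real node addresses.
    command = build_mpirun_command(
        ["node1", "node2"], "dcn_2node_8gpu.py", extra_env=["LD_LIBRARY_PATH"]
    )
    print(shlex.join(command))
```

The printed command is meant to be run from inside the container on one of the nodes. Keeping the rank count equal to the number of hosts is a deliberate choice here, on the assumption that HugeCTR expects one process per node.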

Apr 11 '22 01:04 shijieliu

Hi @shijieliu,

Thanks for your quick reply. Unfortunately, I have been having a lot of trouble setting up MPI in the container to launch training, that is, running mpirun from within the container. By any chance, are you aware of a resource or guide about running MPI from within a container?

Thank you so much for your help.

Apr 11 '22 01:04 iidsample

Hi,

I have been trying to run HugeCTR in distributed mode. When I try to run mpirun with dcn_2node_8gpu.py, I get the following error: Runtime error: Error: the MPI total rank doesn't match the node count

I have made sure that the number of GPUs passed in the vvgpu parameter is correct.

It would be great if you could help me with this.
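One reading of this error message, not confirmed in this thread, is that the number of MPI ranks started by mpirun differs from the number of per-node GPU lists in vvgpu. The sketch below checks that condition from inside a rank. It assumes Open MPI, which sets the `OMPI_COMM_WORLD_SIZE` environment variable for each process it launches; the helper `check_rank_count` and the example vvgpu value are hypothetical.

```python
import os


def check_rank_count(vvgpu, env=None):
    """Compare the Open MPI world size with the number of nodes in vvgpu.

    Returns (matches, world_size, node_count).
    """
    env = os.environ if env is None else env
    # Open MPI exports the total rank count; default to 1 outside mpirun.
    world_size = int(env.get("OMPI_COMM_WORLD_SIZE", "1"))
    # Assumption: vvgpu holds one inner list of GPU ids per node.
    node_count = len(vvgpu)
    return world_size == node_count, world_size, node_count


if __name__ == "__main__":
    # Hypothetical layout: two nodes with four GPUs each.
    vvgpu = [[0, 1, 2, 3], [0, 1, 2, 3]]
    matches, world_size, node_count = check_rank_count(vvgpu)
    if not matches:
        print(f"MPI ranks: {world_size}, nodes in vvgpu: {node_count}")
```

Under this assumption, a two-node vvgpu needs mpirun to start exactly two ranks, for example with `-np 2`.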

Apr 18 '22 14:04 iidsample

Hi @iidsample

Could you provide a more detailed log and your scripts? Thanks!

Apr 19 '22 06:04 shijieliu

Hi @iidsample, have you been able to solve the problem? Thanks!

May 02 '22 02:05 zehuanw

Hi @iidsample, the multinode tutorial (https://github.com/NVIDIA-Merlin/HugeCTR/tree/master/tutorial/multinode-training) has now been updated. You can use the script in the tutorial to submit a multi-node task with MPI. Please check whether this update works for you.

Aug 19 '22 06:08 kanghui0204

Hi @iidsample, because this issue has been open for a long time, we will close it now. If you have another question, you can reopen this issue and comment.

Sep 05 '22 00:09 kanghui0204
