
[BUG] Unable to run multi-node

Open iidsample opened this issue 2 years ago • 7 comments

Describe the bug I followed the instructions provided in https://github.com/NVIDIA-Merlin/HugeCTR/tree/master/tutorial/multinode-training and set up the environment exactly as suggested, including building HugeCTR separately with MULTI_NODE_ENABLED. However, when I try to run it using run_multinode.sh, I receive the following error:

ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).

To Reproduce Steps to reproduce the behavior:

  1. Build the Docker container using the instructions provided here: https://nvidia-merlin.github.io/HugeCTR/master/hugectr_contributor_guide.html#how-to-start-your-development
  2. Configure the build directory in run_multinode.sh
  3. Run bash run_multinode.sh

Expected behavior Successful execution of the script.

Environment (please complete the following information):

  • OS: Ubuntu 18.04
  • Graphic card: Nvidia P100
  • CUDA version: CUDA 11.2
  • Docker image - Followed the docker file provided here https://github.com/NVIDIA-Merlin/Merlin/blob/main/docker/training/dockerfile.ctr

iidsample avatar Apr 08 '22 02:04 iidsample

Hi @iidsample, thanks for trying out HugeCTR! Unfortunately, the multinode-training tutorial is currently out of date and will be removed in the next release. For now, we provide a Docker image in Merlin NGC that already supports multi-node training for HugeCTR. You can use a cluster job scheduler such as srun to launch the job on multiple nodes. Thanks!

shijieliu avatar Apr 08 '22 08:04 shijieliu

Hi @shijieliu,

Thanks for your reply. Is there some way to launch without Slurm, for example on just a bunch of nodes? It would be a great help if you could provide some direction or steps to do so. Thank you.

iidsample avatar Apr 08 '22 10:04 iidsample

The key idea for launching multi-node training in HugeCTR is to use MPI, as https://github.com/NVIDIA-Merlin/HugeCTR/blob/master/tutorial/multinode-training/run_multinode.sh#L110 suggests. So the steps can be:

  1. Install and configure MPI on each of the nodes.
  2. Use the Docker image from Merlin NGC to launch a container on each node, then use MPI inside the container to launch training.
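
As a rough sketch of step 2, the launch command could be assembled like this. It assumes Open MPI's `mpirun` (its `-np` and `-H` options), passwordless SSH between the containers, and one MPI rank per node. The host names and `train.py` are hypothetical placeholders, not taken from the tutorial.

```python
import shlex


def build_mpirun_command(hosts, script, python="python3"):
    """Build an Open MPI launch command with one rank per node.

    `hosts` and `script` are hypothetical placeholders for illustration.
    """
    if not hosts:
        raise ValueError("at least one host is required")
    # One slot per host, so each node runs exactly one training process.
    host_arg = ",".join(f"{h}:1" for h in hosts)
    return ["mpirun", "-np", str(len(hosts)), "-H", host_arg, python, script]


if __name__ == "__main__":
    cmd = build_mpirun_command(["node1", "node2"], "train.py")
    # Print a shell-ready command line to run from inside one container.
    print(shlex.join(cmd))
```

Building the command as a list keeps the host count and `-np` in sync, which matters if HugeCTR expects exactly one rank per node.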

shijieliu avatar Apr 11 '22 01:04 shijieliu

Hi @shijieliu,

Thanks for your quick reply. Unfortunately, I have been having a lot of trouble setting up MPI in the container to launch training, essentially running mpirun from within the container. By any chance, are you aware of a resource or guide about running MPI from within a container?

Thank you so much for your help.

iidsample avatar Apr 11 '22 01:04 iidsample

Hi,

I have been trying to run HugeCTR in distributed mode. When I try to run mpirun with dcn_2node_8gpu.py, I get the following error: Runtime error: Error: the MPI total rank doesn't match the node count

I have made sure that the number of GPUs passed in the vvgpu parameter is correct.
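
For reference, the relationship between `vvgpu` and the rank count can be sanity-checked with a sketch like the one below. It assumes, as the error text suggests, that HugeCTR expects one MPI rank per node, meaning one rank per inner list of `vvgpu`, and that Open MPI is the launcher (it exports `OMPI_COMM_WORLD_SIZE` to each rank). The helper name and the GPU layout are hypothetical.

```python
import os


def check_rank_count(vvgpu, world_size):
    """Return True when the MPI world size equals the number of nodes in vvgpu."""
    return world_size == len(vvgpu)


if __name__ == "__main__":
    # Hypothetical layout: two nodes with four GPUs each, so two ranks expected.
    vvgpu = [[0, 1, 2, 3], [0, 1, 2, 3]]
    # Open MPI sets OMPI_COMM_WORLD_SIZE; default to 1 when run outside mpirun.
    world_size = int(os.environ.get("OMPI_COMM_WORLD_SIZE", "1"))
    if not check_rank_count(vvgpu, world_size):
        print(f"mismatch: {world_size} rank(s) for {len(vvgpu)} node(s)")
```

Under that assumption, launching with one rank per GPU rather than one per node would trigger a mismatch even when the GPU IDs themselves are right.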

It would be great if you could help me with this.

iidsample avatar Apr 18 '22 14:04 iidsample

Hi @iidsample

Could you provide a more detailed log and your scripts? Thanks!

shijieliu avatar Apr 19 '22 06:04 shijieliu

Hi @iidsample, we are wondering whether you have solved the problem. Thanks!

zehuanw avatar May 02 '22 02:05 zehuanw

Hi @iidsample, we have now updated the multinode tutorial (https://github.com/NVIDIA-Merlin/HugeCTR/tree/master/tutorial/multinode-training). You can use the script in the tutorial to submit a multinode task with MPI. Please check whether this update works for you.

kanghui0204 avatar Aug 19 '22 06:08 kanghui0204

Hi @iidsample, because this issue has been open for a long time, we will close it now. If you have another question, you can reopen this issue and comment.

kanghui0204 avatar Sep 05 '22 00:09 kanghui0204