
Support for configuration issues

Open · EmmaQiaoCh opened this issue 1 year ago • 1 comment

There is an issue related to the MLPerf DLRMv2 benchmark; the original report is at: https://github.com/mlcommons/training_results_v3.0/issues/5

Describe the bug
Hi, I am trying to bring up a multi-node GPU HugeCTR training benchmark using the code at https://github.com/mlcommons/training_results_v3.0/tree/main/NVIDIA/benchmarks/dlrm_dcnv2/implementations/hugectr

For a single node I am able to run the benchmark, but when executing the multi-node case (say, 2 nodes) I am facing the issue shown below. Could you please help me resolve it?

[HCTR][17:28:15.456][WARNING][RK0][main]: The model name is not specified when creating the solver.
[1695144496.484294] [hpci5201:103648:0] ib_device.c:1250 UCX ERROR ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:192.160.0.55 sgid_index=3 traffic_class=106) for UD verbs connect on bnxt_re0 failed: Connection timed out
[hpci5201:103648] pml_ucx.c:419 Error: ucp_ep_create(proc=1) failed: Endpoint timeout
[hpci5201:103648] pml_ucx.c:472 Error: Failed to resolve UCX endpoint for rank 1
Traceback (most recent call last):
  File "/dev/shm/data/hugectl/train.py", line 344, in <module>
    model = hugectr.Model(solver, reader, optimizer)
RuntimeError: Runtime error: MPI_ERR_OTHER: known error not in list
MPI_Bcast(&seed, 1, (static_cast<MPI_Datatype> (static_cast<void *> (&(ompi_mpi_unsigned_long_long)))), 0, (static_cast<MPI_Comm> (static_cast<void *> (&(ompi_mpi_comm_world))))) at create (/workspace/dlrm/hugectr/HugeCTR/src/resource_managers/resource_manager_ext.cpp:39)
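The traceback shows the failure happening in the MPI_Bcast that HugeCTR issues while constructing its resource manager, after UCX fails to create an endpoint between the two nodes. As a sanity check of the MPI/UCX fabric independent of HugeCTR, a minimal broadcast can be run with the same launcher. This is only a diagnostic sketch, not part of the benchmark code, and it assumes mpi4py is installed in the training container; launch it with the same mpirun/srun command line used for train.py (one rank per node is enough).

    # mpi_bcast_check.py: minimal MPI broadcast, mirroring the seed
    # broadcast that fails inside resource_manager_ext.cpp.
    # Assumes mpi4py is available in the container (an assumption,
    # not part of the original benchmark setup).
    import socket
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Rank 0 broadcasts a value to every other rank over the same
    # UCX-backed communicator that HugeCTR uses.
    seed = 12345 if rank == 0 else None
    seed = comm.bcast(seed, root=0)
    print(f"rank {rank} on {socket.gethostname()} got seed {seed}", flush=True)

If this minimal broadcast also hangs or times out with a UCX endpoint error, the ibv_create_ah failure on bnxt_re0 above most likely points at the inter-node RoCE/InfiniBand configuration rather than at HugeCTR itself.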

To Reproduce
Steps to reproduce the behavior:

  1. How to build including docker pull & docker run commands
  2. How to run including the JSON config file used

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

Environment (please complete the following information):

  • OS: [e.g. Ubuntu xx.yy]
  • Graphic card: [e.g. a single NVIDIA V100 or NVIDIA DGX A100]
  • CUDA version: [e.g. CUDA 11.x]
  • Docker image

Additional context

EmmaQiaoCh · Oct 25 '23

Hi RaghavendraChari, I can't reproduce this error on 2 nodes in our cluster, even when I build the image from the 'training_results_v3.0' repo. Could you provide detailed reproduction steps? How did you build the image? Which configuration did you use? Which GPUs did you use? Thanks!

EmmaQiaoCh · Oct 26 '23