HugeCTR
Support for configuration issues
This issue concerns MLPerf dlrmv2. The original report is at: https://github.com/mlcommons/training_results_v3.0/issues/5
Describe the bug
Hi, I am trying to bring up a multi-node GPU HugeCTR training benchmark using the code at https://github.com/mlcommons/training_results_v3.0/tree/main/NVIDIA/benchmarks/dlrm_dcnv2/implementations/hugectr
On a single node I can run the benchmark, but when I run it on multiple nodes (for example, 2 nodes) I hit the error shown below. Could you please help me resolve it?
[HCTR][17:28:15.456][WARNING][RK0][main]: The model name is not specified when creating the solver.
[1695144496.484294] [hpci5201:103648:0] ib_device.c:1250 UCX ERROR ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:192.160.0.55 sgid_index=3 traffic_class=106) for UD verbs connect on bnxt_re0 failed: Connection timed out
[hpci5201:103648] pml_ucx.c:419 Error: ucp_ep_create(proc=1) failed: Endpoint timeout
[hpci5201:103648] pml_ucx.c:472 Error: Failed to resolve UCX endpoint for rank 1
Traceback (most recent call last):
  File "/dev/shm/data/hugectl/train.py", line 344, in
    model = hugectr.Model(solver, reader, optimizer)
RuntimeError: Runtime error: MPI_ERR_OTHER: known error not in list
    MPI_Bcast(&seed, 1, (static_cast<MPI_Datatype> (static_cast<void *> (&(ompi_mpi_unsigned_long_long)))), 0, (static_cast<MPI_Comm> (static_cast<void *> (&(ompi_mpi_comm_world))))) at create (/workspace/dlrm/hugectr/HugeCTR/src/resource_managers/resource_manager_ext.cpp:39)
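For anyone triaging a similar failure, the UCX error line in the log above carries the details that matter for the inter-node connection: the device (bnxt_re0), the destination GID (an IPv4-mapped address) and the GID index. The following is only a rough sketch of pulling those fields out of such a line with the Python standard library. The function name `parse_ucx_ah_error` and the regular expression are hypothetical helpers written for this thread, not part of HugeCTR, UCX, or Open MPI, and the pattern is only known to match the one line quoted here.

```python
import re

# The UCX error line copied from the log in this report.
UCX_LINE = (
    "[1695144496.484294] [hpci5201:103648:0] ib_device.c:1250 UCX ERROR "
    "ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 "
    "dgid=::ffff:192.160.0.55 sgid_index=3 traffic_class=106) "
    "for UD verbs connect on bnxt_re0 failed: Connection timed out"
)

# Hypothetical pattern: captures the ibv_create_ah argument list, the device
# name after " on ", and the failure reason after "failed: ".
AH_PATTERN = re.compile(
    r"ibv_create_ah\((?P<args>[^)]*)\).*? on (?P<device>\S+) failed: (?P<reason>.+)$"
)


def parse_ucx_ah_error(line):
    """Return the ibv_create_ah fields as a dict, or None if the line does not match."""
    match = AH_PATTERN.search(line)
    if match is None:
        return None
    # The argument list is a run of space-separated key=value pairs.
    fields = dict(pair.split("=", 1) for pair in match.group("args").split())
    fields["device"] = match.group("device")
    fields["reason"] = match.group("reason").strip()
    return fields


if __name__ == "__main__":
    info = parse_ucx_ah_error(UCX_LINE)
    print(info["device"], info["dgid"], info["sgid_index"], info["reason"])
```

Running the same extraction on the logs of each node would make it easier to compare which device and GID index every rank tried to use, which is the kind of detail asked for in the reply below.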
To Reproduce
Steps to reproduce the behavior:
- How to build including docker pull & docker run commands
- How to run including the JSON config file used
Expected behavior
A clear and concise description of what you expected to happen.
Screenshots
If applicable, add screenshots to help explain your problem.
Environment (please complete the following information):
- OS: [e.g. Ubuntu xx.yy]
- Graphic card: [e.g. a single NVIDIA V100 or NVIDIA DGX A100]
- CUDA version: [e.g. CUDA 11.x]
- Docker image
Additional context
Hi RaghavendraChari, I can't reproduce this error on 2 nodes in our cluster, even with an image built from the 'training_results_v3.0' repo. Could you provide detailed reproduction steps? How did you build the image? What configuration did you use? Which GPUs did you use? Thanks!