
Error launching model on Triton on multi-GPU nodes

sujituk opened this issue 9 months ago

Background: Set up a GKE node pool with 2 H100 nodes (8 GPUs each) and the required NFS storage. Trying to serve the Llama3 405B model after checkpoint conversion and building the TRT-LLM engine.

Environment:
Triton image: nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3
TensorRT-LLM backend version: 0.16
mpirun (Open MPI) 4.1.5rc2

Issue: Launching on the leader node with the following command:

python3 /var/run/models/tensorrtllm_backend/scripts/launch_triton_server.py --model_repo=/var/run/models/tensorrtllm_backend/triton_model_repo --world_size 16
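Since the server log below mentions --gpu_per_node, the explicit form for two 8-GPU nodes would presumably be the following (a sketch only; I believe the default of 8 already matches this topology):

# Hypothetical explicit invocation: world size 16 = 2 nodes x 8 GPUs per node.
python3 /var/run/models/tensorrtllm_backend/scripts/launch_triton_server.py --model_repo=/var/run/models/tensorrtllm_backend/triton_model_repo --world_size 16 --gpu_per_node 8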

It fails with the following error:


Triton model repository is at:'/var/run/models/tensorrtllm_backend/triton_model_repo'
Server is assuming each node has 8 GPUs. To change this, use --gpu_per_node
Executing Leader (world size: 16)
Begin waiting for worker pods.

kubectl get pods -n default -l leaderworkerset.sigs.k8s.io/group-key=<redacted> --field-selector status.phase=Running -o jsonpath='{.items[*].metadata.name}'
'triton-trtllm-0 triton-trtllm-0-1'
2 of 2.0 workers ready.

ORTE was unable to reliably start one or more daemons. This usually is caused by:

  • not finding the required libraries and/or binaries on one or more nodes. Please check your PATH and LD_LIBRARY_PATH settings, or configure OMPI with --enable-orterun-prefix-by-default

  • lack of authority to execute on one or more specified nodes. Please verify your allocation and authorities.

  • the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base). Please check with your sys admin to determine the correct location to use.

  • compilation of the orted with dynamic libraries when static are required (e.g., on Cray). Please check your configure cmd line and consider using one of the contrib/platform definitions for your system type.

  • an inability to create a connection back to mpirun due to a lack of common network interfaces and/or no route found between them. Please check network connectivity (including firewalls and network routing requirements).


[triton-trtllm-0:00238] Job UNKNOWN has launched
[triton-trtllm-0:00238] [[41321,0],0] Releasing job data for [41321,1]
[triton-trtllm-0:00238] sess_dir_finalize: proc session dir does not exist
[triton-trtllm-0:00238] sess_dir_finalize: job session dir does not exist
[triton-trtllm-0:00238] sess_dir_finalize: jobfam session dir does not exist
[triton-trtllm-0:00238] sess_dir_finalize: jobfam session dir does not exist
[triton-trtllm-0:00238] sess_dir_finalize: top session dir does not exist
[triton-trtllm-0:00238] sess_dir_cleanup: job session dir does not exist
[triton-trtllm-0:00238] sess_dir_cleanup: top session dir does not exist
[triton-trtllm-0:00238] [[41321,0],0] Releasing job data for [41321,0]
[triton-trtllm-0:00238] sess_dir_cleanup: job session dir does not exist
[triton-trtllm-0:00238] sess_dir_cleanup: top session dir does not exist
exiting with status 1
Waiting 15 second before exiting.
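For what it's worth, a minimal multi-node smoke test might isolate whether the problem is ORTE daemon launch rather than Triton itself. This is only a sketch: the pod hostnames are taken from the readiness check above, and the slot counts (8 per pod) are assumptions.

# Hypothetical check, run from the leader pod: spawn one process per pod.
# If this fails with the same ORTE error, daemon launch between the pods is broken.
mpirun --host triton-trtllm-0:8,triton-trtllm-0-1:8 -np 2 --map-by node hostname

If hostname does not come back from both pods, the failure is independent of the TRT-LLM engine.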

Launched on the non-leader node:

Triton model repository is at:'/var/run/models/tensorrtllm_backend/triton_model_repo'
Worker paused awaiting SIGINT or SIGTERM.

Verified: mpirun is on the PATH.

Question: mpirun works fine on a single node. Is there any configuration that needs to be done when mpirun spans multiple nodes?
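One thing that might be relevant for container setups like this (a sketch only, not verified here; the /opt/hpcx/ompi prefix is an assumption about where Open MPI lives in the Triton image) is pointing the remote orted at the MPI install and forwarding the environment, as the ORTE error message suggests:

# --prefix tells the remote orted where Open MPI is installed (assumed path).
# -x forwards the named environment variables to the remote daemons.
mpirun --prefix /opt/hpcx/ompi -x PATH -x LD_LIBRARY_PATH --host triton-trtllm-0:8,triton-trtllm-0-1:8 -np 16 <command>

Should launch_triton_server.py be passing something equivalent when the world size spans nodes?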

sujituk, Feb 07 '25 23:02