
Error launching model on Triton on multi-GPU nodes

sujituk opened this issue 9 months ago

Background: Set up a GKE node pool with 2 H100 nodes (8 GPUs each) and the required NFS storage. Trying to serve the Llama3 405B model after checkpoint conversion and building the TRT-LLM engine.

Environment:
Triton image: nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3
TensorRT-LLM backend version: 0.16
mpirun (Open MPI) 4.1.5rc2

Issue: Launching on the leader node with the following command:

python3 /var/run/models/tensorrtllm_backend/scripts/launch_triton_server.py --model_repo=/var/run/models/tensorrtllm_backend/triton_model_repo --world_size 16
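Since the server log below mentions --gpu_per_node, the explicit form for two 8-GPU nodes would presumably be the following (a sketch only; I believe the default of 8 already matches this topology):

# Hypothetical explicit invocation: world size 16 = 2 nodes x 8 GPUs per node.
python3 /var/run/models/tensorrtllm_backend/scripts/launch_triton_server.py --model_repo=/var/run/models/tensorrtllm_backend/triton_model_repo --world_size 16 --gpu_per_node 8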

It fails with the following error:


Triton model repository is at:'/var/run/models/tensorrtllm_backend/triton_model_repo'
Server is assuming each node has 8 GPUs. To change this, use --gpu_per_node
Executing Leader (world size: 16)
Begin waiting for worker pods.

kubectl get pods -n default -l leaderworkerset.sigs.k8s.io/group-key=<redacted> --field-selector status.phase=Running -o jsonpath='{.items[*].metadata.name}'
'triton-trtllm-0 triton-trtllm-0-1'
2 of 2.0 workers ready.

ORTE was unable to reliably start one or more daemons. This usually is caused by:

  • not finding the required libraries and/or binaries on one or more nodes. Please check your PATH and LD_LIBRARY_PATH settings, or configure OMPI with --enable-orterun-prefix-by-default

  • lack of authority to execute on one or more specified nodes. Please verify your allocation and authorities.

  • the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base). Please check with your sys admin to determine the correct location to use.

  • compilation of the orted with dynamic libraries when static are required (e.g., on Cray). Please check your configure cmd line and consider using one of the contrib/platform definitions for your system type.

  • an inability to create a connection back to mpirun due to a lack of common network interfaces and/or no route found between them. Please check network connectivity (including firewalls and network routing requirements).


[triton-trtllm-0:00238] Job UNKNOWN has launched
[triton-trtllm-0:00238] [[41321,0],0] Releasing job data for [41321,1]
[triton-trtllm-0:00238] sess_dir_finalize: proc session dir does not exist
[triton-trtllm-0:00238] sess_dir_finalize: job session dir does not exist
[triton-trtllm-0:00238] sess_dir_finalize: jobfam session dir does not exist
[triton-trtllm-0:00238] sess_dir_finalize: jobfam session dir does not exist
[triton-trtllm-0:00238] sess_dir_finalize: top session dir does not exist
[triton-trtllm-0:00238] sess_dir_cleanup: job session dir does not exist
[triton-trtllm-0:00238] sess_dir_cleanup: top session dir does not exist
[triton-trtllm-0:00238] [[41321,0],0] Releasing job data for [41321,0]
[triton-trtllm-0:00238] sess_dir_cleanup: job session dir does not exist
[triton-trtllm-0:00238] sess_dir_cleanup: top session dir does not exist
exiting with status 1
Waiting 15 second before exiting.
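For what it's worth, a minimal multi-node smoke test might isolate whether the problem is ORTE daemon launch rather than Triton itself. This is only a sketch: the pod hostnames are taken from the readiness check above, and the slot counts (8 per pod) are assumptions.

# Hypothetical check, run from the leader pod: spawn one process per pod.
# If this fails with the same ORTE error, daemon launch between the pods is broken.
mpirun --host triton-trtllm-0:8,triton-trtllm-0-1:8 -np 2 --map-by node hostname

If hostname does not come back from both pods, the failure is independent of the TRT-LLM engine.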

Launched on the non-leader node:

Triton model repository is at:'/var/run/models/tensorrtllm_backend/triton_model_repo'
Worker paused awaiting SIGINT or SIGTERM.

Verified: mpirun is on the PATH.

Question: mpirun works fine on a single node. Is there any configuration that needs to be done when mpirun spans multiple nodes?
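One thing that might be relevant for container setups like this (a sketch only, not verified here; the /opt/hpcx/ompi prefix is an assumption about where Open MPI lives in the Triton image) is pointing the remote orted at the MPI install and forwarding the environment, as the ORTE error message suggests:

# --prefix tells the remote orted where Open MPI is installed (assumed path).
# -x forwards the named environment variables to the remote daemons.
mpirun --prefix /opt/hpcx/ompi -x PATH -x LD_LIBRARY_PATH --host triton-trtllm-0:8,triton-trtllm-0-1:8 -np 16 <command>

Should launch_triton_server.py be passing something equivalent when the world size spans nodes?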

sujituk, Feb 07 '25 23:02