tensorrtllm_backend
Error launching model on Triton on multi-GPU nodes
Background: Set up a GKE node pool with 2 H100 nodes (8 GPUs each) and the required NFS storage. Trying to serve the Llama 3 405B model after checkpoint conversion and building the TRT-LLM engine.
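Roughly, the conversion and engine build looked like the following; the paths and the TP x PP split are illustrative (world size 16 = tp_size 8 x pp_size 2), using the standard TensorRT-LLM Llama example scripts:

# Sketch only -- paths and the parallelism split are illustrative.
python3 examples/llama/convert_checkpoint.py \
    --model_dir /var/run/models/Llama-3.1-405B \
    --output_dir /var/run/models/llama3-405b-ckpt \
    --dtype bfloat16 \
    --tp_size 8 \
    --pp_size 2
trtllm-build \
    --checkpoint_dir /var/run/models/llama3-405b-ckpt \
    --output_dir /var/run/models/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1 \
    --gemm_plugin bfloat16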
Environment:
Triton image: nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3
Triton version: 0.16
mpirun (Open MPI) 4.1.5rc2
Issue:
Launching on the leader node with:
python3 /var/run/models/tensorrtllm_backend/scripts/launch_triton_server.py --model_repo=/var/run/models/tensorrtllm_backend/triton_model_repo --world_size 16
it fails with the following error:
Triton model repository is at:'/var/run/models/tensorrtllm_backend/triton_model_repo'
Server is assuming each node has 8 GPUs. To change this, use --gpu_per_node
Executing Leader (world size: 16)
Begin waiting for worker pods.
kubectl get pods -n default -l leaderworkerset.sigs.k8s.io/group-key=<redacted> --field-selector status.phase=Running -o jsonpath='{.items[*].metadata.name}'
'triton-trtllm-0 triton-trtllm-0-1'
2 of 2.0 workers ready.
ORTE was unable to reliably start one or more daemons. This usually is caused by:
- not finding the required libraries and/or binaries on one or more nodes. Please check your PATH and LD_LIBRARY_PATH settings, or configure OMPI with --enable-orterun-prefix-by-default
- lack of authority to execute on one or more specified nodes. Please verify your allocation and authorities.
- the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base). Please check with your sys admin to determine the correct location to use.
- compilation of the orted with dynamic libraries when static are required (e.g., on Cray). Please check your configure cmd line and consider using one of the contrib/platform definitions for your system type.
- an inability to create a connection back to mpirun due to a lack of common network interfaces and/or no route found between them. Please check network connectivity (including firewalls and network routing requirements).
[triton-trtllm-0:00238] Job UNKNOWN has launched
[triton-trtllm-0:00238] [[41321,0],0] Releasing job data for [41321,1]
[triton-trtllm-0:00238] sess_dir_finalize: proc session dir does not exist
[triton-trtllm-0:00238] sess_dir_finalize: job session dir does not exist
[triton-trtllm-0:00238] sess_dir_finalize: jobfam session dir does not exist
[triton-trtllm-0:00238] sess_dir_finalize: jobfam session dir does not exist
[triton-trtllm-0:00238] sess_dir_finalize: top session dir does not exist
[triton-trtllm-0:00238] sess_dir_cleanup: job session dir does not exist
[triton-trtllm-0:00238] sess_dir_cleanup: top session dir does not exist
[triton-trtllm-0:00238] [[41321,0],0] Releasing job data for [41321,0]
[triton-trtllm-0:00238] sess_dir_cleanup: job session dir does not exist
[triton-trtllm-0:00238] sess_dir_cleanup: top session dir does not exist
exiting with status 1
Waiting 15 second before exiting.
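To isolate whether this is an MPI launch problem rather than a Triton one, a cross-node smoke test can be run from the leader pod (sketch only; pod names taken from the kubectl output above, 8 slots per node):

# Launch one trivial process per GPU slot on each pod.
mpirun --allow-run-as-root --host triton-trtllm-0:8,triton-trtllm-0-1:8 -np 16 hostname

If this reproduces the same ORTE daemon error, the failure is in Open MPI's remote launch of orted, not in Triton or TRT-LLM.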
Launching on the non-leader node gives:
Triton model repository is at:'/var/run/models/tensorrtllm_backend/triton_model_repo'
Worker paused awaiting SIGINT or SIGTERM.
Verified: mpirun is on the PATH.
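Note that mpirun being on the PATH locally is not sufficient for multi-node runs: Open MPI 4.x by default starts its remote daemon (orted) over ssh/rsh, so the remote pod must accept passwordless ssh and expose orted on the non-interactive PATH. Some hedged checks (the pod name is illustrative, and /opt/hpcx/ompi is only an assumption about where the Triton image installs Open MPI):

# Is orted reachable on the remote pod via a non-interactive shell?
which orted
ssh triton-trtllm-0-1 which orted
# Forward the Open MPI install location to the remote orted explicitly:
mpirun --prefix /opt/hpcx/ompi --allow-run-as-root --host triton-trtllm-0:8,triton-trtllm-0-1:8 -np 16 hostname

The --prefix flag makes mpirun pass its install location to the remote daemons, which is the same remedy the error message hints at with "configure OMPI with --enable-orterun-prefix-by-default".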
Question:
mpirun works fine on a single node. Is there any configuration that needs to be done when mpirun spans multiple nodes?