
v0.11.0 release fails when TP>1

Open · daulet opened this issue 1 year ago

System Info

  • CPU: x86_64
  • GPUs: 8x H100 80GB HBM3
  • Driver: 550.90.07
  • CUDA: 12.4
  • TensorRT-LLM: v0.11.0
  • tensorrtllm_backend: v0.11.0

Who can help?

@kaiyux

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

Build an engine with TP>1 (so far I've tried GPT-2, Llama 3 8B, and Llama 3.1 8B with TP=2 and TP=4) and start Triton with the inflight-batching config. The issue appears to be specific to TP>1 and this backend, because I've ruled out the following:

  • a TP=1 engine boots successfully with the same config (I only had to change gpu_device_ids);
  • a TP>1 engine can be loaded and generates output via the TensorRT-LLM examples/run.py script (see the sketch below);

Hence it is likely an issue with tensorrtllm_backend + MPI.
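
For reference, roughly what I ran (a minimal sketch based on the Llama example; model paths are placeholders and exact flag names may differ slightly between versions):

# Convert the HF checkpoint to a TP=2 TensorRT-LLM checkpoint
python3 examples/llama/convert_checkpoint.py \
    --model_dir ./Meta-Llama-3-8B-Instruct \
    --output_dir ./ckpt_tp2 \
    --dtype float16 \
    --tp_size 2

# Build the engines (one rank per GPU); limits match the logs below
trtllm-build \
    --checkpoint_dir ./ckpt_tp2 \
    --output_dir ./engines_tp2 \
    --gemm_plugin float16 \
    --max_batch_size 64 \
    --max_num_tokens 4096

# Sanity check outside Triton: this works with TP=2
mpirun -n 2 --allow-run-as-root \
    python3 examples/run.py \
        --engine_dir ./engines_tp2 \
        --tokenizer_dir ./Meta-Llama-3-8B-Instruct \
        --max_output_len 64 \
        --input_text "Hello"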

Expected behavior

The server should boot up successfully: I'm using a supported model, an official release, and no customizations.

Actual behavior

[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 4096
[TensorRT-LLM][INFO] TRTGptModel computeContextLogits: 0
[TensorRT-LLM][INFO] TRTGptModel computeGenerationLogits: 0
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 4096
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 4095 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] Capacity Scheduler Policy: MAX_UTILIZATION
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][WARNING] The value of maxAttentionWindow cannot exceed mMaxSequenceLen. Therefore, it has been adjusted to match the value of mMaxSequenceLen.
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 64
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 4096
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 4096
[TensorRT-LLM][INFO] TRTGptModel computeContextLogits: 0
[TensorRT-LLM][INFO] TRTGptModel computeGenerationLogits: 0
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 4096
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 4095 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] Capacity Scheduler Policy: MAX_UTILIZATION
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 3844 MiB
[TensorRT-LLM][INFO] Loaded engine size: 3849 MiB
[TensorRT-LLM][INFO] Loaded engine size: 3845 MiB
[TensorRT-LLM][INFO] Loaded engine size: 3844 MiB
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[192-222-52-240:00289] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[192-222-52-240:00289] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[192-222-52-240:00289] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[192-222-52-240:00289] 3 more processes have sent help message help-mpi-api.txt / mpi-abort
[192-222-52-240:00289] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Triton logs:

I0812 04:47:11.717997 43252 pinned_memory_manager.cc:275] "Pinned memory pool is created at '0x7f9500000000' with size 268435456"
I0812 04:47:11.757600 43252 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I0812 04:47:11.757689 43252 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 1 with size 67108864"
I0812 04:47:11.757760 43252 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 2 with size 67108864"
I0812 04:47:11.760211 43252 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 3 with size 67108864"
I0812 04:47:13.811956 43252 model_lifecycle.cc:472] "loading: tensorrt_llm:1"

Additional notes

It seems TP is simply broken in the v0.11 release, since I've set everything up according to the documented steps; my launch flow is sketched below.
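
For completeness, the launch itself followed the documented flow, roughly (a sketch; the model repo path is a placeholder, and gpu_device_ids in tensorrt_llm/config.pbtxt was set to match the TP ranks, e.g. 0,1 for TP=2):

# world_size must equal TP * PP of the built engine
python3 scripts/launch_triton_server.py \
    --world_size 2 \
    --model_repo /path/to/triton_model_repo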

daulet · Aug 12 '24 17:08

It seems this is not fixed yet; I am still experiencing the same issue with v0.17.

jasonngap1 · Feb 27 '25 01:02

Same with v0.19

thakkar2804 · Jun 26 '25 06:06