How to serve multiple TensorRT-LLM models in the same process / server?
Hi there! I'm trying to serve multiple TensorRT-LLM models from Python and I'm wondering what the recommended approach is. I've tried / considered:
- GenerationSession: I tried instantiating two GenerationSession objects and running inference against both sessions by sending each session one request at a time (i.e. both sessions are processing only a single request, but the sessions are running concurrently), but I ran into errors. Not sure if this is expected.
- GptManager: If I understand correctly, GptManager runs a generation loop for a single model only, so a single Python process can only support one model.
- Triton Inference Server's TensorRT-LLM backend: It looks like the backend only supports serving one model per server, as it uses GptManager internally.
Is it possible to serve multiple TensorRT-LLM models in the same process / server? Or do I need to host TensorRT-LLM models on separate processes / servers?
We are working on adding support for multiple models in the Triton backend using MPI processes.
A similar approach could be used to implement multi-model support with one GptManager per process.
Multi-model support was part of the v0.9 release. See https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#launch-triton-server and the section regarding the --multi-model option.
@achartier If I understand correctly from:
> When using the --multi-model option, the Triton model repository can contain multiple TensorRT-LLM models. When running multiple TensorRT-LLM models, the gpu_device_ids parameter should be specified in the models config.pbtxt configuration files. It is up to you to ensure there is no overlap between allocated GPU IDs.
If I want to deploy 4 different LLM models using Triton, do I need a server with 4 GPUs, since there must be no overlap between allocated GPU IDs?
Ideally, yes. The TRT-LLM Triton backend does not check for overlap, so it will let you deploy multiple models on a single GPU, but you'll need to adjust the KV cache size to ensure there is enough device memory for each model, and this is not a supported use case.
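For concreteness, here is a minimal sketch of the relevant per-model `config.pbtxt` parameters when pinning two TensorRT-LLM model entries to different GPUs. The model directory names, engine paths, and memory fraction are placeholders; `gpt_model_path`, `gpu_device_ids`, and `kv_cache_free_gpu_mem_fraction` are the backend parameters documented in the tensorrtllm_backend README, and the rest of each config.pbtxt is omitted.

```
# triton_model_repo/tensorrt_llm_model_a/config.pbtxt (name and paths are placeholders)
parameters: {
  key: "gpt_model_path"
  value: { string_value: "/engines/model_a/1-gpu" }
}
parameters: {
  key: "gpu_device_ids"
  value: { string_value: "0" }   # pin this model to GPU 0
}
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: { string_value: "0.9" } # reduce this if models end up sharing a GPU
}

# triton_model_repo/tensorrt_llm_model_b/config.pbtxt
parameters: {
  key: "gpt_model_path"
  value: { string_value: "/engines/model_b/1-gpu" }
}
parameters: {
  key: "gpu_device_ids"
  value: { string_value: "1" }   # no overlap with model_a's GPU IDs
}
```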
@achartier Do we have an example of how to serve multiple TRT-LLM models using Triton, e.g. deploying two LLM models?
Yes, see the link to the documentation in my April 16 message.
Do you still have any further issues or questions? If not, we'll close this soon.
> Ideally, yes. The TRT-LLM Triton backend does not check for overlap, so it will let you deploy multiple models on a single GPU, but you'll need to adjust the KV cache size to ensure there is enough device memory for each model, and this is not a supported use case.
@nv-guomingz If I deploy 4 models on a single GPU, then in addition to adjusting the KV cache size, do we also need to reserve 4x the GPU memory buffer for forward inference? Is that correct? What parameters control the size of the KV cache and the forward-inference GPU memory buffer?
> What parameters control the size of the KV cache and the forward-inference GPU memory buffer?
Using the executor API, this is controlled by the KvCacheConfig class: https://github.com/NVIDIA/TensorRT-LLM/blob/548b5b73106aaf7374955e1c37aad677678ebc7b/cpp/include/tensorrt_llm/executor/executor.h#L859
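As a rough sketch (not an official example), assuming the Python executor bindings (`tensorrt_llm.bindings.executor`) mirror the C++ KvCacheConfig class linked above, limiting the KV cache per model could look like this; the engine path and the 0.2 fraction are placeholder values:

```python
import tensorrt_llm.bindings.executor as trtllm

# Bound how much free device memory this model's KV cache may claim.
# With several models on one GPU, each model's fraction must be small
# enough that all the caches plus activation/runtime buffers fit.
kv_cache_config = trtllm.KvCacheConfig(
    free_gpu_memory_fraction=0.2,  # placeholder value for ~4 models per GPU
    # max_tokens=...,              # alternatively, cap the cache by token count
)

executor_config = trtllm.ExecutorConfig(kv_cache_config=kv_cache_config)

# Placeholder engine path; one Executor instance per model.
executor = trtllm.Executor(
    "/engines/model_a/1-gpu",
    trtllm.ModelType.DECODER_ONLY,
    executor_config,
)
```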
> Yes, see the link to the documentation in my April 16 message.
This link shows multimodal models (models handling text, vision, etc.). But does the Triton + TRT-LLM backend support serving multiple TRT-LLM models, e.g. a first model (Llama 3.2) with 2 engine files, a second model (Deepseek v3) with 4 engine files, a third model (Qwen2) with one engine file, etc., all under one endpoint? I have 12 RTX 3090/4090 GPUs.