How to serve multiple TensorRT-LLM models in the same process / server?
Hi there! I'm trying to serve multiple TensorRT-LLM models from Python and I'm wondering what the recommended approach is. I've tried / considered:
- GenerationSession: I tried instantiating two GenerationSession objects and running inference against both sessions by sending each session one request at a time (i.e. both sessions are processing only a single request, but the sessions are running concurrently), but I ran into errors. Not sure if this is expected.
- GptManager: If I understand correctly, GptManager runs a generation loop for a single model only, so a single Python process can only support one model.
- Triton Inference Server's TensorRT-LLM backend: It looks like the backend only supports serving one model per server, as it uses GptManager internally.
Is it possible to serve multiple TensorRT-LLM models in the same process / server? Or do I need to host TensorRT-LLM models on separate processes / servers?
We are working on adding support for multiple models in the Triton backend using MPI processes.
A similar approach could be used to implement multi-model support with one GptManager per process.
Multi-model support was part of the v0.9 release. See https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#launch-triton-server and the section regarding the --multi-model option.
@achartier If I understand correctly from:
> When using the --multi-model option, the Triton model repository can contain multiple TensorRT-LLM models. When running multiple TensorRT-LLM models, the gpu_device_ids parameter should be specified in the models config.pbtxt configuration files. It is up to you to ensure there is no overlap between allocated GPU IDs.
If I want to deploy 4 different LLM models using Triton, do I need a server with 4 GPUs, since there must be no overlap between allocated GPU IDs?
Ideally, yes. The TRT-LLM Triton backend does not check for overlap, so it will let you deploy multiple models on a single GPU, but you'll need to adjust the KV cache size to ensure there is enough device memory for each model, and this is not a supported use case.
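For concreteness, here is a minimal sketch of the relevant per-model `config.pbtxt` parameters when pinning two TensorRT-LLM model entries to different GPUs. The model directory names, engine paths, and memory fraction are placeholders; `gpt_model_path`, `gpu_device_ids`, and `kv_cache_free_gpu_mem_fraction` are the backend parameters documented in the tensorrtllm_backend README, and the rest of each config.pbtxt is omitted.

```
# triton_model_repo/tensorrt_llm_model_a/config.pbtxt (name and paths are placeholders)
parameters: {
  key: "gpt_model_path"
  value: { string_value: "/engines/model_a/1-gpu" }
}
parameters: {
  key: "gpu_device_ids"
  value: { string_value: "0" }   # pin this model to GPU 0
}
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: { string_value: "0.9" } # reduce this if models end up sharing a GPU
}

# triton_model_repo/tensorrt_llm_model_b/config.pbtxt
parameters: {
  key: "gpt_model_path"
  value: { string_value: "/engines/model_b/1-gpu" }
}
parameters: {
  key: "gpu_device_ids"
  value: { string_value: "1" }   # no overlap with model_a's GPU IDs
}
```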
@achartier Do we have an example of how to serve multiple TRT-LLM models using Triton, e.g. deploying two LLM models?
Yes, see the link to the documentation in my April 16 message.
Do you still have any further issues or questions? If not, we'll close this soon.
> Ideally, yes. The TRT-LLM Triton backend does not check for overlap, so it will let you deploy multiple models on a single GPU, but you'll need to adjust the KV cache size to ensure there is enough device memory for each model, and this is not a supported use case.
@nv-guomingz If I deploy 4 models on a single GPU, then in addition to adjusting the KV cache size, do we also need to reserve 4x the GPU memory buffer for forward inference? Is that correct? What parameters control the size of the KV cache and the forward-inference GPU memory buffer?
> What parameters control the size of the KV cache and the forward-inference GPU memory buffer?
Using the executor API, this is controlled by the KvCacheConfig class: https://github.com/NVIDIA/TensorRT-LLM/blob/548b5b73106aaf7374955e1c37aad677678ebc7b/cpp/include/tensorrt_llm/executor/executor.h#L859
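As a rough sketch (not an official example), assuming the Python executor bindings (`tensorrt_llm.bindings.executor`) mirror the C++ KvCacheConfig class linked above, limiting the KV cache per model could look like this; the engine path and the 0.2 fraction are placeholder values:

```python
import tensorrt_llm.bindings.executor as trtllm

# Bound how much free device memory this model's KV cache may claim.
# With several models on one GPU, each model's fraction must be small
# enough that all the caches plus activation/runtime buffers fit.
kv_cache_config = trtllm.KvCacheConfig(
    free_gpu_memory_fraction=0.2,  # placeholder value for ~4 models per GPU
    # max_tokens=...,              # alternatively, cap the cache by token count
)

executor_config = trtllm.ExecutorConfig(kv_cache_config=kv_cache_config)

# Placeholder engine path; one Executor instance per model.
executor = trtllm.Executor(
    "/engines/model_a/1-gpu",
    trtllm.ModelType.DECODER_ONLY,
    executor_config,
)
```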
> Yes, see the link to the documentation in my April 16 message.
This link shows multimodal models (models handling text, vision, etc.). But does the Triton + TRT-LLM backend support serving multiple TRT-LLM models, e.g. a first model (Llama 3.2) with 2 engine files, a second model (Deepseek v3) with 4 engine files, a third model (Qwen2) with one engine file, etc., all under one endpoint? I have 12 RTX 3090/4090 GPUs.