tensorrtllm_backend
Can I use the Triton Server TensorRT-LLM backend to host other TensorRT-built models? If not, what do you suggest if our model stack is a mix of LLM and non-LLM models?
Hello,
Our current model stack consists of a set of models built with TensorRT, plus the Whisper ASR model.
I'd like to use Triton Server to host all of these models. Since Whisper can be converted with TensorRT-LLM, I tried to host everything on Triton Server with the TensorRT-LLM backend, and I'm seeing this error:
E1220 20:46:48.038714 1 model_lifecycle.cc:621] failed to load 'fer_2' version 1: Invalid argument: unable to find 'libtriton_tensorrt.so' or 'tensorrt/model.py' for model 'fer_2', in /opt/tritonserver/backends/tensorrt
I think this suggests that the Triton Server TensorRT-LLM backend does not support TensorRT models. Is that the case?
If so, what should I do? What do you recommend if we have a large model stack with a mix of LLM and non-LLM models?
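For reference, a quick way to see which backends actually ship inside the running container (a minimal sketch; the directory comes from the error message above):

```python
import os

# The error above looks under /opt/tritonserver/backends/tensorrt, so listing the
# parent directory shows which backends this container actually ships with.
# In the TRT-LLM container this typically does NOT include a 'tensorrt' entry.
BACKENDS_DIR = "/opt/tritonserver/backends"
print(sorted(os.listdir(BACKENDS_DIR)))
```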
Thank you.
Hi,
I understand you might want to provide a full response, but if possible, for now I really just need a quick confirmation of whether the Triton Server TensorRT-LLM backend does not support TensorRT models.
Also, if that's the case, some general advice would help me know what to plan for next year, both team- and infrastructure-wise:
- If you plan to release a general Triton Server image that includes the TensorRT-LLM backend soon, then I will just wait.
- Otherwise, I can try to convert Whisper to a plain TensorRT engine. I foresee two challenges: 1. the KV cache, and 2. beam search. I see you have examples for converting the T5 model, so I suppose I can replicate that for Whisper.
- Or I could potentially convert our whole model stack to TensorRT-LLM engines? I mean, a conv layer is a conv layer, whether it sits in a TensorRT engine or a TensorRT-LLM engine.
Thank you
@zmy1116 Hi,
The error message mentioning libtriton_tensorrt.so indicates that you are trying to use the TensorRT backend to serve a specific model, and in the TensorRT-LLM backend repo we haven't provided direct TensorRT backend support.
May I understand more of your concrete scenario? You have a bunch of models to be served, some of them with TensorRT-LLM, while the others you still want to serve with the TensorRT backend? Is that the case?
Is it possible for you to use two Triton backends in your production environment, i.e. the TensorRT-LLM backend and the TensorRT backend, for example using two different Docker images for deploying the different models?
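To make the two-image idea concrete, here is a minimal sketch using the Docker SDK for Python; the image tags, host paths, GPU splits, and port mappings below are placeholders for illustration, not exact names from this repo:

```python
import docker
from docker.types import DeviceRequest

client = docker.from_env()

# Server 1: TRT-LLM backend for the LLM / Whisper engines, pinned to GPUs 0 and 1.
client.containers.run(
    "nvcr.io/nvidia/tritonserver:<xx.yy>-trtllm-python-py3",  # placeholder tag
    command="tritonserver --model-repository=/models",
    volumes={"/srv/llm_models": {"bind": "/models", "mode": "ro"}},
    device_requests=[DeviceRequest(device_ids=["0", "1"], capabilities=[["gpu"]])],
    ports={"8000/tcp": 8000, "8001/tcp": 8001, "8002/tcp": 8002},
    detach=True,
)

# Server 2: standard Triton image with the TensorRT backend, pinned to GPUs 2 and 3,
# mapped to different host ports so both servers can run side by side.
client.containers.run(
    "nvcr.io/nvidia/tritonserver:<xx.yy>-py3",  # placeholder tag
    command="tritonserver --model-repository=/models",
    volumes={"/srv/trt_models": {"bind": "/models", "mode": "ro"}},
    device_requests=[DeviceRequest(device_ids=["2", "3"], capabilities=[["gpu"]])],
    ports={"8000/tcp": 9000, "8001/tcp": 9001, "8002/tcp": 9002},
    detach=True,
)
```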
June
Hello,
> And in the TensorRT-LLM backend repo we haven't provided direct TensorRT backend support.
Do you have a plan for a Triton Server that can handle both the TensorRT-LLM backend and the TensorRT backend? Or are there technical difficulties in having one server handle both LLM and non-LLM models?
> Some of them with TensorRT-LLM, while the others you still want to serve with the TensorRT backend
Yes
> The TensorRT-LLM backend and the TensorRT backend, for example using two different Docker images for deploying the different models
If I launch two Triton Server instances, can they point to the same set of GPUs, or do they have to point to different sets (for instance, the LLM server points to GPUs 0,1 and the normal Triton server points to GPUs 2,3)?
Is there a specific reason why operations like flash attention / KV cache / beam search are not ported to standard TensorRT?
Thanks
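For context, whatever the GPU assignment ends up being, the split is mostly invisible from the client side: each request simply goes to whichever server hosts the model. A minimal sketch with tritonclient (URLs, ports, and model names here are illustrative):

```python
import tritonclient.http as httpclient

# One client per Triton instance; the URLs match however the two servers were started.
llm_server = httpclient.InferenceServerClient(url="localhost:8000")  # TensorRT-LLM backend
trt_server = httpclient.InferenceServerClient(url="localhost:9000")  # TensorRT backend

# Route each request to the server that hosts the model.
print("whisper ready:", llm_server.is_model_ready("whisper"))  # hypothetical model name
print("fer_2 ready:", trt_server.is_model_ready("fer_2"))      # detection model from the error above
```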
It would be great if tensorrtllm_backend could be used in Triton Inference Server together with tensorrt_backend; is this on the roadmap? That said, using two Docker images to run LLM and non-LLM models doesn't seem like a bad solution, because TensorRT-LLM is quite specific to LLMs and involves many techniques to speed them up, so I guess managing GPU memory would be challenging?
We do not recommend sharing GPUs between TRT-LLM model and anything else. LLMs tend to be memory constrained and sharing GPU memory across different models, though technically possible, can cause perf degradation in both.
We have the use case of currently running TensorRT variants of object detection models and would now like to run TensorRT variants of LLM models such as Llama 2 in production.
Our existing Triton Server Docker image runs the TensorRT, Python, and DALI backends; can't we have TensorRT-LLM within the same image?
We understand the performance implications of running the different workloads; we'd prefer to have all of our Triton models in a single image.
> We do not recommend sharing GPUs between TRT-LLM model and anything else. LLMs tend to be memory constrained and sharing GPU memory across different models, though technically possible, can cause perf degradation in both.
If I understand correctly, does this mean you have no plans in the near future to release a Triton Server that includes both the LLM backend and the TRT backend, due to the technical difficulties you described?
See https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/whisper.md
Revisiting this issue:
I do see from the tensorrt_llm API documentation that almost all the common deep learning operators (convolution, fully connected, pooling, etc.) are available. Say we have a detection model based on EfficientNet; I think I can rewrite the network using the tensorrt_llm layers. Is there any risk in putting this model together with an LLM model in the model_repository and letting tritonserver serve them both with the tensorrt_llm backend?
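If that works, the serving side would presumably just be one model_repository holding both engines, each with its own config.pbtxt pointing at the tensorrtllm backend. A minimal layout sanity-check sketch (the repository path and model names are hypothetical):

```python
import os

# Hypothetical model_repository holding both the LLM engine and the rewritten
# EfficientNet-based detector; each <model>/config.pbtxt would set
# backend: "tensorrtllm", and each <model>/1/ holds that model's engine files.
MODEL_REPO = "/models"  # placeholder path

for model in sorted(os.listdir(MODEL_REPO)):
    config = os.path.join(MODEL_REPO, model, "config.pbtxt")
    version_dir = os.path.join(MODEL_REPO, model, "1")
    print(f"{model}: config.pbtxt={os.path.isfile(config)}, version 1/={os.path.isdir(version_dir)}")
```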