
Can I use the Triton Server TensorRT-LLM backend to host other TensorRT-built models? If not, what do you suggest if our model stack is a mix of LLM and non-LLM models?

Open zmy1116 opened this issue 1 year ago • 9 comments

Hello,

Our current model stack consists of a set of models built with TensorRT, plus the Whisper ASR model.

I'd like to use Triton Server to host all of these models. Since Whisper can be converted with TensorRT-LLM, I tried to host everything on Triton Server with the TensorRT-LLM backend, and I'm seeing this error:

E1220 20:46:48.038714 1 model_lifecycle.cc:621] failed to load 'fer_2' version 1: Invalid argument: unable to find 'libtriton_tensorrt.so' or 'tensorrt/model.py' for model 'fer_2', in /opt/tritonserver/backends/tensorrt

I think this suggests that the Triton Server TensorRT-LLM backend does not support plain TensorRT models. Is that the case?
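For reference, this is roughly how I check which models loaded and why. A minimal sketch, assuming the server's default HTTP port at localhost:8000 and that tritonclient is installed; it just reads the model repository index that Triton reports:

```python
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
for model in client.get_model_repository_index():
    # Each entry reports the model's name, version, load state, and (if it failed)
    # the reason, so 'fer_2' shows up as UNAVAILABLE with the message above.
    print(model.get("name"), model.get("state"), model.get("reason", ""))
```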

If so, what should I do? What would you recommend if we have a large model stack that mixes LLM and non-LLM models?

Thank you.

zmy1116 avatar Dec 20 '23 20:12 zmy1116

Hi,

I understand you may want to provide a full response, but if possible, for now I really just need a quick confirmation of whether the Triton Server TensorRT-LLM backend supports plain TensorRT models.

Also, if it doesn't, some general advice would help so I know what to plan next year for my team and our infrastructure:

  • If you plan to release a general Triton Server image that includes the TensorRT-LLM backend soon, then I will just wait.

  • Otherwise, I can try to convert Whisper to a TensorRT engine. I foresee two challenges: 1. the KV cache, and 2. beam search. I see you have examples for converting the T5 model, so I suppose I can replicate that for Whisper.

  • Or I could potentially convert our whole model stack to TensorRT-LLM engines? After all, a conv layer is a conv layer, whether it ends up in a TensorRT engine or a TensorRT-LLM engine.

Thank you

zmy1116 avatar Dec 28 '23 02:12 zmy1116

@zmy1116 Hi,

The error message mentioning libtriton_tensorrt.so indicates that you are trying to use the TensorRT backend to serve that model, and the TensorRT-LLM backend repo does not provide direct TensorRT backend support.
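One quick way to confirm this is to list what actually ships inside each container image. A rough sketch, assuming the standard backend directory from the error message above:

```python
import os

# Inside a given Triton container, list the backend directories it ships with.
backends_dir = "/opt/tritonserver/backends"
print(sorted(os.listdir(backends_dir)))
# In the TRT-LLM image you would expect to see e.g. 'tensorrtllm' and 'python' here,
# but no 'tensorrt' directory, which is why loading 'fer_2' fails with the error above.
```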

May I ask about your concrete scenario? You have a set of models to serve, some of which should be served with TensorRT-LLM, while others you still want to serve with the TensorRT backend. Is that the case?

Is it possible for you to use two Triton backends in your production environment, i.e. the TensorRT-LLM backend and the TensorRT backend, for example using two different Docker images to deploy the different models?
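As a rough sketch of what that split deployment could look like from the client side: the host ports and the tensor names "input" / "output" below are assumptions rather than anything we ship, and "fer_2" is simply reused from the error message above.

```python
import numpy as np
import tritonclient.http as httpclient

# One Triton server built from the TRT-LLM image, one from the standard image.
llm_client = httpclient.InferenceServerClient(url="localhost:8000")  # TensorRT-LLM backend container
trt_client = httpclient.InferenceServerClient(url="localhost:9000")  # TensorRT backend container

assert llm_client.is_server_ready() and trt_client.is_server_ready()

def infer_detector(image: np.ndarray) -> np.ndarray:
    """Run a plain TensorRT model (e.g. 'fer_2') on the second server."""
    inp = httpclient.InferInput("input", list(image.shape), "FP32")
    inp.set_data_from_numpy(image.astype(np.float32))
    result = trt_client.infer(model_name="fer_2", inputs=[inp])
    return result.as_numpy("output")
```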

June

juney-nvidia avatar Jan 01 '24 11:01 juney-nvidia

hello,

the TensorRT-LLM backend repo does not provide direct TensorRT backend support

Do you have a plan for a Triton Server that can handle both the TensorRT-LLM backend and the TensorRT backend? Or are there technical difficulties in having one server handle both LLM and non-LLM models?

some of which should be served with TensorRT-LLM, while others you still want to serve with the TensorRT backend

Yes

the TensorRT-LLM backend and the TensorRT backend, for example using two different Docker images to deploy the different models

If I launch two Triton servers, can they point to the same set of GPUs, or do they have to point to different sets (for instance, the LLM server uses GPUs 0 and 1 while the normal Triton server uses GPUs 2 and 3)?

Is there a specific reason why operations like flash attention, the KV cache, and beam search are not ported to standard TensorRT?

Thanks

zmy1116 avatar Jan 04 '24 01:01 zmy1116

It would be great if tensorrtllm_backend could be used in Triton Inference Server together with tensorrt_backend; is this on the roadmap? That said, using two Docker images to run LLM and non-LLM models does not seem like a bad solution, because TensorRT-LLM is quite specific to LLMs and involves many techniques to speed them up, so I guess managing GPU memory in a shared setup would be challenging?

HKAB avatar Jan 24 '24 03:01 HKAB

We do not recommend sharing GPUs between a TRT-LLM model and anything else. LLMs tend to be memory constrained, and sharing GPU memory across different models, though technically possible, can cause performance degradation in both.
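To answer the earlier GPU question: if you do run two servers on one machine, a minimal sketch of pinning each server to disjoint GPUs with CUDA_VISIBLE_DEVICES follows. The repository paths, port numbers, and GPU indices are placeholders, and in practice each process would run inside its own container image.

```python
import os
import subprocess

def launch_triton(model_repo: str, gpus: str, http_port: int, grpc_port: int, metrics_port: int):
    """Start one tritonserver process pinned to a subset of GPUs via CUDA_VISIBLE_DEVICES."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpus)
    return subprocess.Popen(
        [
            "tritonserver",
            f"--model-repository={model_repo}",
            f"--http-port={http_port}",
            f"--grpc-port={grpc_port}",
            f"--metrics-port={metrics_port}",
        ],
        env=env,
    )

# LLM models (TensorRT-LLM backend) on GPUs 0,1; plain TensorRT models on GPUs 2,3.
llm_server = launch_triton("/models/llm_repo", "0,1", 8000, 8001, 8002)
trt_server = launch_triton("/models/trt_repo", "2,3", 9000, 9001, 9002)
```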

schetlur-nv avatar Jan 24 '24 05:01 schetlur-nv

We currently run TensorRT variants of object detection models and would now like to run TensorRT variants of LLMs such as Llama 2 in production.

Our existing Triton Server Docker image runs the TensorRT, Python, and DALI backends; can't we have TensorRT-LLM within the same image?

We understand the performance implications of running the different workloads; we'd still prefer to have everything Triton-related in a single image.

deadmanoz avatar Jan 25 '24 15:01 deadmanoz

We do not recommend sharing GPUs between a TRT-LLM model and anything else. LLMs tend to be memory constrained, and sharing GPU memory across different models, though technically possible, can cause performance degradation in both.

If I understand correctly, does this mean that you have no plan in the near future to release a Triton Server that includes both the LLM backend and the TRT backend, due to the technical difficulties you described?

zmy1116 avatar Jan 25 '24 17:01 zmy1116

See https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/whisper.md

yuekaizhang avatar Nov 06 '24 05:11 yuekaizhang

Revisiting this issue:

I see from the tensorrt_llm API documentation that almost all of the deep learning operators (such as convolution, fully connected, and pooling layers) are covered. Say we have a detection model based on EfficientNet; I think I can rewrite the network using the tensorrt_llm layers. Is there any risk in putting this model together with an LLM model in the same model_repository and letting tritonserver serve them both with the tensorrt_llm backend?

zmy1116 avatar Nov 14 '24 23:11 zmy1116