tensorrtllm_backend
The Triton TensorRT-LLM Backend
Can you provide an example of a visual language model or multimodal model launched by Triton server?
There is an example at https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/qwenvl, but I have no idea how to use this model with Triton server. Can you provide an example of a visual language model...
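For the question above, a minimal sketch of the client side: building the JSON body for Triton's HTTP `generate` endpoint. The field names `text_input`, `max_tokens`, and `image_input` are assumptions modeled on typical ensemble configs — check the `config.pbtxt` of the actual deployed model for the real tensor names, and note that the multimodal examples may expect precomputed image embeddings rather than raw image bytes.

```python
import base64
import json

def build_generate_payload(prompt, image_bytes, max_tokens=64):
    """Build a JSON body for Triton's HTTP generate endpoint.

    Field names here ("text_input", "max_tokens", "image_input") are
    assumptions -- verify them against your model's config.pbtxt.
    """
    return {
        "text_input": prompt,
        "max_tokens": max_tokens,
        # Binary image data is base64-encoded so it survives JSON transport.
        "image_input": base64.b64encode(image_bytes).decode("ascii"),
    }

payload = build_generate_payload("Describe the picture.", b"\x89PNG\r\n")
body = json.dumps(payload)  # POST this to /v2/models/<model_name>/generate
```

The same payload shape works with `curl` or any HTTP client; only the endpoint path (`/v2/models/<model_name>/generate`) is part of Triton's documented generate extension.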
I have built the TensorRT-LLM Backend via Docker, but the resulting image is much larger than the image on NGC. How can I decrease its size? This is the...
### System Info x86 Ryzen 5950x v0.8.0 ubuntu 22.04 Rtx 3090 ### Who can help? @byshiue @schetlur-nv ### Information - [X] The official example scripts - [ ] My own...
When I use tensorrt_llm_bls, the first token takes a very long time; it looks like the queue is blocked. Using tensorrt_llm with the ensemble did not encounter this problem. How should I troubleshoot...
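A first troubleshooting step for the report above is to measure time-to-first-token (TTFT) in isolation. The sketch below times the first item yielded by any streaming iterator; the `fake_stream` generator is a stand-in (an assumption, not a Triton API) for the per-token stream a streaming/decoupled client would hand back.

```python
import time

def time_to_first_token(token_iter):
    """Return (first_token, seconds_until_first_token) for a streaming
    response. Works with any iterator of tokens."""
    start = time.perf_counter()
    first = next(token_iter)
    return first, time.perf_counter() - start

def fake_stream(delay_s=0.05):
    """Stand-in for a real streamed response, with an artificial
    delay before the first token to mimic queueing."""
    time.sleep(delay_s)
    yield "Hello"
    yield " world"

token, ttft = time_to_first_token(fake_stream())
```

Comparing TTFT between the `tensorrt_llm_bls` and `ensemble` entry points with identical requests narrows the problem to the BLS scripting layer versus the underlying model.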
### System Info - CPU: amd64 - OS: Debian 12 - GPU: nvidia rtx4000 ada - GPU driver: 535.161 - TensorRT-LLM version: 0.8 - tensorrtllm_backend version: 0.8 ### Who can...
### System Info CPU Architecture: AMD EPYC 7V13 64-Core Processor CPU/Host memory size: 440 GPU properties: A800 80GB GPU name: NVIDIA A800 80GB x2 GPU mem size: 80GB x 2...
I need a no-Docker version, for the case when I have no root privileges. So, how can I build the TensorRT-LLM Backend without Docker? I find no build.py in the code, and cmake the file...
I am trying to deploy a Baichuan2-7B model on a machine with 2 Tesla V100 GPUs. Unfortunately, each V100 has only 16 GB of memory. I have applied INT8 weight-only quantization, so...
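The arithmetic behind the report above can be sketched quickly: with INT8 weight-only quantization each weight costs 1 byte, and tensor parallelism splits the weights across GPUs. This is a lower bound only — KV cache, activations, and runtime buffers also need room on each card.

```python
def weight_bytes_per_gpu(n_params, bytes_per_weight, tp_size):
    """Rough weight footprint per GPU under tensor parallelism.
    Ignores KV cache, activations, and runtime buffers, so treat the
    result as a lower bound, not a capacity check."""
    return n_params * bytes_per_weight / tp_size

# Baichuan2-7B: ~7e9 parameters, INT8 weight-only = 1 byte per weight,
# split across the 2 V100s (tp_size=2).
per_gpu_gib = weight_bytes_per_gpu(7e9, 1, 2) / 1024**3
# ~3.3 GiB of weights per 16 GiB card; the remainder is what the
# KV cache and runtime buffers must fit into.
```

So the weights themselves fit comfortably; if the deployment still runs out of memory, the KV cache size (e.g. max batch size and sequence length) is the usual place to look.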
https://arxiv.org/abs/2404.15420 "In-context learning (ICL) approaches typically leverage prompting to condition decoder-only language model generation on reference information. Just-in-time processing of a context is inefficient due to the quadratic cost of...
### System Info CPU Architecture: AMD EPYC 7V13 64-Core Processor CPU/Host memory size: 440 GPU properties: A100 80GB GPU name: NVIDIA A100 80GB x2 GPU mem size: 80GB x 2...