
The Triton TensorRT-LLM Backend

Results 251 tensorrtllm_backend issues

There is an example at https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/qwenvl , but I have no idea how to use this model in Triton server. Can you provide an example of a visual language model...

triaged

I have built the TensorRT-LLM Backend via Docker, but the resulting image is much larger than the image on NGC. How can I decrease its size? ![Capture](https://github.com/triton-inference-server/tensorrtllm_backend/assets/26588466/50dbe936-7551-4831-ab07-07faa541d66b) This is the...

triaged

### System Info
- CPU: x86 Ryzen 5950x
- Version: v0.8.0
- OS: Ubuntu 22.04
- GPU: RTX 3090

### Who can help?
@byshiue @schetlur-nv

### Information
- [X] The official example scripts
- [ ] My own...

bug

When I use tensorrt_llm_bls, the first token takes a very long time; it looks like the queue is blocked. Using tensorrt_llm and ensemble, I didn't encounter this problem. How should I troubleshoot...

triaged

### System Info
- CPU: amd64
- OS: Debian 12
- GPU: NVIDIA RTX 4000 Ada
- GPU driver: 535.161
- TensorRT-LLM version: 0.8
- tensorrtllm_backend version: 0.8

### Who can...

triaged
need more info

### System Info
- CPU Architecture: AMD EPYC 7V13 64-Core Processor
- CPU/Host memory size: 440 GB
- GPU properties: A800 80GB
- GPU name: NVIDIA A800 80GB x2
- GPU mem size: 80 GB x 2...

triaged

I need a no-Docker build for the case where I have no root privileges. How can I build the TensorRT-LLM Backend without Docker? I find no build.py in the code, and cmake the file...

I am trying to deploy a Baichuan2-7B model on a machine with 2 Tesla V100 GPUs. Unfortunately each V100 has only 16GB memory. I have applied INT8 weight-only quantization, so...

question
triaged
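For the Baichuan2-7B question above, a back-of-envelope memory estimate shows why INT8 weight-only quantization plus tensor parallelism is needed on 16GB V100s. All numbers below are illustrative assumptions (parameter count, overhead allowance), not measured values from the issue:

```python
# Rough per-GPU memory estimate: 7B-parameter model, INT8 weight-only
# quantization, tensor parallelism across 2 GPUs. Purely illustrative.

GIB = 1024**3

params = 7e9              # assumed parameter count (Baichuan2-7B)
bytes_per_weight = 1      # INT8 weight-only: 1 byte per parameter
tp_size = 2               # tensor parallelism across the two V100s
gpu_mem_gib = 16          # per-GPU memory on a 16GB V100

weights_per_gpu_gib = params * bytes_per_weight / tp_size / GIB
print(f"weights per GPU: ~{weights_per_gpu_gib:.1f} GiB")

# Whatever remains after weights and runtime overhead bounds the KV cache;
# the overhead figure here is an assumed rough allowance, not a measurement.
assumed_overhead_gib = 2.0
kv_budget_gib = gpu_mem_gib - weights_per_gpu_gib - assumed_overhead_gib
print(f"rough KV-cache budget per GPU: ~{kv_budget_gib:.1f} GiB")
```

Under these assumptions the INT8 weights fit in roughly 3.3 GiB per GPU, leaving most of the 16 GB for KV cache and activations; an FP16 build (2 bytes per weight) would double the weight footprint and shrink that budget accordingly.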

https://arxiv.org/abs/2404.15420 "In-context learning (ICL) approaches typically leverage prompting to condition decoder-only language model generation on reference information. Just-in-time processing of a context is inefficient due to the quadratic cost of...

### System Info
- CPU Architecture: AMD EPYC 7V13 64-Core Processor
- CPU/Host memory size: 440 GB
- GPU properties: A100 80GB
- GPU name: NVIDIA A100 80GB x2
- GPU mem size: 80 GB x 2...

bug
triaged