tensorrtllm_backend
The Triton TensorRT-LLM Backend
Can you provide an example of a visual language model or multimodal model launched by Triton server?
There is an example at https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/qwenvl, but I have no idea how to use this model with Triton server. Can you provide an example of a visual language model...
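For the question above, a minimal sketch of the client side: building the JSON body for Triton's HTTP `generate` endpoint. The field names `text_input`, `max_tokens`, and `image_input` are assumptions modeled on typical ensemble configs — check the `config.pbtxt` of the actual deployed model for the real tensor names, and note that the multimodal examples may expect precomputed image embeddings rather than raw image bytes.

```python
import base64
import json

def build_generate_payload(prompt, image_bytes, max_tokens=64):
    """Build a JSON body for Triton's HTTP generate endpoint.

    Field names here ("text_input", "max_tokens", "image_input") are
    assumptions -- verify them against your model's config.pbtxt.
    """
    return {
        "text_input": prompt,
        "max_tokens": max_tokens,
        # Binary image data is base64-encoded so it survives JSON transport.
        "image_input": base64.b64encode(image_bytes).decode("ascii"),
    }

payload = build_generate_payload("Describe the picture.", b"\x89PNG\r\n")
body = json.dumps(payload)  # POST this to /v2/models/<model_name>/generate
```

The same payload shape works with `curl` or any HTTP client; only the endpoint path (`/v2/models/<model_name>/generate`) is part of Triton's documented generate extension.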
I have built the TensorRT-LLM Backend via Docker, but the resulting image is much larger than the image on NGC. How can I decrease its size? This is the...
### System Info x86 Ryzen 5950x v0.8.0 ubuntu 22.04 Rtx 3090 ### Who can help? @byshiue @schetlur-nv ### Information - [X] The official example scripts - [ ] My own...
When I use tensorrt_llm_bls, the first token takes a very long time; it looks like the queue is blocked. Using tensorrt_llm with the ensemble did not encounter this problem. How should I troubleshoot...
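A first troubleshooting step for the report above is to measure time-to-first-token (TTFT) in isolation. The sketch below times the first item yielded by any streaming iterator; the `fake_stream` generator is a stand-in (an assumption, not a Triton API) for the per-token stream a streaming/decoupled client would hand back.

```python
import time

def time_to_first_token(token_iter):
    """Return (first_token, seconds_until_first_token) for a streaming
    response. Works with any iterator of tokens."""
    start = time.perf_counter()
    first = next(token_iter)
    return first, time.perf_counter() - start

def fake_stream(delay_s=0.05):
    """Stand-in for a real streamed response, with an artificial
    delay before the first token to mimic queueing."""
    time.sleep(delay_s)
    yield "Hello"
    yield " world"

token, ttft = time_to_first_token(fake_stream())
```

Comparing TTFT between the `tensorrt_llm_bls` and `ensemble` entry points with identical requests narrows the problem to the BLS scripting layer versus the underlying model.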
### System Info - CPU: amd64 - OS: Debian 12 - GPU: nvidia rtx4000 ada - GPU driver: 535.161 - TensorRT-LLM version: 0.8 - tensorrtllm_backend version: 0.8 ### Who can...
### System Info CPU Architecture: AMD EPYC 7V13 64-Core Processor CPU/Host memory size: 440 GPU properties: A800 80GB GPU name: NVIDIA A800 80GB x2 GPU mem size: 80GB x 2...
I need a no-Docker version, for the case when I have no root privileges. So, how can I build the TensorRT-LLM Backend without Docker? I find no build.py in the code, and cmake the file...
I am trying to deploy a Baichuan2-7B model on a machine with 2 Tesla V100 GPUs. Unfortunately, each V100 has only 16 GB of memory. I have applied INT8 weight-only quantization, so...
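The arithmetic behind the report above can be sketched quickly: with INT8 weight-only quantization each weight costs 1 byte, and tensor parallelism splits the weights across GPUs. This is a lower bound only — KV cache, activations, and runtime buffers also need room on each card.

```python
def weight_bytes_per_gpu(n_params, bytes_per_weight, tp_size):
    """Rough weight footprint per GPU under tensor parallelism.
    Ignores KV cache, activations, and runtime buffers, so treat the
    result as a lower bound, not a capacity check."""
    return n_params * bytes_per_weight / tp_size

# Baichuan2-7B: ~7e9 parameters, INT8 weight-only = 1 byte per weight,
# split across the 2 V100s (tp_size=2).
per_gpu_gib = weight_bytes_per_gpu(7e9, 1, 2) / 1024**3
# ~3.3 GiB of weights per 16 GiB card; the remainder is what the
# KV cache and runtime buffers must fit into.
```

So the weights themselves fit comfortably; if the deployment still runs out of memory, the KV cache size (e.g. max batch size and sequence length) is the usual place to look.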
https://arxiv.org/abs/2404.15420 "In-context learning (ICL) approaches typically leverage prompting to condition decoder-only language model generation on reference information. Just-in-time processing of a context is inefficient due to the quadratic cost of...
### System Info CPU Architecture: AMD EPYC 7V13 64-Core Processor CPU/Host memory size: 440 GPU properties: A100 80GB GPU name: NVIDIA A100 80GB x2 GPU mem size: 80GB x 2...