tensorrtllm_backend
The Triton TensorRT-LLM Backend
Hello, our current model stack consists of a set of models built with TensorRT plus the Whisper ASR model. I'd like to use Triton Inference Server to host all of these...
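Hosting them together is mostly a matter of placing each model in the same Triton model repository; a single server process then serves them all. A minimal client-side sketch, assuming a server on localhost:8000 — "my_trt_model" and "whisper" are placeholder model names, not names from this repository:

```
import tritonclient.http as httpclient

# Hypothetical model names; substitute the names used in your model repository.
client = httpclient.InferenceServerClient(url="localhost:8000")

# One Triton instance serves every model in its repository; readiness can be
# checked per model before sending requests.
for model_name in ("my_trt_model", "whisper"):
    print(model_name, "ready:", client.is_model_ready(model_name))
```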
Deprecate old multinode tutorial link that is no longer relevant
Try to get valid output from the model, i.e. do not include the model's input in the output in the non-streaming case
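The inflight_batcher_llm example configs expose an optional `exclude_input_in_output` boolean input for exactly this; a hedged client sketch, assuming that input exists in your config.pbtxt (token ids and lengths below are placeholders):

```
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

input_ids = np.array([[1, 15043, 3186]], dtype=np.int32)  # placeholder tokens

def int32_input(name, value):
    t = httpclient.InferInput(name, list(value.shape), "INT32")
    t.set_data_from_numpy(value)
    return t

exclude = httpclient.InferInput("exclude_input_in_output", [1, 1], "BOOL")
exclude.set_data_from_numpy(np.array([[True]], dtype=bool))

inputs = [
    int32_input("input_ids", input_ids),
    int32_input("input_lengths", np.array([[input_ids.shape[1]]], dtype=np.int32)),
    int32_input("request_output_len", np.array([[64]], dtype=np.int32)),
    exclude,
]

result = client.infer("tensorrt_llm", inputs)
# With exclude_input_in_output=True, output_ids should hold only the
# generated completion, not the echoed prompt.
print(result.as_numpy("output_ids"))
```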
### System Info The TensorRT-LLM backend doesn't work for us because of this bug: https://github.com/triton-inference-server/tensorrtllm_backend/issues/598. So I have to use the [python backend](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/inflight_batcher_llm/tensorrt_llm/1/model.py). However, it only supports detached models, which we don't...
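For reference, the python backend's decoupled response flow (what the post appears to call a "detached" model) looks roughly like this. A minimal sketch using the standard triton_python_backend_utils response-sender API, with placeholder tensor names; it requires `model_transaction_policy { decoupled: true }` in config.pbtxt:

```
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        # Decoupled mode: responses go through a sender instead of a return
        # value, so each request can receive zero, one, or many responses.
        for request in requests:
            sender = request.get_response_sender()
            in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            out_tensor = pb_utils.Tensor("OUTPUT0", in_tensor.as_numpy())
            sender.send(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
            # The FINAL flag closes the response stream for this request.
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
        return None
```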
**Description** I have noticed a huge difference in memory usage for runtime buffers and the decoder between Llama 3 and Llama 3.1. **Triton Information** What version of Triton are you...
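One plausible explanation worth checking: Llama 3.1 raised the maximum context from 8K to 128K tokens, and KV-cache/runtime buffers scale linearly with the maximum sequence length the engine is built for. A back-of-the-envelope sketch, assuming Llama-3.1-8B-like dimensions (32 layers, 8 KV heads, head_dim 128, fp16):

```
# The KV cache holds a key and a value vector per layer, per KV head, per token.
def kv_cache_bytes(layers, kv_heads, head_dim, max_tokens, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * max_tokens * dtype_bytes

for name, max_tokens in [("Llama 3 (8K)", 8192), ("Llama 3.1 (128K)", 131072)]:
    gib = kv_cache_bytes(32, 8, 128, max_tokens) / 2**30
    print(f"{name}: ~{gib:.1f} GiB of KV cache per sequence")
# -> ~1.0 GiB vs ~16.0 GiB: capping max_seq_len at build time shrinks buffers.
```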
### System Info CPU: Intel Core i7-14700K. GPU: RTX 4090. tensorrt_llm 0.13. Docker image: tritonserver:24.09-trtllm-python-py3 ### Who can help? @Tracin ### Information - [X] The official example scripts - [ ] My...
### System Info Ubuntu 22.04. Triton image: nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3 with trtllm-backend version 0.10.0. Model: qwen2-7b-instruct ### Who can help? _No response_ ### Information - [ ] The official...
Typo correction in launch_triton_server.py
### System Info NVIDIA-SMI 535.104.12, Driver Version: 535.104.12, CUDA Version: 12.5, base docker image: tritonserver:24.08-trtllm-python-py3 ### Who can help? I want to install torchaudio in `tritonserver:24.08-trtllm-python-py3`. A conflict occurred between...
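A common cause of this conflict is pip pulling a torchaudio wheel built against a different torch than the NGC container ships, then replacing the container's pinned torch. A hedged workaround sketch (the exact matching version must be looked up for your container):

```
# Check the torch build pinned inside the container first:
import torch
print(torch.__version__)  # NGC images ship a custom "a0"-suffixed build

# Then install a torchaudio wheel built for that torch, with --no-deps so
# pip cannot uninstall or replace the container's torch:
#   pip install --no-deps torchaudio==<version matching the torch above>
```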
I have two RTX 3090s. I want to launch TensorRT-LLM on each GPU separately. For this I have to change several things: 1. In tensorrt_llm/config.pbtxt:
```
instance_group { count: 0 kind:...
```
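An alternative sketch that avoids touching config.pbtxt at all: run one tritonserver process per GPU and pin each with CUDA_VISIBLE_DEVICES (the ports and the /models path below are placeholders):

```
import os
import subprocess

# One server per GPU; each process only sees the single 3090 assigned to it.
for gpu_id, base_port in [(0, 8000), (1, 8010)]:
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    subprocess.Popen(
        [
            "tritonserver",
            "--model-repository=/models",
            f"--http-port={base_port}",
            f"--grpc-port={base_port + 1}",
            f"--metrics-port={base_port + 2}",
        ],
        env=env,
    )
```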