
The Triton TensorRT-LLM Backend

Results: 251 tensorrtllm_backend issues

### System Info - Ubuntu - GPU A100 / 3090 RTX - docker nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3 - Python tensorrt-llm package (version 0.9.0.dev2024030500) installed in the docker image (no other installation) ### Who...

bug
triaged

### System Info **Hardware:** - CPU architecture: x86_64 - CPU memory size: - L1d cache: 2 MiB - L1i cache: 2 MiB - L2 cache: 64 MiB - L3 cache:...

bug
triaged

### System Info While building TensorRT engines for the Mixtral model Mixtral-8x7B-Instruct-v0.1, I ran into this error: Loading checkpoint shards: 21%|██████████████████████████████████▌ | 4/19 [05:30

bug
triaged

### System Info V100*2 nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3 tensorrt-llm 0.7.0 ### Who can help? _No response_ ### Information - [X] The official example scripts - [ ] My own modified scripts ### Tasks...

bug

Once I have correctly deployed my model on a Triton server, when I try to send a request: `curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 100,...
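For reference, the same request can be sent from Python. This is a minimal sketch: the payload fields and the `text_output` response field mirror the inflight_batcher_llm ensemble example in this repo's README, while the host, port, and timeout are assumptions.

```python
# Minimal sketch of a request to Triton's HTTP generate endpoint.
# Assumes the server runs locally on the default HTTP port 8000 and
# serves the "ensemble" model from the inflight_batcher_llm example.
import requests

payload = {
    "text_input": "What is machine learning?",
    "max_tokens": 100,
    "bad_words": "",
    "stop_words": "",
}
resp = requests.post(
    "http://localhost:8000/v2/models/ensemble/generate",
    json=payload,
    timeout=60,  # assumption; tune for your engine's latency
)
resp.raise_for_status()
# The README's ensemble example returns the completion in "text_output".
print(resp.json()["text_output"])
```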

### System Info - GPU: 2 x Nvidia A100 80GB ![image (4)](https://github.com/triton-inference-server/tensorrtllm_backend/assets/142644506/ef227a48-4094-4df5-8daf-2917b6bf6627) ### Who can help? _No response_ ### Information - [X] The official example scripts - [ ] My...

bug
triaged

### System Info - CPU architecture: `x86_64` - GPU: NVIDIA A10 24GB - TensorRT-LLM: `v0.8.0` (docker build via `make -C docker release_build CUDA_ARCHS="86-real"`) - Triton Inference Server: `r24.02` (docker from...

bug

### System Info ec2 instance - g5.12xlarge ami - ami-0d8667b0f72471655 ### Who can help? Hi, I'm writing to ask about a discrepancy I'm seeing when trying to run mistral-7b on...

bug
triaged

https://github.com/triton-inference-server/tensorrtllm_backend/blob/49def341ca37e0db3dc8c80c99da824107a7a938/README.md?plain=1#L231 Make the text consistent for the boolean variable in the README; it should likely be `true`, not `True`: ``` Optional (default=false). Controls streaming. Decoupled mode must be set to True if using the...

documentation
triaged
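Related to the `stream` flag above: in a JSON request body the boolean must be serialized as lowercase `true`, which is one reason the README's `True` is misleading. A minimal streaming sketch follows, assuming the ensemble exposes a boolean `stream` input, the tensorrt_llm model runs in decoupled mode, and the server's HTTP `generate_stream` endpoint is reachable on localhost:8000.

```python
# Minimal sketch of a streaming request via the generate_stream endpoint.
# The "stream" field and SSE response framing are assumptions based on
# the ensemble's boolean stream input; verify against your config.pbtxt.
import json
import requests

payload = {
    "text_input": "What is machine learning?",
    "max_tokens": 100,
    "bad_words": "",
    "stop_words": "",
    "stream": True,  # serialized as lowercase `true` on the wire
}
with requests.post(
    "http://localhost:8000/v2/models/ensemble/generate_stream",
    json=payload,
    stream=True,
    timeout=60,
) as resp:
    resp.raise_for_status()
    # The endpoint returns Server-Sent Events; each data line is one JSON chunk.
    for line in resp.iter_lines():
        if line.startswith(b"data:"):
            chunk = json.loads(line[len(b"data:"):])
            print(chunk.get("text_output", ""), end="", flush=True)
```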

https://github.com/triton-inference-server/tensorrtllm_backend/blob/49def341ca37e0db3dc8c80c99da824107a7a938/all_models/inflight_batcher_llm/preprocessing/config.pbtxt#L127 The tokenizer_type parameter is missing from the config.pbtxt yet is described in the README as a parameter to use. Please add tokenizer_type to the relevant config.pbtxt files by default.

documentation
triaged
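On the point above: in the Triton Python backend, a parameter only reaches the model if it is declared in config.pbtxt, which is why the missing entry matters. A minimal sketch of how the preprocessing model could read it in `initialize` — the `"auto"` fallback and the exact lookup are illustrative assumptions, not the repo's confirmed implementation:

```python
# Sketch of reading a `tokenizer_type` parameter in a Triton
# Python-backend model. Parameters declared in config.pbtxt arrive in
# args["model_config"] as {"parameters": {name: {"string_value": ...}}}.
import json

class TritonPythonModel:
    def initialize(self, args):
        model_config = json.loads(args["model_config"])
        params = model_config.get("parameters", {})
        # Without the config.pbtxt declaration, this lookup silently
        # falls back to the default ("auto" here is an assumption).
        self.tokenizer_type = params.get(
            "tokenizer_type", {}
        ).get("string_value", "auto")
```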