tensorrtllm_backend
The Triton TensorRT-LLM Backend
### System Info Ryzen 5950x, Ubuntu 22.04, 2 RTX 3090s, main branch ### Who can help? @byshiue @sch ### Information - [X] The official example scripts - [ ] My...
### System Info Linux devserver-ei 5.4.0-144-generic #161-Ubuntu SMP Fri Feb 3 14:49:04 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux  nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2023 NVIDIA Corporation...
### System Info CPU - x86_64, Intel(R) Xeon(R) CPU @ 2.20GHz CPU memory - 1.3TB GPUs - Nvidia A100 80GB git commit ID of the TensorRT-LLM backend: e432c6a0cc85f9790365067e7e3175e1b2ce3559 TRT-LLM version:...
### System Info 2 × 4 L40S GPUs, loading Llama-2-70B as a single model (tensorrt_llm), using image nvcr.io/nvidia/tritonserver:23.11-trtllm-python-py3 ### Who can help? _No response_ ### Information - [X] The official example scripts -...
### System Info - x86_64 CPU - A100 80GB * 2 - Docker container set up following https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#using-the-tensorrt-llm-backend ### Who can help? @byshiue @schetlur-nv ### Information - [X] The...
Building the Dockerfile fails: the command `curl -o /tmp/cuda-keyring.deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/$arch/cuda-keyring_1.0-1_all.deb` cannot find the file, and opening the URL returns a 404. Branch: master. Dockerfile: dockerfile/Dockerfile.triton.trt_llm_backend
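The 404 is consistent with NVIDIA having superseded the `cuda-keyring_1.0-1` package in its repos with a newer keyring release. A minimal sketch of a possible workaround for that Dockerfile step, assuming `cuda-keyring_1.1-1_all.deb` is the currently published filename (verify by browsing the repo URL before relying on it):

```bash
# Assumption: cuda-keyring_1.0-1_all.deb was replaced by cuda-keyring_1.1-1_all.deb
# in NVIDIA's ubuntu2204 repo; check the directory listing to confirm the filename.
arch=x86_64   # match whatever $arch the Dockerfile resolves to
curl -fL -o /tmp/cuda-keyring.deb \
  "https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/${arch}/cuda-keyring_1.1-1_all.deb"
dpkg -i /tmp/cuda-keyring.deb
```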
Does this repo support Orion-14B-xx, such as [Orion-14B-LongChat](https://huggingface.co/OrionStarAI/Orion-14B-LongChat)?
### System Info **System** CPU architecture X86_64 (EC2 G5.12xlarge) CPU/Host memory size 192GB, 8GB Swap GPU properties GPU name A10Gx4 GPU memory size 24GBx4 (96GB) **Libraries** TensorRT-LLM main TensorRT-LLM [https://github.com/NVIDIA/TensorRT-LLM/commit/0ab9d17a59c284d2de36889832fe9fc7c8697604](https://github.com/NVIDIA/TensorRT-LLM/commit/0ab9d17a59c284d2de36889832fe9fc7c8697604)...
In the [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) build.py there is `parser.add_argument('--max_batch_size', type=int, default=10)`. However, when `client/inflight_batcher_llm_client.py` calls Triton, it sends gRPC requests concurrently, and Triton accepts and returns all of them. How does it...
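One way to observe how concurrent requests interact with the engine's `max_batch_size` is to launch several copies of the client in parallel: with the in-flight batcher, the server packs active requests up to `max_batch_size` per iteration and typically queues the rest rather than rejecting them. A minimal sketch, assuming the client script exposes `--request-output-len` and `--tokenizer-dir` flags (these names are assumptions; check `--help` in your checkout):

```bash
# Hedged sketch: run 16 clients concurrently against one Triton server.
# Flag names are assumptions; verify with:
#   python3 client/inflight_batcher_llm_client.py --help
for i in $(seq 1 16); do
  python3 client/inflight_batcher_llm_client.py \
      --request-output-len 64 \
      --tokenizer-dir /path/to/tokenizer &
done
wait   # requests beyond max_batch_size are typically queued server-side
```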
Build the engine for the Baichuan2-13B-Chat model using the following command (TensorRT-LLM version v0.7.1):
python build.py --model_version v2_13b \
    --model_dir /code/tensorrt_llm/checkpoint/hf_bf16 \
    --dtype float16 \
    --use_gemm_plugin float16 \
    --use_gpt_attention_plugin float16 \
    ...