
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently...

Results: 937 TensorRT-LLM issues

I want to test an example: the initial KV cache length is 2048 and the LLM iterates 2048 times, so output_tokens = 2048, but the initial KV cache length is 2048, and...

question
triaged
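The question above comes down to simple capacity arithmetic: with a fixed KV-cache size, the room left for generation is the capacity minus the tokens already occupied by the prompt. The helper below is an illustrative sketch of that bound, not a TensorRT-LLM API.

```python
# Illustrative KV-cache arithmetic: the number of new tokens that can be
# generated is bounded by cache capacity minus prompt length.
# max_new_tokens is a hypothetical helper, not part of TensorRT-LLM.

def max_new_tokens(kv_cache_capacity: int, prompt_len: int) -> int:
    """Tokens that fit in the KV cache after the prompt is stored."""
    return max(0, kv_cache_capacity - prompt_len)

# A 2048-slot cache already holding a 2048-token prompt leaves no room
# to generate 2048 more tokens without a larger cache or eviction:
assert max_new_tokens(2048, 2048) == 0
# Halving the prompt frees half the cache for generation:
assert max_new_tokens(2048, 1024) == 1024
```

In other words, prompt length 2048 plus 2048 generated tokens needs a cache sized for 4096 positions, which is why the example as described cannot work with a 2048-slot cache.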

Summary: I would like to propose adding constrained decoding support. This feature would allow the output sequence to be constrained by a Finite State Machine (FSM) or Context-Free...

triaged
feature request
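The core idea behind the FSM-constrained decoding requested above can be sketched in a few lines: at each step, mask out any token that has no valid transition from the current FSM state, then pick among the survivors. Everything below (the toy FSM, vocabulary, and `constrained_greedy` helper) is illustrative and not a TensorRT-LLM API.

```python
# Toy FSM over tokens {"a", "b"}: state -> {token: next_state}.
# It accepts any sequence that ends in "b" (accepting state 1).
fsm = {
    0: {"a": 0, "b": 1},
    1: {"a": 0, "b": 1},
}
accepting = {1}

def constrained_greedy(logits_per_step, vocab):
    """Greedy decoding where tokens without a valid FSM transition are masked."""
    state = 0
    out = []
    for logits in logits_per_step:
        # Keep only tokens the FSM allows from the current state.
        allowed = [(tok, score) for tok, score in zip(vocab, logits)
                   if tok in fsm[state]]
        tok, _ = max(allowed, key=lambda pair: pair[1])
        out.append(tok)
        state = fsm[state][tok]
    return out, state in accepting
```

A production implementation would apply the mask to the logits tensor inside the sampling kernel rather than filtering in Python, but the transition-table lookup per step is the same.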

I use the following code to generate the checkpoint:
```
set -e
export MODEL_DIR=/mnt/memory
export MODEL_NAME=Mixtral-8x7B-Instruct-v0.1
export LD_LIBRARY_PATH=/usr/local/tensorrt/lib:$LD_LIBRARY_PATH
export PATH=/usr/local/tensorrt/bin:$PATH
export PRECISION=int4_gptq_a16
export QUANTIZE=int4_gptq
export DTYPE=bfloat16
export PYTHONPATH=/app/tensorrt-llm:$PYTHONPATH
python ../llama/convert_checkpoint.py \
...
```

triaged

### System Info
- Ubuntu 20.04
- tensorrt 10.0.1
- tensorrt-cu12 10.0.1
- tensorrt-cu12-bindings 10.0.1
- tensorrt-cu12-libs 10.0.1
- tensorrt-llm 0.10.0.dev2024050700

### Who can help?
@Tracin

### Information
- [X] The official example scripts
- [...

triaged
feature request
quantization
not a bug

### System Info
- A100 40G
- tensorrt 10.0.1
- tensorrt-llm 0.10.0.dev2024050700

### Who can help?
@Tracin

### Information
- [X] The official example scripts
- [ ] My own...

bug
triaged

1. Build Mixtral for tp8
2. Run `mpirun -n 8 ./gptSessionBenchmark`
3. `nvidia-smi` shows:
```
Wed May 15 09:13:31 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15    Driver Version: 550.54.15    CUDA Version: 12.4...
```

triaged

## Environment
- RTX8000 GPU
- TensorRT-LLM v0.9.0

## Model
- LLaVA v1.5 7B (LLaMA2 7B)
- fp16 and int8/int4 weight quantization
- batch size = 16

## Script
- official...

question
triaged

### System Info
- tensorrt 10.0.1
- tensorrt-cu12 10.0.1
- tensorrt-cu12-bindings 10.0.1
- tensorrt-cu12-libs 10.0.1
- tensorrt-llm 0.10.0.dev2024050700

### Who can help?
@byshiue

### Information
- [X] The official example scripts
- [ ] My...

triaged

### System Info
- GPU: A800
- GPU memory: 80G
- TensorRT-LLM: 0.8.0
- CUDA: 12.1
- OS: Ubuntu

### Who can help?
@byshiue @kaiyux

### Information
- [ ] The official example scripts
- ...

bug

### System Info
- CPU architecture: x86_64
- GPU name: NVIDIA A40, 46GB
- TensorRT-LLM: v0.9.0
- OS: Ubuntu 20.04
- NVIDIA Driver: 535.54.03, CUDA: 12.2

### Who can help?...

bug
triaged