TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficientl...
```
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024050700
[TensorRT-LLM][INFO] Engine version 0.10.0.dev2024050700 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found
[TensorRT-LLM][WARNING] Optional value for...
```
### System Info
GPUs: A100, 4 GPUs (40 GB memory)
Release: tensorrt-llm 0.9.0
### Who can help?
@Tracin
### Information
- [X] The official example scripts
- [ ] My...
I use GenerationExecutorWorker for a web service and set the parameter stop_words_list = [["hello, yes"]] by modifying the as_inference_request function in executor.py as follows. The ir parameter is as follows. It then fails.
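Multi-token stop phrases like "hello, yes" are tricky in streaming because the phrase can be split across emitted chunks. The sketch below is illustrative only, not the TensorRT-LLM executor API: it shows one way a server loop can buffer decoded text so a stop phrase is detected and suppressed even when it arrives in pieces. All names here (`stream_with_stop`, `pieces`) are hypothetical.

```python
def stream_with_stop(pieces, stop_phrases):
    """Yield decoded text chunks, stopping at the first stop phrase.

    Buffers any trailing text that could still be the beginning of a
    stop phrase, so partial matches are never emitted to the client.
    """
    buf = ""
    for piece in pieces:
        buf += piece
        # Full match: emit everything before the stop phrase and halt.
        for stop in stop_phrases:
            idx = buf.find(stop)
            if idx != -1:
                if idx:
                    yield buf[:idx]
                return
        # Hold back any suffix of buf that is a prefix of a stop phrase,
        # since the rest of the phrase may arrive in the next chunk.
        hold = 0
        for stop in stop_phrases:
            for k in range(min(len(stop) - 1, len(buf)), 0, -1):
                if buf.endswith(stop[:k]):
                    hold = max(hold, k)
                    break
        if hold < len(buf):
            yield buf[: len(buf) - hold]
            buf = buf[len(buf) - hold :]
    if buf:  # stream ended without hitting a stop phrase
        yield buf
```

With chunks `["hel", "lo, ", "yes", " more"]` and stop phrase `"hello, yes"`, nothing is emitted, because the buffered prefix match prevents `"hel"` from being sent before the full phrase is confirmed.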
I run this on GPUs: 2 × A30, with CUDA driver 535.104.12. The docker image is built using `make -C docker release_build CUDA_ARCHS="80-real"`. I use the latest code in branch...
### System Info
TensorRT-LLM rel 0.9.0
### Who can help?
@Tracin
### Information
- [X] The official example scripts
- [ ] My own modified scripts
### Tasks
- [X]...
Bumps [gradio](https://github.com/gradio-app/gradio) from 3.40.1 to 4.19.2.

Release notes, sourced from gradio's releases:
- @gradio/model3d@0.10.4 — dependency updates: @gradio/client@0.19.3, @gradio/statustracker@0.5.4, @gradio/upload@0.10.4
- @gradio/model3d@0.10.3 — dependency updates: @gradio/upload@0.10.3, @gradio/client@0.19.2
- @gradio/model3d@0.10.1 — fixes #8252 22df61a - Client node...
Is it possible to increase the number of tokens sent per chunk during streaming, and if so, how? This could also be done with triton-inference-server.
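Absent a dedicated knob, one generic approach is to coalesce the backend's token-by-token stream into larger chunks on the serving side before sending each message to the client. The sketch below is a hypothetical server-side helper, not a TensorRT-LLM or Triton API: it wraps whatever per-token generator the backend exposes.

```python
def coalesce_stream(token_iter, chunk_size=8):
    """Group individual streamed tokens into chunks of `chunk_size`.

    Fewer, larger messages reduce per-message framing and network
    overhead at the cost of slightly higher perceived latency.
    """
    batch = []
    for tok in token_iter:
        batch.append(tok)
        if len(batch) >= chunk_size:
            yield "".join(batch)
            batch = []
    if batch:  # flush any trailing partial chunk
        yield "".join(batch)
```

With Triton's decoupled (streaming) mode, the same idea applies: accumulate several per-token responses in the model or a proxy and send one combined response per chunk.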
### System Info
GPU name (NVIDIA A6000)
TensorRT-LLM version (v0.9.0 main)
transformers version (0.41.0)
### Who can help?
@nc
### Information
- [X] The official example scripts
- [X] My...
