
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficientl...

Results 937 TensorRT-LLM issues

[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024050700 [TensorRT-LLM][INFO] Engine version 0.10.0.dev2024050700 found in the config file, assuming engine(s) built by new builder API. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found [TensorRT-LLM][WARNING] Optional value for...

triaged
need more info

### System Info GPUs: A100, 4 GPUs (40 GB memory) Release: tensorrt-llm 0.9.0 ### Who can help? @Tracin ### Information - [X] The official example scripts - [ ] My...

question
triaged
not a bug

I use GenerationExecutorWorker for a web service, passing stop_words_list = [["hello, yes"]] by modifying the as_inference_request function in executor.py. The ir parameter looks like this: ![image](https://github.com/NVIDIA/TensorRT-LLM/assets/99712469/15256616-a4d2-4d2a-8419-1fa9b0835d63) It then failed.
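For context on what the nested-list stop_words_list format is expected to do, here is a minimal plain-Python sketch of stop-sequence semantics: generation output is truncated at the first occurrence of any configured stop string. The apply_stop_words helper is hypothetical and illustrative only, not TensorRT-LLM's internal implementation.

```python
def apply_stop_words(text, stop_words_list):
    """Truncate `text` at the first occurrence of any stop sequence.

    `stop_words_list` mirrors the nested-list shape from the snippet above:
    one inner list of stop strings per request, e.g. [["hello, yes"]].
    Returns (possibly truncated text, whether a stop sequence was hit).
    """
    for stop in stop_words_list[0]:      # stop strings for request 0
        idx = text.find(stop)
        if idx != -1:
            return text[:idx], True      # truncated at the stop sequence
    return text, False                   # unchanged, no stop sequence hit

print(apply_stop_words("greeting: hello, yes indeed", [["hello, yes"]]))
# → ('greeting: ', True)
```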

triaged
need more info

I run this on GPU: 2 * A30 with CUDA driver 535.104.12. The docker image is built using `make -C docker release_build CUDA_ARCHS="80-real"` I use the latest code in branch...

### System Info Tensorrt-LLM rel 0.9.0 ### Who can help? @Tracin ### Information - [X] The official example scripts - [ ] My own modified scripts ### Tasks - [X]...

bug
triaged

Bumps [gradio](https://github.com/gradio-app/gradio) from 3.40.1 to 4.19.2. Release notes sourced from gradio's releases. @gradio/model3d@0.10.4 Dependency updates @gradio/client@0.19.3 @gradio/statustracker@0.5.4 @gradio/upload@0.10.4 @gradio/model3d@0.10.3 Dependency updates @gradio/upload@0.10.3 @gradio/client@0.19.2 @gradio/model3d@0.10.1 Fixes #8252 22df61a - Client node...

dependencies

Is it possible to increase the number of tokens sent per chunk during streaming, and how would one do so? This could also apply when serving via triton-inference-server.
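Absent a built-in knob for this, one client-side workaround is to re-buffer the streamed tokens before forwarding them downstream. A minimal sketch under that assumption; the rechunk helper is hypothetical, not a TensorRT-LLM or Triton API:

```python
def rechunk(token_stream, chunk_size=8):
    """Buffer individually streamed tokens and re-emit them in larger chunks."""
    buf = []
    for tok in token_stream:
        buf.append(tok)
        if len(buf) >= chunk_size:
            yield "".join(buf)   # emit a full chunk
            buf = []
    if buf:                      # flush the trailing partial chunk
        yield "".join(buf)

chunks = list(rechunk(iter("abcdefghij"), chunk_size=4))
# → ['abcd', 'efgh', 'ij']
```

The same buffering pattern applies regardless of whether the per-token stream comes from the Python API or from a Triton decoupled-response client.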

question
triaged

### System Info GPU name (NVIDIA A6000) TensorRT-LLM tag (v0.9.0 main) transformers tag (0.41.0) ### Who can help? @nc ### Information - [X] The official example scripts - [X] My...

bug

![094c99ee1cd6bcfd56a550c1a68d80c2](https://github.com/NVIDIA/TensorRT-LLM/assets/57712520/4cb57a97-bee3-4bc6-ab09-e6779f0fda76)