TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficientl...
```
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024050700
[TensorRT-LLM][INFO] Engine version 0.10.0.dev2024050700 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found
[TensorRT-LLM][WARNING] Optional value for...
```
### System Info
GPUs: A100, 4 GPUs (40 GB memory)
Release: tensorrt-llm 0.9.0
### Who can help?
@Tracin
### Information
- [X] The official example scripts
- [ ] My...
I use GenerationExecutorWorker for a web service and set the parameter stop_words_list = [["hello, yes"]] by modifying the as_inference_request function in executor.py as follows. The ir parameter is as follows. It then fails.
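Multi-token stop phrases like "hello, yes" are tricky in streaming because the phrase can be split across emitted chunks. The sketch below is illustrative only, not the TensorRT-LLM executor API: it shows one way a server loop can buffer decoded text so a stop phrase is detected and suppressed even when it arrives in pieces. All names here (`stream_with_stop`, `pieces`) are hypothetical.

```python
def stream_with_stop(pieces, stop_phrases):
    """Yield decoded text chunks, stopping at the first stop phrase.

    Buffers any trailing text that could still be the beginning of a
    stop phrase, so partial matches are never emitted to the client.
    """
    buf = ""
    for piece in pieces:
        buf += piece
        # Full match: emit everything before the stop phrase and halt.
        for stop in stop_phrases:
            idx = buf.find(stop)
            if idx != -1:
                if idx:
                    yield buf[:idx]
                return
        # Hold back any suffix of buf that is a prefix of a stop phrase,
        # since the rest of the phrase may arrive in the next chunk.
        hold = 0
        for stop in stop_phrases:
            for k in range(min(len(stop) - 1, len(buf)), 0, -1):
                if buf.endswith(stop[:k]):
                    hold = max(hold, k)
                    break
        if hold < len(buf):
            yield buf[: len(buf) - hold]
            buf = buf[len(buf) - hold :]
    if buf:  # stream ended without hitting a stop phrase
        yield buf
```

With chunks `["hel", "lo, ", "yes", " more"]` and stop phrase `"hello, yes"`, nothing is emitted, because the buffered prefix match prevents `"hel"` from being sent before the full phrase is confirmed.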
I run this on GPUs: 2 × A30, with CUDA driver 535.104.12. The docker image is built using `make -C docker release_build CUDA_ARCHS="80-real"`. I use the latest code in branch...
### System Info
TensorRT-LLM rel 0.9.0
### Who can help?
@Tracin
### Information
- [X] The official example scripts
- [ ] My own modified scripts
### Tasks
- [X]...
Bumps [gradio](https://github.com/gradio-app/gradio) from 3.40.1 to 4.19.2.

Release notes, sourced from gradio's releases:
- @gradio/model3d@0.10.4 — dependency updates: @gradio/client@0.19.3, @gradio/statustracker@0.5.4, @gradio/upload@0.10.4
- @gradio/model3d@0.10.3 — dependency updates: @gradio/upload@0.10.3, @gradio/client@0.19.2
- @gradio/model3d@0.10.1 — fixes #8252 22df61a - Client node...
Is it possible to increase the number of tokens sent per chunk during streaming, and if so, how? This could also be done with triton-inference-server.
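Absent a dedicated knob, one generic approach is to coalesce the backend's token-by-token stream into larger chunks on the serving side before sending each message to the client. The sketch below is a hypothetical server-side helper, not a TensorRT-LLM or Triton API: it wraps whatever per-token generator the backend exposes.

```python
def coalesce_stream(token_iter, chunk_size=8):
    """Group individual streamed tokens into chunks of `chunk_size`.

    Fewer, larger messages reduce per-message framing and network
    overhead at the cost of slightly higher perceived latency.
    """
    batch = []
    for tok in token_iter:
        batch.append(tok)
        if len(batch) >= chunk_size:
            yield "".join(batch)
            batch = []
    if batch:  # flush any trailing partial chunk
        yield "".join(batch)
```

With Triton's decoupled (streaming) mode, the same idea applies: accumulate several per-token responses in the model or a proxy and send one combined response per chunk.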
### System Info
GPU name (NVIDIA A6000)
TensorRT-LLM version (v0.9.0 main)
transformers version (0.41.0)
### Who can help?
@nc
### Information
- [X] The official example scripts
- [X] My...
