TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficientl...
### System Info
- x86_64
- NVIDIA H20 - 96GB
- TensorRT-LLM version: 0.11.0.dev2024051400

### Who can help?
@Tracin

### Information
- [X] The official example scripts
- [ ] ...
### System Info
- Nvidia A40
- CUDA 12.2
- TensorRT 10.0.1.6
- TensorRT-LLM 0.10.0.dev2024050700

### Who can help?
@byshiue

### Information
- [X] The official example scripts
- [X] ...
I built the engines for the T5 model with the following script on the latest version of TensorRT-LLM:
```
export MODEL_DIR="path_to_t5_model"  # or "flan-t5-small"
export MODEL_NAME="t5model"
export MODEL_TYPE="t5"
export INFERENCE_PRECISION="float16"
export ...
```
### System Info
- tensorrt 10.0.1
- tensorrt-cu12 10.0.1
- tensorrt-cu12-bindings 10.0.1
- tensorrt-cu12-libs 10.0.1
- tensorrt-llm 0.10.0.dev2024050700
- A100 40G

### Who can help?
@byshiue

### Information
- [X] The official example scripts
- [...
I need to benchmark several different models, but they are not listed in `allowed_configs.py`. How can I do this? Thanks
Hello, I deployed the model following examples/qwenvl/README.md, but the inference result from running run.py was incorrect. What could be the problem? > Input: "[{'image': './pics/demo.jpeg'}, {'text': 'Describe the picture'}]"...
The Executor API introduces Leader and Orchestrator modes. Leader mode works via MPI. How is Orchestrator mode implemented? Does it use MPI itself? Which mode is preferable for performance: Leader or Orchestrator?
Could you share a rough timeline for FP8 quantization support for the Mixtral (MoE) model? cc: @Tracin
Following up on CogVLM, CogVLM2 is here: https://github.com/THUDM/CogVLM2 It is easily one of the best open-source multimodal models, competitive with GPT-4 and Gemini: https://github.com/THUDM/CogVLM2?tab=readme-ov-file#benchmark The community would be grateful for...
Hi, I would like to know when to use `RowLinear` versus `ColumnLinear`. I see them used in conjunction in `mlp.py` and `attention.py`, and I'm finding it difficult to know what's...
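For readers hitting the same question: `ColumnLinear` and `RowLinear` follow the Megatron-style tensor-parallel pattern, where a column-sharded layer feeds a row-sharded layer so that only one all-reduce is needed per block. Below is a minimal single-process NumPy sketch of that pattern; the shapes, variable names, and the "two ranks via `np.split`" setup are illustrative assumptions, not TensorRT-LLM's actual implementation.

```python
import numpy as np

# Single-process sketch of column-parallel -> row-parallel composition.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))       # [batch, hidden]
W1 = rng.standard_normal((8, 16))     # MLP up-projection
W2 = rng.standard_normal((16, 8))     # MLP down-projection

# Reference: unsharded forward pass (ReLU stands in for the activation).
ref = np.maximum(x @ W1, 0) @ W2

# Simulate two tensor-parallel ranks:
# a ColumnLinear shards its weight along the *output* dimension,
# a RowLinear shards its weight along the *input* dimension.
W1_shards = np.split(W1, 2, axis=1)   # each rank holds an [8, 8] slice
W2_shards = np.split(W2, 2, axis=0)   # each rank holds an [8, 8] slice

# Each rank computes its partial result with no communication at all,
# because the elementwise activation acts on disjoint column slices...
partials = [np.maximum(x @ w1, 0) @ w2
            for w1, w2 in zip(W1_shards, W2_shards)]

# ...and a single all-reduce (here just a sum) recovers the full output.
out = sum(partials)

assert np.allclose(out, ref)
```

This is why the two appear paired in `mlp.py` and `attention.py`: the column-parallel layer's sharded output is exactly the input layout the row-parallel layer expects, so the intermediate activation never needs to be gathered.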