TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficientl...
### System Info
- x86_64
- NVIDIA H20 - 96GB
- TensorRT-LLM version: 0.11.0.dev2024051400

### Who can help?
@Tracin

### Information
- [X] The official example scripts
- [ ] ...
### System Info
- Nvidia A40
- CUDA 12.2
- TensorRT 10.0.1.6
- TensorRT-LLM 0.10.0.dev2024050700

### Who can help?
@byshiue

### Information
- [X] The official example scripts
- [X] ...
I built the engines for the T5 model with the following script on the latest version of TensorRT-LLM:
```
export MODEL_DIR="path_to_t5_model"  # or "flan-t5-small"
export MODEL_NAME="t5model"
export MODEL_TYPE="t5"
export INFERENCE_PRECISION="float16"
export ...
```
### System Info
- tensorrt 10.0.1
- tensorrt-cu12 10.0.1
- tensorrt-cu12-bindings 10.0.1
- tensorrt-cu12-libs 10.0.1
- tensorrt-llm 0.10.0.dev2024050700
- A100 40G

### Who can help?
@byshiue

### Information
- [X] The official example scripts
- [...
I need to benchmark several different models, but they are not listed in `allowed_configs.py`. How can I do this? Thanks
Hello, I deployed the model following examples/qwenvl/README.md, but the inference result from running run.py was incorrect. What could be the problem? > Input: "[{'image': './pics/demo.jpeg'}, {'text': 'Describe the picture'}]"...
The Executor API introduces Leader and Orchestrator modes. Leader mode works via MPI. How is Orchestrator mode implemented? Does it use MPI itself? Which mode is preferable for performance: Leader or Orchestrator?
Could you share a rough timeline for FP8 quantization support for the Mixtral (MoE) model? cc: @Tracin
Following up on CogVLM, CogVLM2 is here: https://github.com/THUDM/CogVLM2 It is easily one of the best open-source multimodal models, competitive with GPT-4 and Gemini: https://github.com/THUDM/CogVLM2?tab=readme-ov-file#benchmark The community would be grateful for...
Hi, I would like to know when to use `RowLinear` versus `ColumnLinear`. I see them used in conjunction in `mlp.py` and `attention.py`, and I'm finding it difficult to know what's...
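For readers hitting the same question: `ColumnLinear` and `RowLinear` follow the Megatron-style tensor-parallel pattern, where a column-sharded layer feeds a row-sharded layer so that only one all-reduce is needed per block. Below is a minimal single-process NumPy sketch of that pattern; the shapes, variable names, and the "two ranks via `np.split`" setup are illustrative assumptions, not TensorRT-LLM's actual implementation.

```python
import numpy as np

# Single-process sketch of column-parallel -> row-parallel composition.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))       # [batch, hidden]
W1 = rng.standard_normal((8, 16))     # MLP up-projection
W2 = rng.standard_normal((16, 8))     # MLP down-projection

# Reference: unsharded forward pass (ReLU stands in for the activation).
ref = np.maximum(x @ W1, 0) @ W2

# Simulate two tensor-parallel ranks:
# a ColumnLinear shards its weight along the *output* dimension,
# a RowLinear shards its weight along the *input* dimension.
W1_shards = np.split(W1, 2, axis=1)   # each rank holds an [8, 8] slice
W2_shards = np.split(W2, 2, axis=0)   # each rank holds an [8, 8] slice

# Each rank computes its partial result with no communication at all,
# because the elementwise activation acts on disjoint column slices...
partials = [np.maximum(x @ w1, 0) @ w2
            for w1, w2 in zip(W1_shards, W2_shards)]

# ...and a single all-reduce (here just a sum) recovers the full output.
out = sum(partials)

assert np.allclose(out, ref)
```

This is why the two appear paired in `mlp.py` and `attention.py`: the column-parallel layer's sharded output is exactly the input layout the row-parallel layer expects, so the intermediate activation never needs to be gathered.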