Llama int4 AWQ result is wrong!
System Info
- GPU: L40S
- TensorRT-LLM: 0.11.0.dev2024060400
- CUDA: cuda_12.4.r12.4/compiler.34097967_0
- Driver: 535.129.03
- OS: DISTRIB_DESCRIPTION="Ubuntu 22.04.4 LTS" (Docker)
Who can help?
No response
Information
- [x] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [x] My own task or dataset (give details below)
Reproduction
```
# my mm-projector needs 576*2
MAX_MULTIMODAL_LEN=$((576 * 2 * BATCH_SIZE))
function llama_to_engine(){
model_name=$1
if [ ! -d ${CHECKPOINT_PATH} ];then
python $convert --model_dir $HF_MODEL \
--output_dir $CHECKPOINT_PATH \
--dtype $dtype \
--tp_size 1
fi
if [ ! -d ${ENGINE_PATH} ];then
message "尝试转换checkpoint为TensorRT Engine"
trtllm-build --checkpoint_dir $CHECKPOINT_PATH \
--output_dir $ENGINE_PATH \
--enable_debug_output \
--use_fused_mlp \
--max_batch_size $BATCH_SIZE \
--max_input_len 2048 \
--max_output_len 512 \
--gather_all_token_logits \
--max_multimodal_len $MAX_MULTIMODAL_LEN \
--gemm_plugin $dtype
fi
}
function llama_to_quant_engine(){
model_name=$1
checkpoint_dir=$2
engine_dir=$3
if [ ! -d ${checkpoint_dir} ];then
message "Try to extract $model_name to $checkpoint_dir"
python $quantize --model_dir $model_name \
--output_dir $checkpoint_dir \
--dtype $dtype \
--qformat int4_awq \
# --qformat w4a8_awq \
--awq_block_size 128 \
--calib_size 32
fi
if [ ! -d ${engine_dir} ];then
message "Try to convert $checkpoint_dir to $engine_dir"
trtllm-build --checkpoint_dir $checkpoint_dir \
--output_dir $engine_dir \
--gpt_attention_plugin $dtype \
--gemm_plugin $dtype \
--max_batch_size 8 \
--max_input_len 2048 \
--paged_kv_cache enable
else
message "${engine_dir} 已经存在"
fi
}
# Inference is similar to LLaVA
```
I trained a new LLaVA-style model using a ViT, Llama 3, and my own mm-projector. I found that after converting to TensorRT, the float16 output was almost identical to the HF model, but after quantization the results were completely wrong. I ran some metrics on the evaluation set, and the score dropped from 80% to 20%. I want to know what caused this difference.
image: https://github.com/360CVGroup/360VL/blob/master/docs/008.jpg
prompt: 请描述一下图片的内容 ("Please describe the content of the image")
float16 (correct result):
```
This photo features a young Asian male character, possibly from an anime or manga, wearing formal attire that includes a blue jacket, a white shirt, and a red bow tie. His hair is black and combed back, and he wears black-framed glasses. The character has a prominent nose and a tense expression, shown by his wide-open eyes and slightly parted mouth. His eyes are blue and his eyebrows are raised, giving a surprised or concerned look. The background is blurred
```
w4a8_awq (wrong):
```
This photo depicts a vibrant cityscape, with a towering Ferris wheel and modern buildings dominating the sky. Along the skyline, various vehicles such as planes, trains, and cars carry people around. Below, on the bustling streets, people are busy with all kinds of activities.
```
int4_awq (wrong):
```
This photo shows a tranquil lake with a calm surface, surrounded by green trees.
```
My Questions:
1. I noticed that the dataset used for quantization calibration is cnn_dailymail, and I am not sure whether it is reasonable to use it to calibrate a multimodal model. If it is not, how should we construct a calibration dataset for the LLM?
2. After quantizing the LLM to 4 bits with AutoAWQ, I found that the outputs for the same prompt before and after quantization are almost identical. Is there any way to convert AutoAWQ's model into an engine?
3. Are TensorRT-LLM's AWQ algorithm and the AutoAWQ algorithm exactly the same?
@Barry-Delaney could you please take a look at this issue?
@bleedingfight thanks for the feedback. I think one possible reason is an unsuitable calibration dataset for multimodal tasks.
Regarding your questions:
A1. To use your own calibration dataset, please pass its directory through the `calib_dataset` argument (see the sketch after A3).
A2. To support AutoAWQ-produced checkpoints, please refer to the GPTQ support in the `load_weights_from_gptq` function, as the two formats are supposed to be similar.
A3. I think this question is best left to the ModelOpt team. cc @nv-guomingz.
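For A1, here is a rough sketch of what that could look like, reusing the variables from the reproduction script above (`./my_calib_data` is just a placeholder, and the exact behavior of `--calib_dataset` should be double-checked against the `quantize.py` in your TensorRT-LLM version):
```
python $quantize --model_dir $model_name \
--output_dir $checkpoint_dir \
--dtype $dtype \
--qformat int4_awq \
--awq_block_size 128 \
--calib_size 32 \
--calib_dataset ./my_calib_data   # your own calibration text instead of the default cnn_dailymail
```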
@Barry-Delaney What I want to know is how to construct this dataset. My training data consists of image-text pairs, but I only quantize the LLM, so in theory I should only calibrate the LLM. Should my calibration data be the embedding inputs fed to Llama, or the text portion extracted from the image-text pairs?
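If it is the latter, I imagine something like the sketch below: dump only the text side of my image-text pairs into a folder and pass that folder as the calibration directory (the file names and JSON layout here are hypothetical):
```
# Hypothetical layout: pairs.json is assumed to look like
# [{"image": "001.jpg", "text": "..."}, {"image": "002.jpg", "text": "..."}, ...]
mkdir -p my_calib_data
# keep only the text side of each image-text pair as calibration samples
jq -r '.[].text' pairs.json > my_calib_data/calib.txt
# then pass my_calib_data through --calib_dataset as in the snippet above
```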
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
I also face the same issue. The reason may be as follows:
https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/llm_ptq/README.md#model-support-list
The documentation moved to https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/examples/llm_ptq/README.md
I'll mark this as "waiting for feedback" so it can be automatically marked as stale if no feedback is received within 14 days. Simply leaving any comment will prevent the stale process from happening.
> I also face the same issue. The reason may be as follows:
> https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/llm_ptq/README.md#model-support-list
Thanks for your reply. I have already used vLLM with AutoAWQ.
@bleedingfight , thank you for the update. Just to confirm my understanding: after using vLLM with AutoAWQ and the same models, you’re no longer seeing the issue you reported with TensorRT-LLM, is that correct?
Yes. The AutoAWQ algorithm with vLLM is even faster: https://github.com/NVIDIA/TensorRT-LLM/issues/1123
Thanks for confirming it!
Issue has not received an update in over 14 days. Adding stale label.
This issue was closed because it has been 14 days without activity since it has been marked as stale.