
Qwen-VL-Chat ViT embedding diff

Open bnuzhanyu opened this issue 10 months ago • 7 comments

Problem

For the same input image, I get a different output from the visual embedding, and this makes the result slightly worse than the original model.

Environment

tensorrt-llm 0.9.0, GPU: A10, model: https://modelscope.cn/models/qwen/Qwen-VL-Chat/summary

Qwen-VL-Chat build command

MAX_BATCH_SIZE=8
HF_MODEL_DIR=$MODEL_ROOT_DIR/Qwen-VL-Chat
ONNX_FILE=$MODEL_ROOT_DIR/visual_encoder/visual_encoder.onnx
PLAN_FILE=$MODEL_ROOT_DIR/plan/visual_encoder/visual_encoder_fp16.plan
CHECKPOINT_DIR=$MODEL_ROOT_DIR/qwen_vl_trt/checkpoint
ENGINE_DIR=$MODEL_ROOT_DIR/qwen_vl_trt/engine
MAX_INPUT_LEN=2048
MAX_OUTPUT_LEN=1024
MAX_PROMPT_EMBEDDING_TABLE_SIZE=$((MAX_BATCH_SIZE * 256))

export CUDA_VISIBLE_DEVICES=1
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

if [ ! -f $PLAN_FILE ]; then
    CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES python3 vit_onnx_trt.py --pretrained_model_path $HF_MODEL_DIR \
                --onnxFile $ONNX_FILE --planFile $PLAN_FILE --maxBS $MAX_BATCH_SIZE
fi

CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES python3 ../qwen/convert_checkpoint.py --model_dir=$HF_MODEL_DIR --output_dir=$CHECKPOINT_DIR

CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES trtllm-build --checkpoint_dir=$CHECKPOINT_DIR \
             --gemm_plugin=float16 --gpt_attention_plugin=float16 \
             --lookup_plugin=float16 --max_input_len=$MAX_INPUT_LEN --max_output_len=$MAX_OUTPUT_LEN \
             --max_batch_size=$MAX_BATCH_SIZE --max_prompt_embedding_table_size=$MAX_PROMPT_EMBEDDING_TABLE_SIZE \
             --remove_input_padding=enable \
             --output_dir=$ENGINE_DIR
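
A quick way to confirm which precision the generated ViT plan actually carries is to deserialize it and print its I/O tensor dtypes. A minimal sketch, assuming the PLAN_FILE path from the script above and the TRT >= 8.5 Python API:

# Sketch: inspect the ViT plan's I/O tensors (path taken from the build script above).
import tensorrt as trt

PLAN_FILE = "plan/visual_encoder/visual_encoder_fp16.plan"

logger = trt.Logger(trt.Logger.WARNING)
with open(PLAN_FILE, "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    print(name, engine.get_tensor_mode(name),
          engine.get_tensor_dtype(name), engine.get_tensor_shape(name))

With the FP16 plan above, the output tensor should report DataType.HALF, which is why the runtime snippet below receives an FP16 embedding.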

TensorRT-LLM code to print the input image and the output embedding:

stream = torch.cuda.current_stream().cuda_stream
image_npy = self.image_preproc.encode([image_path])  # preprocess the image into a tensor
images = torch.cat([image_npy]).to(self.device)  # [bs, 3, 448, 448]
batch_size = images.size(0)
images = images.expand(batch_size, -1, -1, -1).contiguous()
print(images.shape)
print(images)
# Feed FP32 input to the ViT engine; the engine was built as FP16, so the output is FP16.
visual_inputs = {'input': images.float()}
visual_output_info = self.vit.infer_shapes(
    [TensorInfo('input', trt.DataType.FLOAT, images.shape)])
# Allocate output buffers with the dtypes/shapes reported by the engine.
visual_outputs = {
    t.name: torch.empty(tuple(t.shape),
                        dtype=trt_dtype_to_torch(t.dtype),
                        device='cuda')
    for t in visual_output_info
}
ok = self.vit.run(visual_inputs, visual_outputs, stream)
assert ok, "Runtime execution failed for vit session"
image_embeds = visual_outputs['output']  # [bs, 256, 4096]
print(image_embeds)

Different outputs using examples/qwenvl/pics/demo.jpeg

TensorRT-LLM output

# image
torch.Size([1, 3, 448, 448])
tensor([[[[ 0.8647,  0.9084,  0.9230,  ...,  1.7552,  1.7552,  1.7552],
          [ 0.8792,  0.9376,  0.9376,  ...,  1.7552,  1.7552,  1.7552],
          [ 0.9230,  0.9230,  0.9376,  ...,  1.7552,  1.7552,  1.7552],
          ...,
          [-0.7704, -0.7704, -0.7412,  ..., -0.2886, -0.3178, -0.3908],
          [-0.7558, -0.7558, -0.7558,  ..., -0.3470, -0.4054, -0.4492],
          [-0.7558, -0.7558, -0.7704,  ..., -0.4054, -0.4492, -0.4930]],

         [[ 1.2194,  1.2495,  1.2645,  ...,  1.8948,  1.8948,  1.8948],
          [ 1.2344,  1.2495,  1.2795,  ...,  1.8948,  1.8948,  1.8948],
          [ 1.2344,  1.2645,  1.2945,  ...,  1.8948,  1.8948,  1.8948],
          ...,
          [-0.5815, -0.5815, -0.5515,  ..., -0.3564, -0.3714, -0.4464],
          [-0.5665, -0.5665, -0.5515,  ..., -0.3864, -0.4614, -0.5065],
          [-0.5665, -0.5815, -0.6115,  ..., -0.4464, -0.4914, -0.5515]],

         [[ 1.2927,  1.3211,  1.3354,  ...,  1.9753,  1.9753,  1.9753],
          [ 1.3069,  1.3354,  1.3496,  ...,  1.9753,  1.9753,  1.9753],
          [ 1.3211,  1.3354,  1.3638,  ...,  1.9753,  1.9753,  1.9753],
          ...,
          [-0.3426, -0.3284, -0.3000,  ..., -0.2146, -0.2431, -0.3284],
          [-0.3142, -0.3142, -0.2857,  ..., -0.2573, -0.3284, -0.3711],
          [-0.3142, -0.3284, -0.3568,  ..., -0.3284, -0.3711, -0.4137]]]],
       device='cuda:0')


# embedding
torch.Size([1, 256, 4096])
tensor([[[ 2.2480, -1.3076, -0.6943,  ..., -0.2272, -3.1777, -1.1953],
         [ 3.5098,  0.3196,  1.2432,  ...,  1.5215, -1.5166, -1.1787],
         [ 1.5010, -2.9395, -1.0654,  ...,  3.9258, -0.6914, -0.2371],
         ...,
         [ 0.5205, -2.0645, -0.0531,  ...,  3.9160,  0.8760,  6.0273],
         [ 1.4053, -2.9629, -0.0939,  ...,  1.6025,  1.9092,  1.5703],
         [-1.9521, -2.8320, -2.5430,  ...,  5.6758,  0.3870,  3.1934]]],
       device='cuda:0', dtype=torch.float16)

Qwen-VL-Chat ModelScope

# image
torch.Size([1, 3, 448, 448])
tensor([[[[ 0.8647,  0.9084,  0.9230,  ...,  1.7552,  1.7552,  1.7552],
          [ 0.8792,  0.9376,  0.9376,  ...,  1.7552,  1.7552,  1.7552],
          [ 0.9230,  0.9230,  0.9376,  ...,  1.7552,  1.7552,  1.7552],
          ...,
          [-0.7704, -0.7704, -0.7412,  ..., -0.2886, -0.3178, -0.3908],
          [-0.7558, -0.7558, -0.7558,  ..., -0.3470, -0.4054, -0.4492],
          [-0.7558, -0.7558, -0.7704,  ..., -0.4054, -0.4492, -0.4930]],

         [[ 1.2194,  1.2495,  1.2645,  ...,  1.8948,  1.8948,  1.8948],
          [ 1.2344,  1.2495,  1.2795,  ...,  1.8948,  1.8948,  1.8948],
          [ 1.2344,  1.2645,  1.2945,  ...,  1.8948,  1.8948,  1.8948],
          ...,
          [-0.5815, -0.5815, -0.5515,  ..., -0.3564, -0.3714, -0.4464],
          [-0.5665, -0.5665, -0.5515,  ..., -0.3864, -0.4614, -0.5065],
          [-0.5665, -0.5815, -0.6115,  ..., -0.4464, -0.4914, -0.5515]],

         [[ 1.2927,  1.3211,  1.3354,  ...,  1.9753,  1.9753,  1.9753],
          [ 1.3069,  1.3354,  1.3496,  ...,  1.9753,  1.9753,  1.9753],
          [ 1.3211,  1.3354,  1.3638,  ...,  1.9753,  1.9753,  1.9753],
          ...,
          [-0.3426, -0.3284, -0.3000,  ..., -0.2146, -0.2431, -0.3284],
          [-0.3142, -0.3142, -0.2857,  ..., -0.2573, -0.3284, -0.3711],
          [-0.3142, -0.3284, -0.3568,  ..., -0.3284, -0.3711, -0.4137]]]])

# embedding
torch.Size([1, 256, 4096])
tensor([[[ 0.5669,  2.1602,  0.5522,  ...,  1.6719, -1.1719,  0.6343],
         [-0.2158, -1.9053, -1.3213,  ..., -0.2773, -1.0303, -1.5508],
         [ 0.4644,  0.0384, -2.0176,  ...,  2.0605,  0.4480, -1.5918],
         ...,
         [ 0.0100, -1.0996,  0.6797,  ...,  6.4961, -1.7705,  4.5273],
         [-1.6621, -2.1875,  0.3442,  ...,  2.1309,  1.2607,  3.2891],
         [-2.6719, -2.6094, -2.9102,  ...,  2.0137,  2.4043,  0.7583]]],

The input images are identical, but the output embeddings are not close.
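
For anyone reproducing this, a small sketch for quantifying the gap between the two tensors printed above (the variable names trt_embeds and hf_embeds are hypothetical placeholders for the TensorRT-LLM and ModelScope embeddings):

# Sketch: compare the two [1, 256, 4096] embeddings numerically.
import torch
import torch.nn.functional as F

a = trt_embeds.float()              # TensorRT-LLM ViT output
b = hf_embeds.float().to(a.device)  # original model's visual.encode output

max_abs_diff = (a - b).abs().max().item()
# Per-token cosine similarity over the 256 visual tokens (4096-dim each).
cos = F.cosine_similarity(a.flatten(0, 1), b.flatten(0, 1), dim=-1)
print(f"max abs diff: {max_abs_diff:.4f}, "
      f"mean cos sim: {cos.mean().item():.4f}, min cos sim: {cos.min().item():.4f}")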

bnuzhanyu avatar Apr 22 '24 11:04 bnuzhanyu

Hi @bnuzhanyu, you got the "TensorRT-LLM output" from the code under "TensorRT-LLM code to print the input image and the output embedding", right? And how did you get the "Qwen-VL-Chat ModelScope" results? If the ViT engine and the input are the same, the results are expected to be the same.

sunnyqgg avatar Apr 29 '24 02:04 sunnyqgg

Hi @bnuzhanyu, you got the "TensorRT-LLM output" from the code under "TensorRT-LLM code to print the input image and the output embedding", right? And how did you get the "Qwen-VL-Chat ModelScope" results? If the ViT engine and the input are the same, the results are expected to be the same.

  1. Yes, I modified the TensorRT-LLM source to get the "TensorRT-LLM output".
  2. I used the code at https://modelscope.cn/models/qwen/Qwen-VL-Chat/file/view/master?fileName=modeling_qwen.py&status=1
# line 565
images = self.visual.encode(images)

to get the "Qwen-VL-Chat ModelScope" results.
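
For completeness, a minimal sketch of obtaining that reference embedding from the released checkpoint (the attribute path model.transformer.visual.encode follows the modeling_qwen.py linked above; the local model path and the fp16 flag are assumptions):

# Sketch: reference ViT embedding from the original Qwen-VL-Chat checkpoint.
import torch
from transformers import AutoModelForCausalLM

MODEL_DIR = "Qwen-VL-Chat"  # hypothetical local path, e.g. the HF_MODEL_DIR from the build script
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR, trust_remote_code=True, fp16=True, device_map="cuda:0").eval()

image_path = "examples/qwenvl/pics/demo.jpeg"
with torch.no_grad():
    # visual.encode takes a list of image paths and returns [bs, 256, 4096],
    # matching the call at line 565 of modeling_qwen.py quoted above.
    hf_embeds = model.transformer.visual.encode([image_path])
print(hf_embeds.shape, hf_embeds.dtype)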

bnuzhanyu avatar Apr 29 '24 13:04 bnuzhanyu

any update?

calico-niko avatar Jun 06 '24 05:06 calico-niko

Hi @calico-niko @bnuzhanyu, the ViT is offloaded to TRT, and its FP32 accuracy on TRT 9.3 is aligned with PyTorch. You can also move from TRT 9.3 to 10.x; the FP16 accuracy on TRT 10.x is fine.

sunnyqgg avatar Jun 06 '24 05:06 sunnyqgg

Hi @calico-niko @bnuzhanyu, the ViT is offloaded to TRT, and its FP32 accuracy on TRT 9.3 is aligned with PyTorch. You can also move from TRT 9.3 to 10.x; the FP16 accuracy on TRT 10.x is fine.

I updated TRT to 10.0.1 but got the same diff. I think the results vary a lot.

vit by trt: tensor([[[ 1.1836, -0.6230, 0.5547, ..., -0.2083, -1.2617, -3.3867], [ 0.4189, -0.3127, -0.5732, ..., 0.8232, 1.0723, 0.8164], [-0.4614, -0.0329, 0.7266, ..., 0.3604, -1.1826, -0.0151], ..., [-0.3298, 1.5420, 1.1074, ..., 1.4434, -0.7012, -2.1191], [ 0.2686, -0.4331, -2.0234, ..., -0.1218, -0.9346, -0.0122], [-1.3047, 0.8560, -2.2266, ..., -0.5923, 1.6758, 0.3738]]], device='cuda:0', dtype=torch.float16)

vit by visual.encode(images): tensor([[ 1.2646, -0.6035, 0.6665, ..., -0.1432, -1.2197, -3.3691], [ 0.2174, -0.3809, -1.2285, ..., 1.6143, 0.8530, 0.3674], [-0.4709, -0.0422, 0.7241, ..., 0.3362, -1.1611, -0.0426], ..., [-0.3340, 1.5625, 1.0889, ..., 1.5625, -0.6138, -2.0215], [ 0.2142, -0.3896, -2.0020, ..., -0.0946, -0.9102, 0.0299], [-1.3281, 0.9053, -2.2617, ..., -0.5645, 1.7812, 0.4031]], device='cuda:0', dtype=torch.float16)

hezeli123 avatar Jul 03 '24 08:07 hezeli123

@sunnyqgg Do you have any other suggestions?

byshiue avatar Jul 15 '24 09:07 byshiue

Hi @hezeli123, the diffs are smaller compared with TRT 9.x. Do the current ViT diffs have a big impact on the final results? If so, you can try running the ViT with FP32 precision.

sunnyqgg avatar Jul 15 '24 10:07 sunnyqgg

The current ViT diffs have a big impact, which results in many bad cases. I run the ViT with FP32 precision now.
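
For reference, a minimal sketch of building the ViT plan in FP32 with the TensorRT Python API (the ONNX path, the input tensor name 'input', and the 448x448 shape are assumptions taken from the build script and runtime snippet above; the repo's vit_onnx_trt.py may already expose its own precision option):

# Sketch: build an FP32 ViT engine from the exported ONNX.
import tensorrt as trt

ONNX_FILE = "visual_encoder/visual_encoder.onnx"
PLAN_FILE = "plan/visual_encoder/visual_encoder_fp32.plan"
MAX_BATCH_SIZE = 8

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
# EXPLICIT_BATCH is required on TRT 9.x and is deprecated (a no-op) on TRT 10.x.
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open(ONNX_FILE, "rb") as f:
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
# No trt.BuilderFlag.FP16 is set here, so the engine keeps FP32 precision.
profile = builder.create_optimization_profile()
profile.set_shape("input", (1, 3, 448, 448), (1, 3, 448, 448),
                  (MAX_BATCH_SIZE, 3, 448, 448))
config.add_optimization_profile(profile)

engine_bytes = builder.build_serialized_network(network, config)
with open(PLAN_FILE, "wb") as f:
    f.write(engine_bytes)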

hezeli123 avatar Jul 25 '24 07:07 hezeli123

OK, if you have a strong desire to use FP16, I'll continue to look at this issue; if not, it will have a lower priority.

sunnyqgg avatar Jul 26 '24 02:07 sunnyqgg

OK, if you have a strong desire to use FP16, I'll continue to look at this issue; if not, it will have a lower priority.

Hi @sunnyqgg. The results from trt-llm-0.12.0 with both FP16 and FP32 precision are significantly different from those of Qwen-VL at https://github.com/QwenLM/Qwen-VL.

trt-llm-0.12.0 with FP16: [image attachment]

trt-llm-0.12.0 with FP32: [image attachment]

https://github.com/QwenLM/Qwen-VL: [image attachment]

xiangxinhello avatar Sep 06 '24 09:09 xiangxinhello