TensorRT-LLM
Qwen-VL-Chat vit embedding diff
Problem
For the same input image, the visual embedding produced by TensorRT-LLM differs from the original model's output, and this makes the final results slightly worse than the original model.
Environment
TensorRT-LLM 0.9.0, GPU: A10, model: https://modelscope.cn/models/qwen/Qwen-VL-Chat/summary
Qwen-VL-Chat build command
MAX_BATCH_SIZE=8
HF_MODEL_DIR=$MODEL_ROOT_DIR/Qwen-VL-Chat
ONNX_FILE=$MODEL_ROOT_DIR/visual_encoder/visual_encoder.onnx
PLAN_FILE=$MODEL_ROOT_DIR/plan/visual_encoder/visual_encoder_fp16.plan
CHECKPOINT_DIR=$MODEL_ROOT_DIR/qwen_vl_trt/checkpoint
ENGINE_DIR=$MODEL_ROOT_DIR/qwen_vl_trt/engine
MAX_INPUT_LEN=2048
MAX_OUTPUT_LEN=1024
MAX_PROMPT_EMBEDDING_TABLE_SIZE=$((MAX_BATCH_SIZE * 256))
export CUDA_VISIBLE_DEVICES=1
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
if [ ! -f $PLAN_FILE ]; then
CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES python3 vit_onnx_trt.py --pretrained_model_path $HF_MODEL_DIR \
--onnxFile $ONNX_FILE --planFile $PLAN_FILE --maxBS $MAX_BATCH_SIZE
fi
CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES python3 ../qwen/convert_checkpoint.py --model_dir=$HF_MODEL_DIR --output_dir=$CHECKPOINT_DIR
CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES trtllm-build --checkpoint_dir=$CHECKPOINT_DIR \
--gemm_plugin=float16 --gpt_attention_plugin=float16 \
--lookup_plugin=float16 --max_input_len=$MAX_INPUT_LEN --max_output_len=$MAX_OUTPUT_LEN \
--max_batch_size=$MAX_BATCH_SIZE --max_prompt_embedding_table_size=$MAX_PROMPT_EMBEDDING_TABLE_SIZE \
--remove_input_padding=enable \
--output_dir=$ENGINE_DIR
TensorRT-LLM code to print the input image and output embedding:
# Excerpt from the modified runner: self.vit is a TensorRT-LLM Session wrapping the
# ViT plan, self.image_preproc applies the Qwen-VL image transform.
stream = torch.cuda.current_stream().cuda_stream
image_npy = self.image_preproc.encode([image_path])  # preprocess to [bs, 3, 448, 448]
images = torch.cat([image_npy]).to(self.device)
batch_size = images.size(0)
images = images.expand(batch_size, -1, -1, -1).contiguous()
print(images.shape)
print(images)

# Query output shapes, allocate output buffers, and run the ViT engine.
visual_inputs = {'input': images.float()}
visual_output_info = self.vit.infer_shapes(
    [TensorInfo('input', trt.DataType.FLOAT, images.shape)])
visual_outputs = {
    t.name: torch.empty(tuple(t.shape),
                        dtype=trt_dtype_to_torch(t.dtype),
                        device='cuda')
    for t in visual_output_info
}
ok = self.vit.run(visual_inputs, visual_outputs, stream)
assert ok, "Runtime execution failed for vit session"

image_embeds = visual_outputs['output']  # [bs, 256, 4096]
print(image_embeds)
Different outputs using examples/qwenvl/pics/demo.jpeg
TensorRT-LLM output
# image
torch.Size([1, 3, 448, 448])
tensor([[[[ 0.8647, 0.9084, 0.9230, ..., 1.7552, 1.7552, 1.7552],
[ 0.8792, 0.9376, 0.9376, ..., 1.7552, 1.7552, 1.7552],
[ 0.9230, 0.9230, 0.9376, ..., 1.7552, 1.7552, 1.7552],
...,
[-0.7704, -0.7704, -0.7412, ..., -0.2886, -0.3178, -0.3908],
[-0.7558, -0.7558, -0.7558, ..., -0.3470, -0.4054, -0.4492],
[-0.7558, -0.7558, -0.7704, ..., -0.4054, -0.4492, -0.4930]],
[[ 1.2194, 1.2495, 1.2645, ..., 1.8948, 1.8948, 1.8948],
[ 1.2344, 1.2495, 1.2795, ..., 1.8948, 1.8948, 1.8948],
[ 1.2344, 1.2645, 1.2945, ..., 1.8948, 1.8948, 1.8948],
...,
[-0.5815, -0.5815, -0.5515, ..., -0.3564, -0.3714, -0.4464],
[-0.5665, -0.5665, -0.5515, ..., -0.3864, -0.4614, -0.5065],
[-0.5665, -0.5815, -0.6115, ..., -0.4464, -0.4914, -0.5515]],
[[ 1.2927, 1.3211, 1.3354, ..., 1.9753, 1.9753, 1.9753],
[ 1.3069, 1.3354, 1.3496, ..., 1.9753, 1.9753, 1.9753],
[ 1.3211, 1.3354, 1.3638, ..., 1.9753, 1.9753, 1.9753],
...,
[-0.3426, -0.3284, -0.3000, ..., -0.2146, -0.2431, -0.3284],
[-0.3142, -0.3142, -0.2857, ..., -0.2573, -0.3284, -0.3711],
[-0.3142, -0.3284, -0.3568, ..., -0.3284, -0.3711, -0.4137]]]],
device='cuda:0')
# embedding
torch.Size([1, 256, 4096])
tensor([[[ 2.2480, -1.3076, -0.6943, ..., -0.2272, -3.1777, -1.1953],
[ 3.5098, 0.3196, 1.2432, ..., 1.5215, -1.5166, -1.1787],
[ 1.5010, -2.9395, -1.0654, ..., 3.9258, -0.6914, -0.2371],
...,
[ 0.5205, -2.0645, -0.0531, ..., 3.9160, 0.8760, 6.0273],
[ 1.4053, -2.9629, -0.0939, ..., 1.6025, 1.9092, 1.5703],
[-1.9521, -2.8320, -2.5430, ..., 5.6758, 0.3870, 3.1934]]],
device='cuda:0', dtype=torch.float16)
Qwen-VL-Chat ModelScope
# image
torch.Size([1, 3, 448, 448])
tensor([[[[ 0.8647, 0.9084, 0.9230, ..., 1.7552, 1.7552, 1.7552],
[ 0.8792, 0.9376, 0.9376, ..., 1.7552, 1.7552, 1.7552],
[ 0.9230, 0.9230, 0.9376, ..., 1.7552, 1.7552, 1.7552],
...,
[-0.7704, -0.7704, -0.7412, ..., -0.2886, -0.3178, -0.3908],
[-0.7558, -0.7558, -0.7558, ..., -0.3470, -0.4054, -0.4492],
[-0.7558, -0.7558, -0.7704, ..., -0.4054, -0.4492, -0.4930]],
[[ 1.2194, 1.2495, 1.2645, ..., 1.8948, 1.8948, 1.8948],
[ 1.2344, 1.2495, 1.2795, ..., 1.8948, 1.8948, 1.8948],
[ 1.2344, 1.2645, 1.2945, ..., 1.8948, 1.8948, 1.8948],
...,
[-0.5815, -0.5815, -0.5515, ..., -0.3564, -0.3714, -0.4464],
[-0.5665, -0.5665, -0.5515, ..., -0.3864, -0.4614, -0.5065],
[-0.5665, -0.5815, -0.6115, ..., -0.4464, -0.4914, -0.5515]],
[[ 1.2927, 1.3211, 1.3354, ..., 1.9753, 1.9753, 1.9753],
[ 1.3069, 1.3354, 1.3496, ..., 1.9753, 1.9753, 1.9753],
[ 1.3211, 1.3354, 1.3638, ..., 1.9753, 1.9753, 1.9753],
...,
[-0.3426, -0.3284, -0.3000, ..., -0.2146, -0.2431, -0.3284],
[-0.3142, -0.3142, -0.2857, ..., -0.2573, -0.3284, -0.3711],
[-0.3142, -0.3284, -0.3568, ..., -0.3284, -0.3711, -0.4137]]]])
# embedding
torch.Size([1, 256, 4096])
tensor([[[ 0.5669, 2.1602, 0.5522, ..., 1.6719, -1.1719, 0.6343],
[-0.2158, -1.9053, -1.3213, ..., -0.2773, -1.0303, -1.5508],
[ 0.4644, 0.0384, -2.0176, ..., 2.0605, 0.4480, -1.5918],
...,
[ 0.0100, -1.0996, 0.6797, ..., 6.4961, -1.7705, 4.5273],
[-1.6621, -2.1875, 0.3442, ..., 2.1309, 1.2607, 3.2891],
[-2.6719, -2.6094, -2.9102, ..., 2.0137, 2.4043, 0.7583]]],
The input images are identical, but the output embeddings are not close.
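To quantify the mismatch instead of eyeballing the printed tensors, here is a minimal comparison sketch; the names trt_embeds and hf_embeds are placeholders for the two [1, 256, 4096] embeddings above, not variables from the runner:

import torch

def compare_embeddings(trt_embeds: torch.Tensor, hf_embeds: torch.Tensor):
    # Compare in FP32 so the metric itself does not add FP16 rounding error.
    a = trt_embeds.float().flatten(0, 1)   # [bs * 256, 4096]
    b = hf_embeds.float().flatten(0, 1)
    max_abs = (a - b).abs().max().item()
    cos = torch.nn.functional.cosine_similarity(a, b, dim=-1)  # per-token cosine
    print(f"max abs diff: {max_abs:.4f}, "
          f"min/mean token cosine: {cos.min().item():.4f}/{cos.mean().item():.4f}")

# e.g. compare_embeddings(visual_outputs['output'], reference_embeds)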
Hi @bnuzhanyu, you got the "TensorRT-LLM output" from the code under "TensorRT-LLM code to print the input image and output embedding", right? And how did you get the "Qwen-VL-Chat ModelScope" results? If the ViT engine and the input are the same, the results are expected to be the same.
- Yes, I modified the TensorRT-LLM source to get the "TensorRT-LLM output".
- I used the code at https://modelscope.cn/models/qwen/Qwen-VL-Chat/file/view/master?fileName=modeling_qwen.py&status=1
  # line 565
  images = self.visual.encode(images)
  to get the "Qwen-VL-Chat ModelScope" results (see the sketch below).
any update?
Hi @calico-niko @bnuzhanyu, the ViT is offloaded to TRT, and its FP32 accuracy on TRT 9.3 is aligned with PyTorch. You can also change the TRT version from 9.3 to 10.x; its FP16 accuracy on TRT 10.x is fine.
I updated TRT to 10.0.1, but got the same diff. I think the results still vary a lot.
vit by trt:
tensor([[[ 1.1836, -0.6230,  0.5547, ..., -0.2083, -1.2617, -3.3867],
         [ 0.4189, -0.3127, -0.5732, ...,  0.8232,  1.0723,  0.8164],
         [-0.4614, -0.0329,  0.7266, ...,  0.3604, -1.1826, -0.0151],
         ...,
         [-0.3298,  1.5420,  1.1074, ...,  1.4434, -0.7012, -2.1191],
         [ 0.2686, -0.4331, -2.0234, ..., -0.1218, -0.9346, -0.0122],
         [-1.3047,  0.8560, -2.2266, ..., -0.5923,  1.6758,  0.3738]]],
       device='cuda:0', dtype=torch.float16)
vit by visual.encode(images):
tensor([[ 1.2646, -0.6035,  0.6665, ..., -0.1432, -1.2197, -3.3691],
        [ 0.2174, -0.3809, -1.2285, ...,  1.6143,  0.8530,  0.3674],
        [-0.4709, -0.0422,  0.7241, ...,  0.3362, -1.1611, -0.0426],
        ...,
        [-0.3340,  1.5625,  1.0889, ...,  1.5625, -0.6138, -2.0215],
        [ 0.2142, -0.3896, -2.0020, ..., -0.0946, -0.9102,  0.0299],
        [-1.3281,  0.9053, -2.2617, ..., -0.5645,  1.7812,  0.4031]],
       device='cuda:0', dtype=torch.float16)
@sunnyqgg Do you have any other suggestions?
Hi @hezeli123, the diffs are smaller compared with TRT 9.x. Do the current ViT diffs have a big impact on the final results? If so, you can try running the ViT with FP32 precision.
The current ViT diffs have a big impact and result in many bad cases. I am running the ViT with FP32 precision now.
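For reference, a minimal sketch of building the ViT plan in FP32 with the TensorRT Python API, i.e. simply not enabling the FP16 builder flag. The input tensor name "input" and the [bs, 3, 448, 448] shapes follow the build script above; the file paths and the rest are assumptions, not the exact vit_onnx_trt.py logic:

import tensorrt as trt

ONNX_FILE = "visual_encoder/visual_encoder.onnx"            # from the build script above
PLAN_FILE = "plan/visual_encoder/visual_encoder_fp32.plan"  # hypothetical FP32 plan name
MAX_BATCH_SIZE = 8

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open(ONNX_FILE, "rb") as f:
    assert parser.parse(f.read()), "failed to parse the ViT ONNX"

config = builder.create_builder_config()
# Not calling config.set_flag(trt.BuilderFlag.FP16) keeps the engine in FP32.
profile = builder.create_optimization_profile()
profile.set_shape("input",                      # ViT input: [bs, 3, 448, 448]
                  (1, 3, 448, 448),
                  (1, 3, 448, 448),
                  (MAX_BATCH_SIZE, 3, 448, 448))
config.add_optimization_profile(profile)

engine = builder.build_serialized_network(network, config)
with open(PLAN_FILE, "wb") as f:
    f.write(engine)

The runner code above does not change; it only needs to load the FP32 plan instead of the FP16 one.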
OK. If you have a strong desire to use FP16, I'll continue to look at this issue; if not, it will have a lower priority.
Hi @sunnyqgg. The results from trt-llm-0.12.0 with both fp16 and fp32 precision are significantly different from those of Qwen-VL at https://github.com/QwenLM/Qwen-VL.
trt-llm-0.12.0 with fp16:
trt-llm-0.12.0 with fp32:
https://github.com/QwenLM/Qwen-VL: