The LLaVA model's batch inference results differ from the batch=1 results
System info
GPU: A100
tensorrt: 9.3.0.post12.dev1
tensorrt-llm: 0.9.0
torch: 2.2.2
Reproduction
export MODEL_NAME="llava-1.5-7b-hf"
git clone https://huggingface.co/llava-hf/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
python ../llama/convert_checkpoint.py \
--model_dir tmp/hf_models/${MODEL_NAME} \
--output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--dtype float16
trtllm-build \
--checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--output_dir trt_engines/${MODEL_NAME}/fp16/1-gpu \
--gemm_plugin float16 \
--use_fused_mlp \
--max_batch_size 16 \
--max_input_len 2048 \
--max_output_len 512 \
--max_multimodal_len 9216 # 16 (max_batch_size) * 576 (num_visual_features)
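For context on that value: max_multimodal_len has to cover the visual tokens of a full batch. LLaVA-1.5 encodes each image into 576 visual features, so with max_batch_size 16 the budget is 16 × 576 = 9216.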
python build_visual_engine.py --model_path tmp/hf_models/${MODEL_NAME} --model_type llava # or "--model_type vila" for VILA
python run.py \
--max_new_tokens 20 \
--hf_model_dir tmp/hf_models/${MODEL_NAME} \
--visual_engine_dir visual_engines/${MODEL_NAME} \
--llm_engine_dir trt_engines/${MODEL_NAME}/fp16/1-gpu \
--decoder_llm \
--input_text "Question: which city is this? Answer:"
--batch_size 16
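For the batch=1 reference I run the same command with --batch_size 1; all other flags are unchanged (shown here for completeness):
python run.py \
--max_new_tokens 20 \
--hf_model_dir tmp/hf_models/${MODEL_NAME} \
--visual_engine_dir visual_engines/${MODEL_NAME} \
--llm_engine_dir trt_engines/${MODEL_NAME}/fp16/1-gpu \
--decoder_llm \
--input_text "Question: which city is this? Answer:" \
--batch_size 1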
If I use the same data to form a batch, the result looks like this:
and if I use two different prompts to form a batch, the result looks like this:
The image used is: https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png
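To make the mismatch concrete, I diff the batched generations against the batch=1 generations per prompt. A minimal sketch, assuming the outputs have been saved one generated sequence per line, in prompt order, to batch16.txt and batch1.txt (hypothetical file names; run.py does not write these itself):

# Minimal comparison sketch: reads two hypothetical files (one generated
# sequence per line, in the same prompt order) and reports differences.
def load(path):
    with open(path) as f:
        return [line.rstrip("\n") for line in f]

batched = load("batch16.txt")  # generations from --batch_size 16
single = load("batch1.txt")    # generations from --batch_size 1

mismatches = 0
for i, (b, s) in enumerate(zip(batched, single)):
    if b != s:
        mismatches += 1
        print(f"seq {i}:\n  batch=16: {b!r}\n  batch=1 : {s!r}")

print(f"{mismatches}/{min(len(batched), len(single))} sequences differ")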