The LLaVA model's batch inference results differ from the batch=1 results
System info
GPU: A100
tensorrt: 9.3.0.post12.dev1
tensorrt-llm: 0.9.0
torch: 2.2.2
Reproduction
export MODEL_NAME="llava-1.5-7b-hf"
git clone https://huggingface.co/llava-hf/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
python ../llama/convert_checkpoint.py \
--model_dir tmp/hf_models/${MODEL_NAME} \
--output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--dtype float16
trtllm-build \
--checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
--output_dir trt_engines/${MODEL_NAME}/fp16/1-gpu \
--gemm_plugin float16 \
--use_fused_mlp \
--max_batch_size 16 \
--max_input_len 2048 \
--max_output_len 512 \
--max_multimodal_len 9216 # 16 (max_batch_size) * 576 (num_visual_features)
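For context on that value: max_multimodal_len has to cover the visual tokens of a full batch. LLaVA-1.5 encodes each image into 576 visual features, so with max_batch_size 16 the budget is 16 × 576 = 9216.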
python build_visual_engine.py --model_path tmp/hf_models/${MODEL_NAME} --model_type llava # or "--model_type vila" for VILA
python run.py \
--max_new_tokens 20 \
--hf_model_dir tmp/hf_models/${MODEL_NAME} \
--visual_engine_dir visual_engines/${MODEL_NAME} \
--llm_engine_dir trt_engines/${MODEL_NAME}/fp16/1-gpu \
--decoder_llm \
--input_text "Question: which city is this? Answer:"
--batch_size 16
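For the batch=1 reference I run the same command with --batch_size 1; all other flags are unchanged (shown here for completeness):
python run.py \
--max_new_tokens 20 \
--hf_model_dir tmp/hf_models/${MODEL_NAME} \
--visual_engine_dir visual_engines/${MODEL_NAME} \
--llm_engine_dir trt_engines/${MODEL_NAME}/fp16/1-gpu \
--decoder_llm \
--input_text "Question: which city is this? Answer:" \
--batch_size 1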
If I use the same data to form a batch, the result looks like this:
and if I use two different prompts to form a batch, the result looks like this:
The image used is: https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png
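To make the mismatch concrete, I diff the batched generations against the batch=1 generations per prompt. A minimal sketch, assuming the outputs have been saved one generated sequence per line, in prompt order, to batch16.txt and batch1.txt (hypothetical file names; run.py does not write these itself):

# Minimal comparison sketch: reads two hypothetical files (one generated
# sequence per line, in the same prompt order) and reports differences.
def load(path):
    with open(path) as f:
        return [line.rstrip("\n") for line in f]

batched = load("batch16.txt")  # generations from --batch_size 16
single = load("batch1.txt")    # generations from --batch_size 1

mismatches = 0
for i, (b, s) in enumerate(zip(batched, single)):
    if b != s:
        mismatches += 1
        print(f"seq {i}:\n  batch=16: {b!r}\n  batch=1 : {s!r}")

print(f"{mismatches}/{min(len(batched), len(single))} sequences differ")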