LAVIS
How to handle multiple images with Blip2 models?
I have a large number of questions for a VQA task that each require more than one image to answer, i.e. one question per image set. Can I extract the features from each image in the set and then concatenate them as input to the Q-Former? Thanks.
Yes, you can do precisely that. The Q-Former reads the image features through cross-attention, so the length of the image feature sequence can vary. However, fine-tuning the model is recommended so that it adapts to multi-image input.
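To illustrate why the sequence length can vary, here is a minimal toy sketch in plain Python. It is not LAVIS code: `cross_attention` and `fake_image_features` are hypothetical stand-ins, and the attention here has a single head and no learned projections, unlike the real Q-Former. It only shows that a fixed set of queries attending over concatenated image tokens yields the same number of outputs no matter how many images are concatenated.

```python
import math
import random


def cross_attention(queries, keys_values):
    """Toy single-head dot-product cross-attention (no learned projections).

    queries:     list of num_queries vectors, each of length dim
    keys_values: list of num_tokens vectors, each of length dim
    Returns one output vector per query, independent of num_tokens.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score of this query against every image token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        # Numerically stable softmax over the image tokens.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the image tokens.
        outputs.append([sum(w * kv[i] for w, kv in zip(weights, keys_values))
                        for i in range(dim)])
    return outputs


def fake_image_features(num_tokens, dim, rng):
    """Hypothetical stand-in for an image encoder's output for one image."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(num_tokens)]


rng = random.Random(0)
dim, num_queries, tokens_per_image = 8, 4, 5

# Stand-in for the fixed set of query vectors.
queries = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
           for _ in range(num_queries)]

# Three images, each encoded separately.
images = [fake_image_features(tokens_per_image, dim, rng) for _ in range(3)]

# Concatenate along the token (sequence) axis, not the feature axis.
concatenated = [token for image in images for token in image]

single = cross_attention(queries, images[0])
multi = cross_attention(queries, concatenated)

print("tokens in:", len(images[0]), "-> outputs:", len(single))
print("tokens in:", len(concatenated), "-> outputs:", len(multi))
```

The concatenation is along the token axis, so each image keeps its own feature vectors and the queries simply see a longer sequence. A pretrained model has only seen single-image sequences, which is why fine-tuning is still advisable.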
Hello, sorry to comment on a closed issue. May I ask how you handled two or more images as input for visual QA with BLIP-2? Any public code or tips would be appreciated. Thanks.