LAVIS
Issue with the VQA-v2 fine-tuning details for BLIP-2
Hi, thanks for your excellent work! When fine-tuning the pretrained model weights on the VQA-v2 dataset, I ran into an issue. Your paper says the extracted image features and the input question are concatenated as the input to the Q-Former.
However, I noticed that in your code the Q-Former's word_embedding and position embedding layers are both None, so I wonder how you implemented this part.
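For context, here is a minimal sketch of what I understood the paper to describe: learned query tokens concatenated with embedded question tokens along the sequence dimension before entering a Q-Former-style transformer. This is NOT the actual LAVIS implementation; all module names and sizes below are illustrative assumptions on my part.

```python
import torch
import torch.nn as nn

# Illustrative sketch only -- not the LAVIS code. Dimensions are made up.
hidden = 64        # assumed hidden size
num_queries = 32   # BLIP-2 uses 32 learned query tokens
vocab = 100        # toy vocabulary size

# Learned query tokens, shared across the batch.
query_tokens = nn.Parameter(torch.zeros(1, num_queries, hidden))
# Stand-in for a text word-embedding layer (the layer that is None in the code).
word_embed = nn.Embedding(vocab, hidden)
# Stand-in for the Q-Former transformer stack.
layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

question_ids = torch.randint(0, vocab, (2, 10))  # batch of 2 questions, 10 tokens each
text_embeds = word_embed(question_ids)           # (2, 10, hidden)
# Concatenate queries and question embeddings along the sequence dimension.
inputs = torch.cat([query_tokens.expand(2, -1, -1), text_embeds], dim=1)
out = encoder(inputs)
print(out.shape)  # (2, num_queries + 10, hidden)
```

If the word_embedding layer is None, it is unclear to me where the question tokens get embedded before this concatenation happens.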
Should I remove these three lines?