Issues with the VQA-v2 training details for BLIP-2
Hi, thanks for your excellent work! When I fine-tune the pretrained model weights on the VQA-v2 dataset, I run into an issue. Your paper says that the extracted image features and the input question are concatenated as the input of the Q-Former.
But I noticed that in your code the Q-Former word_embeddings and position_embeddings layers are both set to None, so I wonder how you implemented this part.
Should I remove these three lines?
I did the following to condition the Q-Former. It might not be correct, but it is a start.

- Commented out these lines: https://github.com/salesforce/LAVIS/blob/47e0f3f25ca763975738c7224c8369207812ce6c/lavis/models/blip2_models/blip2_t5.py#L75-L79
- Replaced this code in forward(): https://github.com/salesforce/LAVIS/blob/47e0f3f25ca763975738c7224c8369207812ce6c/lavis/models/blip2_models/blip2_t5.py#L112-L117
- Replaced the same code in predict_answers(): https://github.com/salesforce/LAVIS/blob/aff3008b75b64f151d2a3918c0419fc195072008/lavis/models/blip2_models/blip2_t5.py#L269-L274

with:
```python
text_tokens = self.tokenizer(
    samples["text_input"],
    padding="max_length",
    truncation=True,
    max_length=self.max_txt_len,
    return_tensors="pt",
).to(image.device)
# Attention mask for the learned query tokens (all ones).
query_atts_itm = torch.ones(query_tokens.size()[:-1], dtype=torch.long).to(
    image.device
)
# Let the Q-Former attend to both the query tokens and the question tokens.
attention_mask_all = torch.cat([query_atts_itm, text_tokens.attention_mask], dim=1)
# Run the Q-Former on the question tokens plus the query embeddings,
# cross-attending to the frozen image features.
query_output = self.Qformer.bert(
    input_ids=text_tokens.input_ids,
    attention_mask=attention_mask_all,
    query_embeds=query_tokens,
    encoder_hidden_states=image_embeds,
    encoder_attention_mask=image_atts,
    return_dict=True,
)
```
I took inspiration from the pre-training code: https://github.com/salesforce/LAVIS/blob/47e0f3f25ca763975738c7224c8369207812ce6c/lavis/models/blip2_models/blip2_qformer.py#L213
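One more caveat (my own guess, not something from the repo): since the question tokens are now passed together with query_embeds, last_hidden_state contains outputs for the query tokens followed by the text tokens, so I only project the query positions into T5. A minimal sketch of the line that follows the call above, assuming the rest of forward() stays as in blip2_t5.py:

```python
# Keep only the outputs at the learned query positions before projecting into
# the T5 embedding space; the remaining positions belong to the question tokens.
# This mirrors how the pre-training code slices last_hidden_state, but treat it
# as an untested sketch rather than the official implementation.
inputs_t5 = self.t5_proj(
    query_output.last_hidden_state[:, : query_tokens.size(1), :]
)
atts_t5 = torch.ones(inputs_t5.size()[:-1], dtype=torch.long).to(image.device)
```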
When I try this, I get the following error: RuntimeError: Error(s) in loading state_dict for Blip2OPT: size mismatch for Qformer.bert.embeddings.word_embeddings.weight: copying a param with shape torch.Size([30523, 768]) from checkpoint, the shape in current model is torch.Size([30522, 768]).
Excuse me, do you know the reason why the Q-Former word_embeddings and position_embeddings layers are both None?
I guess that if you do not set these to None, you will get an error like mine:
RuntimeError: Error(s) in loading state_dict for Blip2OPT: size mismatch for Qformer.bert.embeddings.word_embeddings.weight: copying a param with shape torch.Size([30523, 768]) from checkpoint, the shape in current model is torch.Size([30522, 768]).
If you only train image captioning you do not need these layers, so the researchers set them to None to avoid this bug.
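If I read the code correctly, the extra row (30523 vs. 30522) comes from init_tokenizer() adding one special token on top of the standard bert-base-uncased vocabulary. So if you comment out the lines that set the embeddings to None, a possible workaround (my own sketch, not an official fix) is to resize the Q-Former embeddings to the tokenizer length before the checkpoint is loaded, the same way the pre-training model in blip2_qformer.py does:

```python
# Sketch: call this in __init__ after init_Qformer() and before the pretrained
# checkpoint is loaded, so that Qformer.bert.embeddings.word_embeddings has
# 30523 rows and matches the state_dict. blip2_qformer.py already does this.
self.Qformer.resize_token_embeddings(len(self.tokenizer))
```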
Excuse me, I am also working on fine-tuning BLIP-2 on VQA. In the paper, I see that the prompt used for VQA is "Question: {} Answer:". I would like to ask whether my understanding is correct: during training we do not use the prompt and only feed the original question, while during testing we use the prompt to reformat the question and get better performance. I would appreciate it if you could kindly help. Thanks.
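For reference, this is how I read the prompt handling in predict_answers() (a rough sketch; the prompt string itself comes from the run config, and whether it should also be applied during fine-tuning is exactly what I am asking about):

```python
# Sketch of the inference-time prompt formatting as I understand it from
# predict_answers(); "Question: {} Answer:" is the VQA prompt from the paper.
prompt = "Question: {} Answer:"
if prompt:
    text_input = [prompt.format(question) for question in samples["text_input"]]
else:
    text_input = samples["text_input"]
```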