
Issues with the VQA-v2 training details for BLIP-2

Open runzeer opened this issue 1 year ago • 5 comments

Hi, thanks for your excellent work! When I fine-tune the pretrained model weights on the VQA-v2 dataset, I ran into an issue. Your paper says that the extracted image features and the input question are concatenated as the input of the Q-Former.

But I noticed that in your code the Q-Former's word_embeddings and position_embeddings layers are both set to None. So I wonder how you implemented this part. Should I remove the three lines that set these layers to None?

runzeer avatar Mar 17 '23 09:03 runzeer

I did the following to condition the Q-Former on the question text. It might not be correct, but it is a start.

  1. Commented out these lines https://github.com/salesforce/LAVIS/blob/47e0f3f25ca763975738c7224c8369207812ce6c/lavis/models/blip2_models/blip2_t5.py#L75-L79

  2. Replaced this code in forward() https://github.com/salesforce/LAVIS/blob/47e0f3f25ca763975738c7224c8369207812ce6c/lavis/models/blip2_models/blip2_t5.py#L112-L117

  3. Replaced the same code in predict_answers: https://github.com/salesforce/LAVIS/blob/aff3008b75b64f151d2a3918c0419fc195072008/lavis/models/blip2_models/blip2_t5.py#L269-L274

with

# Tokenize the question text so it can go into the Q-Former alongside the queries
text_tokens = self.tokenizer(
    samples["text_input"],
    padding="max_length",
    truncation=True,
    max_length=self.max_txt_len,
    return_tensors="pt",
).to(image.device)

# Attention mask that covers both the learned query tokens and the text tokens
query_atts_itm = torch.ones(query_tokens.size()[:-1], dtype=torch.long).to(
    image.device
)
attention_mask_all = torch.cat([query_atts_itm, text_tokens.attention_mask], dim=1)

# Q-Former forward pass: learned queries plus the question token ids,
# cross-attending to the frozen image encoder's features
query_output = self.Qformer.bert(
    input_ids=text_tokens.input_ids,
    attention_mask=attention_mask_all,
    query_embeds=query_tokens,
    encoder_hidden_states=image_embeds,
    encoder_attention_mask=image_atts,
    return_dict=True,
)

I took inspiration from the pre-training code: https://github.com/salesforce/LAVIS/blob/47e0f3f25ca763975738c7224c8369207812ce6c/lavis/models/blip2_models/blip2_qformer.py#L213
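
One detail I am not sure about (this is my assumption, not something I verified in the LAVIS code path): when the Q-Former gets both query_embeds and input_ids, last_hidden_state should contain the query positions first, followed by the text positions. If the downstream t5_proj is only supposed to see the query outputs, you would slice them out before projecting, roughly like this:

# Keep only the hidden states of the learned query tokens; they occupy
# the first query_tokens.size(1) positions of the Q-Former output.
query_hidden = query_output.last_hidden_state[:, : query_tokens.size(1), :]

inputs_t5 = self.t5_proj(query_hidden)
atts_t5 = torch.ones(inputs_t5.size()[:-1], dtype=torch.long).to(image.device)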

kondvit avatar Apr 03 '23 22:04 kondvit

(Quoting kondvit's suggestion above.)

When I tried this, I got this error: RuntimeError: Error(s) in loading state_dict for Blip2OPT: size mismatch for Qformer.bert.embeddings.word_embeddings.weight: copying a param with shape torch.Size([30523, 768]) from checkpoint, the shape in current model is torch.Size([30522, 768]).

Yujianyuan avatar Apr 16 '23 10:04 Yujianyuan

(Quoting the original question above.)

Excuse me, do you know why the Q-Former word_embeddings and position_embeddings layers are both None?

fmdmm avatar Apr 17 '23 11:04 fmdmm

(Quoting fmdmm's question above.)

I guess that if you do not set these to None, you will get an error like mine:

RuntimeError: Error(s) in loading state_dict for Blip2OPT: size mismatch for Qformer.bert.embeddings.word_embeddings.weight: copying a param with shape torch.Size([30523, 768]) from checkpoint, the shape in current model is torch.Size([30522, 768]).

If you only train image captioning, you do not need these layers, so the researchers set them to None to avoid this bug.
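
For what it's worth, one way around the size mismatch (a sketch based on what the pre-training model appears to do, not an official fix) is to keep the embedding layers and resize the Q-Former's token embeddings to the tokenizer's vocabulary size before loading the checkpoint, so the extra special token added by the pre-training tokenizer is accounted for:

# Assumption: self.tokenizer comes from init_tokenizer(), which adds one special
# token on top of bert-base-uncased (30522 -> 30523 entries). Resizing before
# load_state_dict makes the word_embeddings shape match the checkpoint.
self.Qformer.resize_token_embeddings(len(self.tokenizer))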

Yujianyuan avatar Apr 17 '23 11:04 Yujianyuan

Excuse me, I am also working on fine-tuning BLIP-2 on VQA. In the paper, I see that the prompt used for VQA is "Question: {} Answer:". I would like to ask whether my understanding is correct: during training, we do not use the prompt and only feed the original question; at test time, we use the prompt to reformat the question to get better performance. I would appreciate it if you could kindly help. Thanks.
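
To make the question concrete, here is a minimal sketch of how I imagine the prompt being applied at test time (the variable names are only illustrative, not taken from the LAVIS code):

prompt = "Question: {} Answer:"

# Illustrative stand-in for samples["text_input"]
questions = ["What color is the bus?", "How many dogs are there?"]

# Reformat each raw question with the prompt before tokenization / generation
text_input = [prompt.format(q) for q in questions]
# -> ["Question: What color is the bus? Answer:", "Question: How many dogs are there? Answer:"]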

qwqwq1445 avatar Dec 22 '23 02:12 qwqwq1445