
Is the LM better than MLM?

Open SKBL5694 opened this issue 2 years ago • 3 comments

I found that in the BLIP paper you define the loss as ITC + ITM + LM. However, in ALBEF, the loss is defined as ITC + ITM + MLM. Is LM better than MLM, or are there other reasons you used LM instead of MLM?

SKBL5694 avatar Sep 22 '22 09:09 SKBL5694

Hi, the primary reason for using LM is because we want to enable image-to-text generation capability. Both losses perform similarly in terms of VL representation learning (MLM can be slightly better sometimes).

LiJunnan1992 avatar Sep 23 '22 01:09 LiJunnan1992
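A toy sketch of the two objectives the reply above contrasts may help. This is not the BLIP/ALBEF code; it only illustrates, in plain Python, how the training targets differ: a causal LM predicts every next token, while MLM predicts only the masked positions (the `IGNORE` value mimics the convention of excluding positions from the loss).

```python
# Toy illustration of LM vs. MLM target construction (not the actual BLIP code).
MASK, IGNORE = "[MASK]", -100  # IGNORE marks positions excluded from the loss

def lm_targets(tokens):
    # Causal LM: predict every next token; shift the sequence left by one.
    inputs = tokens[:-1]
    targets = tokens[1:]
    return inputs, targets

def mlm_targets(tokens, mask_positions):
    # MLM: replace chosen tokens with [MASK]; loss only on those positions.
    inputs = [MASK if i in mask_positions else t for i, t in enumerate(tokens)]
    targets = [t if i in mask_positions else IGNORE for i, t in enumerate(tokens)]
    return inputs, targets

caption = ["a", "dog", "on", "the", "beach"]
print(lm_targets(caption))
print(mlm_targets(caption, {1, 4}))
```

Because the LM objective supervises every position autoregressively, the same decoder can later generate captions token by token, which is the generation capability mentioned above.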

> Hi, the primary reason for using LM is because we want to enable image-to-text generation capability. Both losses perform similarly in terms of VL representation learning (MLM can be slightly better sometimes).

Thanks for the reply. In the ALBEF paper, Chapters 5 and 6, I see that the model can also do the VQA task, and there you say you consider VQA as an answer-generation problem. Does that mean you add a decoder for the VQA task (a downstream task) and train a task-specific decoder that is not included in the pre-trained model? In BLIP, by contrast, that decoder is included in the pre-trained model. Am I right?

SKBL5694 avatar Sep 23 '22 02:09 SKBL5694

For BLIP, the decoder is included in the pre-trained model. For ALBEF, we use the pre-trained encoder model to initialize the decoder.

LiJunnan1992 avatar Sep 24 '22 01:09 LiJunnan1992
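The ALBEF-style initialization described above can be sketched with plain dictionaries standing in for model state dicts. This is only an illustration of the idea, not the actual ALBEF code; the parameter names are hypothetical.

```python
# Hypothetical sketch: initialize a VQA answer decoder from a
# pre-trained encoder's weights (ALBEF-style), modeled as state dicts.
# Layer names are illustrative, not the real checkpoint keys.
pretrained_encoder = {
    "layer0.attn.weight": [0.1, 0.2],
    "layer0.ffn.weight": [0.3, 0.4],
}

# Start the decoder as a copy of the encoder's weights...
decoder = {name: list(w) for name, w in pretrained_encoder.items()}
# ...then add decoder-only parameters (e.g. cross-attention) from scratch.
decoder["layer0.cross_attn.weight"] = [0.0, 0.0]

print(sorted(decoder))
```

The shared layers thus start from the pre-trained weights, while anything the decoder needs beyond the encoder architecture is trained from scratch during VQA fine-tuning; in BLIP this extra step is unnecessary because the decoder is already part of pre-training.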