ALBEF
Which layers of BERT are used for MLM Loss?
Hi,
It seems from Figure 1 of the paper that only the last 6 of the 12 BERT layers are used for the MLM loss. However, when I look at the code here, it seems like all 12 layers are used.
It would be great if you can clarify this. Thanks.
Hi, the entire model is trained end-to-end with the MLM loss. The first 6 layers of BERT are text-only, whereas the last 6 layers receive image features through cross-attention.
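For clarity, here is a minimal PyTorch sketch of that layout. It is illustrative only, not the actual xbert.py implementation: the class names, dimensions, and simplified layer structure are assumptions. The point is that the first 6 layers run self-attention over text only, the last 6 additionally cross-attend to image features, and the MLM loss backpropagates through all 12 layers.

```python
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    """One transformer layer; cross-attention to image features is optional."""
    def __init__(self, dim: int, num_heads: int, use_cross_attention: bool):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.use_cross_attention = use_cross_attention
        if use_cross_attention:
            self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # Self-attention over text tokens (all 12 layers do this).
        attn_out, _ = self.self_attn(text, text, text)
        text = self.norm1(text + attn_out)
        # Cross-attention to image features (only the last 6 layers).
        if self.use_cross_attention:
            cross_out, _ = self.cross_attn(text, image, image)
            text = self.norm2(text + cross_out)
        return self.norm3(text + self.ffn(text))

# 12 layers total: layers 0-5 are text-only, layers 6-11 fuse in image features.
dim, num_heads, num_layers, fusion_start = 768, 12, 12, 6
layers = nn.ModuleList(
    FusionLayer(dim, num_heads, use_cross_attention=(i >= fusion_start))
    for i in range(num_layers)
)

text_emb = torch.randn(2, 30, dim)    # (batch, text tokens, dim)
image_emb = torch.randn(2, 197, dim)  # (batch, image patches, dim)
hidden = text_emb
for layer in layers:
    hidden = layer(hidden, image_emb)
# `hidden` would feed the MLM head; the MLM loss thus trains all 12 layers end-to-end.
print(hidden.shape)  # torch.Size([2, 30, 768])
```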
I have the same question. It seems that 'multi_modal' mode is selected with the MLM loss, in which case all layers receive image features.
Oh, I figured it out. Please see line 451 in xbert.py.