ALBEF
Which layers of BERT are used for MLM Loss?
Hi,
It seems from Figure 1 of the paper that only the last 6 of the 12 BERT layers are used for the MLM loss. However, when I look at the code here, it seems like all 12 layers are used.
It would be great if you can clarify this. Thanks.
Hi, the entire model is trained end-to-end with the MLM loss. The first 6 layers of BERT are text-only, whereas the last 6 layers receive image features through cross-attention.
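For clarity, here is a minimal PyTorch sketch of that layout. It is illustrative only, not the actual xbert.py implementation: the class names, dimensions, and simplified layer structure are assumptions. The point is that the first 6 layers run self-attention over text only, the last 6 additionally cross-attend to image features, and the MLM loss backpropagates through all 12 layers.

```python
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    """One transformer layer; cross-attention to image features is optional."""
    def __init__(self, dim: int, num_heads: int, use_cross_attention: bool):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.use_cross_attention = use_cross_attention
        if use_cross_attention:
            self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # Self-attention over text tokens (all 12 layers do this).
        attn_out, _ = self.self_attn(text, text, text)
        text = self.norm1(text + attn_out)
        # Cross-attention to image features (only the last 6 layers).
        if self.use_cross_attention:
            cross_out, _ = self.cross_attn(text, image, image)
            text = self.norm2(text + cross_out)
        return self.norm3(text + self.ffn(text))

# 12 layers total: layers 0-5 are text-only, layers 6-11 fuse in image features.
dim, num_heads, num_layers, fusion_start = 768, 12, 12, 6
layers = nn.ModuleList(
    FusionLayer(dim, num_heads, use_cross_attention=(i >= fusion_start))
    for i in range(num_layers)
)

text_emb = torch.randn(2, 30, dim)    # (batch, text tokens, dim)
image_emb = torch.randn(2, 197, dim)  # (batch, image patches, dim)
hidden = text_emb
for layer in layers:
    hidden = layer(hidden, image_emb)
# `hidden` would feed the MLM head; the MLM loss thus trains all 12 layers end-to-end.
print(hidden.shape)  # torch.Size([2, 30, 768])
```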
I have the same question. It seems that 'multi_modal' mode is selected with the MLM loss, in which case all layers receive image features.
Oh, I figured it out. Please see line 451 in xbert.py.