ALBEF
Architecture of ALBEF
Hello, I would like to run some experiments with the ALBEF model. I have read your paper, but I don't understand why the first six layers of BERT-base are used as the text encoder and the last six layers as the multimodal encoder. Why wasn't the full 12-layer BERT-base used for both the text encoder and the multimodal encoder? Any help would be greatly appreciated. @LiJunnan1992 @svc-scm @chenxwh
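To make sure I am reading the paper correctly, here is a minimal sketch of the split as I understand it. It only partitions layer indices and does not load any model. The names `NUM_LAYERS`, `FUSION_LAYER`, and `split_layers` are my own placeholders, not taken from the ALBEF code.

```python
# Hypothetical sketch: partition the 12 BERT-base layers into two groups.
# All names here are illustrative assumptions, not the ALBEF implementation.
NUM_LAYERS = 12    # BERT-base has 12 transformer layers
FUSION_LAYER = 6   # assumed index of the first multimodal-encoder layer


def split_layers(num_layers=NUM_LAYERS, fusion_layer=FUSION_LAYER):
    """Return (text_encoder_layers, multimodal_encoder_layers) as index lists."""
    if not 0 <= fusion_layer <= num_layers:
        raise ValueError("fusion_layer must be between 0 and num_layers")
    all_layers = list(range(num_layers))
    # Layers before the fusion point act as the text encoder;
    # the remaining layers act as the multimodal encoder.
    return all_layers[:fusion_layer], all_layers[fusion_layer:]


text_layers, multimodal_layers = split_layers()
print("text encoder layers:", text_layers)
print("multimodal encoder layers:", multimodal_layers)
```

If this reading is right, my question is essentially why the fusion point sits at layer 6 rather than using two separate 12-layer stacks.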