Question for the vision-language experts
When I was using the BEiT-3-base model, I found that in the Multiway Network defined in the model source code, only a vision expert and a language expert are defined for the FFN. Where is the vision-language expert defined?
Hi @liuxuannan,
We have done some architecture explorations and found that decoupling the attention parameters yields improvements, so we released the models with the better architecture.
Please refer to Section 11 of the BEiT-3 Supp.
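For anyone else who lands here: my reading of the answer above, as a minimal PyTorch sketch (hypothetical code, not the actual torchscale implementation; the class and variable names are mine). In the released architecture each modality gets its own Q/K/V/output projections and its own FFN expert, while the attention softmax itself runs once over the concatenated image and text tokens, so there is no separate vision-language FFN expert:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledMultiwayBlock(nn.Module):
    """Illustrative Multiway block with modality-decoupled parameters.

    Hypothetical re-implementation for discussion only; index 0 = vision
    expert, index 1 = language expert. LayerNorms omitted for brevity.
    """

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Decoupled attention projections: one set per modality.
        self.qkv = nn.ModuleList([nn.Linear(dim, 3 * dim) for _ in range(2)])
        self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(2)])
        # Modality-specific FFN experts; no shared vision-language FFN.
        self.ffn = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                          nn.Linear(4 * dim, dim))
            for _ in range(2)
        ])

    def forward(self, x: torch.Tensor, split: int) -> torch.Tensor:
        # x: (batch, seq, dim); tokens [:split] are image, [split:] are text.
        b, n, d = x.shape

        def route(experts, t):
            # Vision expert on image tokens, language expert on text
            # tokens, then re-concatenate along the sequence dimension.
            return torch.cat([experts[0](t[:, :split]),
                              experts[1](t[:, split:])], dim=1)

        # Decoupled projections, but a SINGLE attention over all tokens,
        # so image and text tokens still attend to each other (deep fusion).
        qkv = route(self.qkv, x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)        # each: (b, heads, n, hd)
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, n, d)  # back to (b, n, dim)
        x = x + route(self.proj, out)

        # Each token is routed to its own modality's FFN expert.
        return x + route(self.ffn, x)
```

Under this reading, the answer to the original question is that there is no vision-language FFN expert in the released checkpoints; cross-modal interaction happens entirely inside the shared attention.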
Hi @wenhui0924,
If attention is decoupled, is pre-training still mask-then-predict? Is pre-training then equivalent to training the two modalities with two separate Transformer models?
The pre-training is still mask-then-predict, and it is not two modalities trained with two separate Transformer models. We also use image-text pairs to train the model to learn alignments between the modalities. The model can also perform deep fusion of images and text via the shared attention.
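To make the "shared attention as deep fusion" point concrete, here is a usage-level demo reusing the hypothetical `DecoupledMultiwayBlock` sketch from my earlier comment (again, illustrative only): the image and text tokens are concatenated and pass through one attention computation, so every text token can attend to every image patch even though no projection or FFN weights are shared across modalities.

```python
import torch

# Toy forward pass through the sketch above: image tokens first,
# text tokens after the split index, fused by one shared attention.
block = DecoupledMultiwayBlock(dim=64, num_heads=4)
image_tokens = torch.randn(2, 197, 64)  # e.g. 14x14 ViT patches + [CLS]
text_tokens = torch.randn(2, 32, 64)    # e.g. an embedded caption
x = torch.cat([image_tokens, text_tokens], dim=1)

out = block(x, split=image_tokens.size(1))
print(out.shape)  # torch.Size([2, 229, 64]) -- one fused sequence
```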
Hi, @wenhui0924. To handle the two modalities, the released BEiT-3 model has two independent modules, A and B, and there seems to be no self-attention between the two modules. Could you describe the interactions between the two modalities during pre-training? I look forward to your reply.
It seems the interactions between the two modalities happen through the shared attention module. I am also confused about the difference between the paper and the released code.
The paper reports 40 layers for the model, but the large configuration in the released code and pre-trained checkpoints has only 24 layers. Has the 40-layer pre-trained model been publicly released?