
Question for the vision-language experts

Open liuxuannan opened this issue 1 year ago • 4 comments

When I was using the BEiT-3-base model, I found that in the Multiway Network defined in the model source code, only a Vision-Expert and a Language-Expert are defined for the FFN. Where is the vision-language expert defined?

liuxuannan avatar Jun 09 '23 03:06 liuxuannan
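For context, the multiway FFN routing the question refers to works by position in the packed sequence: image tokens go through the vision expert and text tokens through the language expert. A minimal sketch of that routing (class and argument names such as `MultiwayFFN` and `split_position` are illustrative, not the exact torchscale/BEiT-3 API):

```python
import torch
import torch.nn as nn

class MultiwayFFN(nn.Module):
    """Vision/language FFN experts with position-based routing.
    Minimal sketch only; not the released implementation."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.vision_ffn = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.language_ffn = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor, split_position: int) -> torch.Tensor:
        # x: (batch, seq_len, dim); the first `split_position` tokens are image
        # patches, the rest are text tokens (assumed packing for this sketch).
        img, txt = x[:, :split_position], x[:, split_position:]
        return torch.cat([self.vision_ffn(img), self.language_ffn(txt)], dim=1)
```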

Hi @liuxuannan,

We have done some architecture exploration and found that decoupling the attention parameters yields improvements, so we released the models with this better architecture.

Please refer to Section 11 of the BEiT-3 Supp.

wenhui0924 avatar Jun 15 '23 02:06 wenhui0924
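To make the "decoupled attention parameters" point concrete: each modality gets its own Q/K/V projections, but attention is still computed over the concatenated image-text sequence, so the two modalities attend to each other without a separate vision-language FFN expert. A rough single-head sketch under that reading (names like `DecoupledAttention` are illustrative, not the released implementation; see Section 11 of the supplementary for the actual design):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledAttention(nn.Module):
    """Modality-specific Q/K/V projections, shared attention computation.
    Illustrative sketch only."""

    def __init__(self, dim: int):
        super().__init__()
        self.qkv_vision = nn.Linear(dim, 3 * dim)
        self.qkv_language = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, split_position: int) -> torch.Tensor:
        img, txt = x[:, :split_position], x[:, split_position:]
        q_i, k_i, v_i = self.qkv_vision(img).chunk(3, dim=-1)
        q_t, k_t, v_t = self.qkv_language(txt).chunk(3, dim=-1)
        # Queries/keys/values from both modalities are concatenated, so every
        # token attends to every other token despite the separate projections.
        q = torch.cat([q_i, q_t], dim=1)
        k = torch.cat([k_i, k_t], dim=1)
        v = torch.cat([v_i, v_t], dim=1)
        attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return self.proj(attn @ v)
```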

Hi @wenhui0924,

If attention is decoupled, is pre-training still mask-then-predict? Is pre-training then equivalent to training the two modalities with two separate Transformer models?

RayleighChen avatar Jun 15 '23 07:06 RayleighChen

Pre-training is still mask-then-predict, and it is not equivalent to training the two modalities with two separate Transformer models. We also use image-text pairs to train the model to learn alignments across modalities, and the model can perform deep fusion of images and text through the shared attention.

wenhui0924 avatar Jun 15 '23 08:06 wenhui0924
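In other words, the objective remains a single masked-prediction loss over the packed image-text sequence. A minimal sketch of one such step (here `model`, `lm_head`, `MASK_ID`, and the single uniform mask ratio are placeholders; the actual recipe masks different fractions per modality and uses a visual tokenizer to discretize image patches):

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # placeholder id for the [MASK] token

def mask_then_predict_step(model, lm_head, image_tokens, text_tokens, mask_ratio=0.15):
    # image_tokens / text_tokens: (batch, len) discrete token ids.
    tokens = torch.cat([image_tokens, text_tokens], dim=1)
    mask = torch.rand(tokens.shape, device=tokens.device) < mask_ratio
    corrupted = tokens.masked_fill(mask, MASK_ID)
    hidden = model(corrupted, split_position=image_tokens.shape[1])  # shared multiway encoder
    logits = lm_head(hidden)                                         # vocab logits per position
    # The loss is computed only on the masked positions, for both modalities at once.
    return F.cross_entropy(logits[mask], tokens[mask])
```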

Hi @wenhui0924. To handle the two modalities, the released BEiT-3 model has two independent modules, A and B. There seems to be no self-attention between the two modules. Could you describe the interactions between the two modalities during pre-training? I look forward to your reply.

RayleighChen avatar Jun 15 '23 08:06 RayleighChen

It seems the interactions between the two modalities happen through the shared attention module. I am also confused about the difference between the paper and the released code.

zhouruikun avatar Jul 20 '23 04:07 zhouruikun
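One quick way to see the interaction: with modality-specific projections but a shared attention computation (as in the `DecoupledAttention` sketch above), the text outputs still depend on the image tokens. A small gradient check, assuming that sketch:

```python
import torch

# Reusing the DecoupledAttention sketch from earlier in this thread.
dim, n_img, n_txt = 16, 4, 3
attn = DecoupledAttention(dim)
x = torch.randn(1, n_img + n_txt, dim, requires_grad=True)
out = attn(x, split_position=n_img)
out[:, n_img:].sum().backward()         # backprop from the text outputs only
print(x.grad[:, :n_img].abs().sum())    # nonzero: text outputs depend on image tokens
```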

The paper reports a model configuration with 40 layers, but the largest configuration in the released code and pre-trained checkpoints has only 24 layers. Is the 40-layer pre-trained model publicly released?

zhouruikun avatar Jul 20 '23 04:07 zhouruikun