
BEiT-3 self-attention is not shared across modalities

Open xinghua-qu opened this issue 1 year ago • 1 comment

Hi,

I found that the BEiT-3 model built on torchscale does not match what the paper describes. In the Multiway Transformer, the self-attention layer is supposed to be shared across the different modalities. However, that is not the case in the implementation, as shown in the screenshot: there appear to be two parallel self-attention layers (A and B) instead of one shared layer.
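To make the difference concrete, here is a minimal sketch (not the torchscale code; class and module names are illustrative) of a Multiway block as the paper describes it, with a single shared self-attention module and modality-specific feed-forward experts:

```python
# Minimal sketch of the paper's Multiway block: one self-attention module
# shared by all modalities, with per-modality FFN "experts".
# Names (MultiwayBlockShared, ffn_vision, ffn_language) are illustrative only.
import torch
import torch.nn as nn


class MultiwayBlockShared(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        # Single attention module, reused for vision, language,
        # and vision-language tokens.
        self.shared_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # One feed-forward expert per modality route.
        self.ffn_vision = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.ffn_language = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        attn_out, _ = self.shared_attn(x, x, x)
        x = x + attn_out
        ffn = self.ffn_vision if modality == "vision" else self.ffn_language
        return x + ffn(x)
```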

xinghua-qu — Jun 01 '23 01:06

Hi @xinghua-qu,

Please refer to Section 11 of the BEiT-3 Supp.

We explored several architecture variants and found that we can decouple the attention parameters while still maintaining the ability to perform deep fusion, so the released models use this architecture.
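For comparison with the sketch above, here is a minimal sketch (again illustrative names, not the released torchscale code) of the decoupled variant described here, where each modality route carries its own attention parameters:

```python
# Minimal sketch of the decoupled variant: separate attention parameters per
# modality route (the "A" and "B" modules in the screenshot), alongside the
# per-modality FFN experts. Names are illustrative only.
import torch
import torch.nn as nn


class MultiwayBlockDecoupled(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        # Separate attention parameters per modality route.
        self.attn = nn.ModuleDict({
            "vision": nn.MultiheadAttention(dim, num_heads, batch_first=True),
            "language": nn.MultiheadAttention(dim, num_heads, batch_first=True),
        })
        self.ffn = nn.ModuleDict({
            "vision": nn.Sequential(
                nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
            ),
            "language": nn.Sequential(
                nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
            ),
        })

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        attn_out, _ = self.attn[modality](x, x, x)
        x = x + attn_out
        return x + self.ffn[modality](x)
```

Deep fusion is still possible in this setup because fusion layers can run attention over the concatenated vision-language token sequence, even though the attention weights are no longer tied across modalities.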

wenhui0924 — Jun 15 '23 02:06