BEiT-3 self-attention is not shared across modalities
Hi,
I found that the model used in BEiT-3, as implemented in torchscale, does not match what the paper describes.
In the Multiway Transformer, the self-attention layer should be shared across the different modalities. However, that is not the case in the implementation, as the screenshot shows: there are two parallel self-attention layers (A and B) instead of one shared layer.
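For reference, here is a rough sketch of how I understand the block described in the paper: one self-attention layer shared by all tokens, followed by modality-specific FFN experts. This is purely illustrative, not the torchscale code; names such as `MultiwayBlockSketch` and `split_position` are my own.

```python
import torch
import torch.nn as nn


def make_ffn(dim: int) -> nn.Sequential:
    return nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))


class MultiwayBlockSketch(nn.Module):
    """Shared self-attention over the full image-text sequence,
    then modality-specific FFN experts, as the paper describes."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.shared_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn_vision = make_ffn(dim)
        self.ffn_language = make_ffn(dim)

    def forward(self, x: torch.Tensor, split_position: int) -> torch.Tensor:
        # All image and text tokens attend to each other through ONE attention layer.
        x = x + self.shared_attn(x, x, x, need_weights=False)[0]
        # Tokens before split_position are routed to the vision expert, the rest to the language expert.
        x_v = x[:, :split_position] + self.ffn_vision(x[:, :split_position])
        x_l = x[:, split_position:] + self.ffn_language(x[:, split_position:])
        return torch.cat([x_v, x_l], dim=1)


block = MultiwayBlockSketch(dim=64, num_heads=4)
tokens = torch.randn(2, 10, 64)               # e.g. 6 image tokens + 4 text tokens
print(block(tokens, split_position=6).shape)  # torch.Size([2, 10, 64])
```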
Hi @xinghua-qu,
Please refer to Section 11 of the BEiT-3 supplementary material.
We conducted some architecture explorations and found that we can decouple the attention parameters while still maintaining the ability to perform deep fusion, so we released models that use this architecture.
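For illustration, here is a minimal sketch of one way to decouple attention parameters per modality while still computing attention over the whole image-text sequence, so deep fusion is preserved. This is not the torchscale implementation; names such as `DecoupledMultimodalAttention`, `split_apply`, and `split_position` are made up for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def split_apply(x, split_position, proj_a, proj_b):
    """Apply proj_a to tokens before split_position (e.g. vision)
    and proj_b to the remaining tokens (e.g. language)."""
    return torch.cat(
        [proj_a(x[:, :split_position]), proj_b(x[:, split_position:])], dim=1
    )


class DecoupledMultimodalAttention(nn.Module):
    """Each modality has its own q/k/v/out parameters (the parallel 'A'/'B'
    branches), but attention scores are computed over the full concatenated
    sequence, so image and text tokens still attend to each other."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Parallel parameter sets, one per modality.
        self.q_a, self.q_b = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.k_a, self.k_b = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.v_a, self.v_b = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.out_a, self.out_b = nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, split_position: int) -> torch.Tensor:
        b, n, d = x.shape
        q = split_apply(x, split_position, self.q_a, self.q_b)
        k = split_apply(x, split_position, self.k_a, self.k_b)
        v = split_apply(x, split_position, self.v_a, self.v_b)
        # Reshape to (batch, heads, tokens, head_dim) and attend over ALL tokens.
        q, k, v = (t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, n, d)
        return split_apply(out, split_position, self.out_a, self.out_b)


# Example: 2 samples, 6 image tokens followed by 4 text tokens, dim 64.
layer = DecoupledMultimodalAttention(dim=64, num_heads=4)
x = torch.randn(2, 10, 64)
print(layer(x, split_position=6).shape)  # torch.Size([2, 10, 64])
```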