BEiT-3 self-attention is not shared across modalities
Hi,
I found that the model used in BEiT-3, as implemented in torchscale, does not match what the paper describes.
In the Multiway Transformer, the self-attention layer should be shared across the different modalities. However, that is not the case in the implementation, as the screenshot shows: there are two parallel self-attention layers (A and B) instead of one shared layer.
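For reference, here is a rough sketch of how I understand the block described in the paper: one self-attention layer shared by all tokens, followed by modality-specific FFN experts. This is purely illustrative, not the torchscale code; names such as `MultiwayBlockSketch` and `split_position` are my own.

```python
import torch
import torch.nn as nn


def make_ffn(dim: int) -> nn.Sequential:
    return nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))


class MultiwayBlockSketch(nn.Module):
    """Shared self-attention over the full image-text sequence,
    then modality-specific FFN experts, as the paper describes."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.shared_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn_vision = make_ffn(dim)
        self.ffn_language = make_ffn(dim)

    def forward(self, x: torch.Tensor, split_position: int) -> torch.Tensor:
        # All image and text tokens attend to each other through ONE attention layer.
        x = x + self.shared_attn(x, x, x, need_weights=False)[0]
        # Tokens before split_position are routed to the vision expert, the rest to the language expert.
        x_v = x[:, :split_position] + self.ffn_vision(x[:, :split_position])
        x_l = x[:, split_position:] + self.ffn_language(x[:, split_position:])
        return torch.cat([x_v, x_l], dim=1)


block = MultiwayBlockSketch(dim=64, num_heads=4)
tokens = torch.randn(2, 10, 64)               # e.g. 6 image tokens + 4 text tokens
print(block(tokens, split_position=6).shape)  # torch.Size([2, 10, 64])
```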
Hi @xinghua-qu,
Please refer to Section 11 of the BEiT-3 supplementary material.
We conducted some architecture explorations and found that we can decouple the attention parameters while still maintaining the ability to perform deep fusion, so we released models that use this architecture.
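For illustration, here is a minimal sketch of one way to decouple attention parameters per modality while still computing attention over the whole image-text sequence, so deep fusion is preserved. This is not the torchscale implementation; names such as `DecoupledMultimodalAttention`, `split_apply`, and `split_position` are made up for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def split_apply(x, split_position, proj_a, proj_b):
    """Apply proj_a to tokens before split_position (e.g. vision)
    and proj_b to the remaining tokens (e.g. language)."""
    return torch.cat(
        [proj_a(x[:, :split_position]), proj_b(x[:, split_position:])], dim=1
    )


class DecoupledMultimodalAttention(nn.Module):
    """Each modality has its own q/k/v/out parameters (the parallel 'A'/'B'
    branches), but attention scores are computed over the full concatenated
    sequence, so image and text tokens still attend to each other."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Parallel parameter sets, one per modality.
        self.q_a, self.q_b = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.k_a, self.k_b = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.v_a, self.v_b = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.out_a, self.out_b = nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, split_position: int) -> torch.Tensor:
        b, n, d = x.shape
        q = split_apply(x, split_position, self.q_a, self.q_b)
        k = split_apply(x, split_position, self.k_a, self.k_b)
        v = split_apply(x, split_position, self.v_a, self.v_b)
        # Reshape to (batch, heads, tokens, head_dim) and attend over ALL tokens.
        q, k, v = (t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, n, d)
        return split_apply(out, split_position, self.out_a, self.out_b)


# Example: 2 samples, 6 image tokens followed by 4 text tokens, dim 64.
layer = DecoupledMultimodalAttention(dim=64, num_heads=4)
x = torch.randn(2, 10, 64)
print(layer(x, split_position=6).shape)  # torch.Size([2, 10, 64])
```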