
Unnecessary proj in WindowAttention?

Open askerlee opened this issue 2 years ago • 0 comments

We know that two linear transformations applied in a row can be merged into a single linear transformation if there is no activation function between them. In https://github.com/microsoft/Swin-Transformer/blob/main/models/swin_transformer.py#L141-L142:

x = (attn @ v).transpose(1, 2).reshape(B_, N, C)
x = self.proj(x)

v is obtained from a linear layer (the v slice of qkv). Then proj(attn @ v) should be equivalent to attn @ proj(v), shouldn't it? In theory, proj could then be folded into the v part of the qkv projection, since the two are consecutive linear transformations. So it seems self.proj is not strictly necessary, except that the extra dropout applied after it may somehow add feature robustness. Would there be any performance degradation if we removed self.proj? Thanks.
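
For reference, here is a minimal sketch (my own, not from the repo) of the premise the question relies on: two stacked nn.Linear layers with no activation in between can be folded into one by merging their weights and biases. f1 and f2 below are placeholder layers standing in for the v slice of qkv and for self.proj, and dim is an arbitrary example width.

# Sketch: fold two consecutive linear layers into one.
# y = W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2)
import torch
import torch.nn as nn

torch.manual_seed(0)
dim = 96
f1 = nn.Linear(dim, dim)   # stands in for the v slice of qkv (assumption)
f2 = nn.Linear(dim, dim)   # stands in for self.proj (assumption)

merged = nn.Linear(dim, dim)
with torch.no_grad():
    merged.weight.copy_(f2.weight @ f1.weight)
    merged.bias.copy_(f2.weight @ f1.bias + f2.bias)

x = torch.randn(4, 49, dim)  # (windows, tokens, channels)
print(torch.allclose(f2(f1(x)), merged(x), atol=1e-5))  # True

This only illustrates the general merging identity; whether dropping self.proj hurts accuracy in practice is exactly the question above.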

askerlee · Dec 08 '21 07:12