
What is the difference between the attention block in TopFormer and the Non-local block?

ChenDirk opened this issue 3 years ago · 3 comments

ChenDirk · May 07 '22 08:05

Besides reducing the dimensionality of Q and K, we use multi-head self-attention rather than the single-head self-attention used in the Non-local block.

speedinghzl · May 11 '22 11:05
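To make the comparison concrete, here is a minimal PyTorch sketch of the idea, not the repository's actual `Attention` module: the per-head query/key dimension (`key_dim`) is kept small while values keep a larger dimension, and several heads are used instead of the single head of a Non-local block. The class name `ReducedDimMHSA` and all hyperparameters below are illustrative.

```python
import torch
import torch.nn as nn

class ReducedDimMHSA(nn.Module):
    """Illustrative multi-head self-attention with a reduced per-head
    query/key dimension. A Non-local block corresponds roughly to the
    single-head case (num_heads=1) without this key_dim reduction."""
    def __init__(self, dim, key_dim=16, num_heads=4, attn_ratio=2):
        super().__init__()
        self.num_heads = num_heads
        self.key_dim = key_dim
        self.scale = key_dim ** -0.5
        self.d = int(attn_ratio * key_dim)                 # per-head value dim
        nh_kd = key_dim * num_heads
        self.to_q = nn.Conv2d(dim, nh_kd, 1, bias=False)
        self.to_k = nn.Conv2d(dim, nh_kd, 1, bias=False)
        self.to_v = nn.Conv2d(dim, self.d * num_heads, 1, bias=False)
        self.proj = nn.Conv2d(self.d * num_heads, dim, 1, bias=False)

    def forward(self, x):                                  # x: (B, C, H, W)
        B, C, H, W = x.shape
        N = H * W
        q = self.to_q(x).reshape(B, self.num_heads, self.key_dim, N)
        k = self.to_k(x).reshape(B, self.num_heads, self.key_dim, N)
        v = self.to_v(x).reshape(B, self.num_heads, self.d, N)
        attn = (q.transpose(-2, -1) @ k) * self.scale      # (B, heads, N, N)
        attn = attn.softmax(dim=-1)
        out = v @ attn.transpose(-2, -1)                   # (B, heads, d, N)
        out = out.reshape(B, self.num_heads * self.d, H, W)
        return self.proj(out)

x = torch.randn(1, 64, 16, 16)
print(ReducedDimMHSA(64)(x).shape)                         # torch.Size([1, 64, 16, 16])
```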

Thank you for your reply. I checked the code, and the implementation is amazing! I also found that LayerNorm is very slow at inference, while BatchNorm can be merged into the convolutional layer, so BN adds no extra FLOPs. Amazing design! But I noticed that the Attention block contains an activation function, which differs from a standard MultiHeadAttention layer. Did you compare the performance with and without the activation function?

ChenDirk · May 11 '22 12:05
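For reference, the BatchNorm-into-Conv folding mentioned above can be sketched as follows. This is a generic fusion routine in PyTorch, not code taken from the TopFormer repository, and the helper name `fuse_conv_bn` is illustrative.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BatchNorm2d that follows a Conv2d into the conv weights,
    giving a single conv that is equivalent at inference time."""
    fused = nn.Conv2d(
        conv.in_channels, conv.out_channels,
        kernel_size=conv.kernel_size, stride=conv.stride,
        padding=conv.padding, dilation=conv.dilation,
        groups=conv.groups, bias=True,
    )
    # scale = gamma / sqrt(running_var + eps), one factor per output channel
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    conv_bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused

# Quick check that the fused conv matches conv + BN in eval mode.
conv, bn = nn.Conv2d(8, 16, 3, padding=1), nn.BatchNorm2d(16)
bn.weight.data.uniform_(0.5, 1.5)
bn.bias.data.uniform_(-0.5, 0.5)
bn.running_mean.uniform_(-1.0, 1.0)
bn.running_var.uniform_(0.5, 2.0)
conv.eval(); bn.eval()
x = torch.randn(1, 8, 32, 32)
fused = fuse_conv_bn(conv, bn)
print(torch.allclose(bn(conv(x)), fused(x), atol=1e-5))    # True
```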

The activation function was originally inserted to increase non-linearity. However, we found that removing it achieves slightly better performance, so I suggest removing it when using TopFormer.

speedinghzl · May 11 '22 12:05