
Order of `LayerNorm` & `Residual`

Open · carefree0910 opened this issue 4 years ago · 1 comment

First of all, thanks for your amazing work!

It seems that your `TransformerEncoderLayer` implementation is a bit different from the 'mainstream' implementations, because you create the residual connection after the `LayerNorm` step:

https://github.com/SHI-Labs/Compact-Transformers/blob/3f3d093746bc58213d9e9af4431242d305717855/src/utils/transformers.py#L96-L99

However, in the original ViT paper and in many other implementations, the residual connection is created before the `LayerNorm`:

```python
src = src + self.drop_path(self.self_attn(self.pre_norm(src)))
src2 = self.norm1(src)
src2 = self.linear2(self.dropout1(self.activation(self.linear1(src2))))
src = src + self.drop_path(self.dropout2(src2))
```

I'm just wondering whether this is on purpose or some kind of 'typo'. Thanks in advance!

carefree0910 avatar Sep 08 '21 07:09 carefree0910

Hi, thank you for your interest.

First off, great catch! This tiny difference is the result of a bad merge in a very old version of our repository. We based our layer on PyTorch's `nn.TransformerEncoderLayer` (which at the time did not support pre-norm, so we wrote our own), and when we merged branches with the new implementation, those two lines must not have merged correctly.

It should in fact be `src2 = self.norm1(src)`, and `src2` should be fed to `linear1` on the following line.
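For reference, here is a minimal sketch of the intended pre-norm ordering, built around the snippet quoted above. The attribute names (`pre_norm`, `norm1`, `self_attn`, `drop_path`, `linear1`, `linear2`, `dropout1`, `dropout2`, `activation`) follow the quoted code; the class name, the hyperparameters, and the use of `nn.Identity` in place of stochastic depth are assumptions for illustration, not the repository's exact implementation:

```python
import torch
import torch.nn as nn


class PreNormEncoderLayer(nn.Module):
    """Sketch of a pre-norm transformer encoder layer (not the repo's code)."""

    def __init__(self, d_model=256, nhead=4, dim_feedforward=512, dropout=0.1):
        super().__init__()
        self.pre_norm = nn.LayerNorm(d_model)   # norm before attention
        self.norm1 = nn.LayerNorm(d_model)      # norm before the MLP
        self.self_attn = nn.MultiheadAttention(d_model, nhead,
                                               dropout=dropout, batch_first=True)
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.activation = nn.GELU()
        self.drop_path = nn.Identity()          # stand-in for stochastic depth

    def forward(self, src):
        # Attention sub-block: normalize first, then add the residual.
        normed = self.pre_norm(src)
        attn_out, _ = self.self_attn(normed, normed, normed, need_weights=False)
        src = src + self.drop_path(attn_out)
        # MLP sub-block: the fix described above -- norm1(src) feeds linear1.
        src2 = self.norm1(src)
        src2 = self.linear2(self.dropout1(self.activation(self.linear1(src2))))
        src = src + self.drop_path(self.dropout2(src2))
        return src


if __name__ == "__main__":
    x = torch.randn(2, 16, 256)                 # (batch, seq_len, d_model)
    print(PreNormEncoderLayer()(x).shape)       # torch.Size([2, 16, 256])
```

The only change relative to the thread's snippet is that the attention-side normalization is computed once and reused for query, key, and value.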

However, after running a couple of experiments, we did not find that this change impacts the performance of the network to a noticeable degree. Most runs were well within the margin of error.

Changing this now will break our current checkpoints and we want to maintain reproducibility. We will continue to investigate the issue.

Again, thank you for bringing this to our attention!

alihassanijr avatar Sep 09 '21 02:09 alihassanijr

Closing due to inactivity.

Just to note: the pre-norm vs. post-norm choice is mostly about training stability. Pre-norm gradients are smaller and better behaved across depth than post-norm gradients.
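For anyone skimming later, a minimal sketch contrasting the two orderings, with a generic `block` standing in for attention or the MLP (the class names here are illustrative, not the repository's code):

```python
import torch.nn as nn


class PreNormResidual(nn.Module):
    """x + block(norm(x)): the residual path bypasses the norm, which tends
    to keep gradient magnitudes well-scaled and training stable."""

    def __init__(self, dim, block):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.block = block

    def forward(self, x):
        return x + self.block(self.norm(x))


class PostNormResidual(nn.Module):
    """norm(x + block(x)): the original Transformer ordering, where the norm
    sits on the residual path itself."""

    def __init__(self, dim, block):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.block = block

    def forward(self, x):
        return self.norm(x + self.block(x))
```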

stevenwalton avatar Mar 24 '23 21:03 stevenwalton