Compact-Transformers
Order of `LayerNorm` & `Residual`
First of all, thanks for your amazing work!
It also seems that your TransformerEncoderLayer implementation differs a bit from 'mainstream' implementations, because the residual connection is created after the LayerNorm step:
https://github.com/SHI-Labs/Compact-Transformers/blob/3f3d093746bc58213d9e9af4431242d305717855/src/utils/transformers.py#L96-L99
However, in the original ViT paper and many other implementations, the residual connection is created before the LayerNorm step:
# attention sub-block: pre-norm, attend, then add the residual
src = src + self.drop_path(self.self_attn(self.pre_norm(src)))
# MLP sub-block: normalize into src2 so the residual below skips the norm
src2 = self.norm1(src)
src2 = self.linear2(self.dropout1(self.activation(self.linear1(src2))))
src = src + self.drop_path(self.dropout2(src2))
I'm just wondering whether this is on purpose or some kind of 'typo'? Thanks in advance!
Hi, thank you for your interest.
First off, great catch! This tiny difference was a result of a bad merge in a very old version of our repository.
Our implementation was based on PyTorch's nn.TransformerEncoderLayer (which at the time did not support pre-norm, so we wrote our own), and when we merged branches with the new implementation, those two lines must not have been merged correctly.
It should in fact be `src2 = self.norm1(src)`, and `src2` should be fed to `linear1` in the following line.
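To make the intended ordering concrete, here is a minimal, self-contained sketch of a pre-norm layer (illustrative only: module names like `pre_norm`/`norm1` mirror the snippet above, but it swaps in `torch.nn.MultiheadAttention` and an identity `drop_path` rather than the exact modules in this repo):

```python
import torch
import torch.nn as nn

class PreNormEncoderLayer(nn.Module):
    """Illustrative pre-norm encoder layer, not the exact repo code."""
    def __init__(self, d_model=256, nhead=4, dim_feedforward=512, dropout=0.1):
        super().__init__()
        self.pre_norm = nn.LayerNorm(d_model)
        self.self_attn = nn.MultiheadAttention(d_model, nhead,
                                               dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.activation = nn.GELU()
        self.dropout1 = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        self.dropout2 = nn.Dropout(dropout)
        self.drop_path = nn.Identity()  # stochastic depth omitted for brevity

    def forward(self, src):
        # Attention sub-block: normalize first, attend, then add the residual.
        y = self.pre_norm(src)
        attn_out, _ = self.self_attn(y, y, y, need_weights=False)
        src = src + self.drop_path(attn_out)
        # MLP sub-block: normalize into a temporary (src2) so that the residual
        # below is added to the un-normalized src -- the corrected ordering.
        src2 = self.norm1(src)
        src2 = self.linear2(self.dropout1(self.activation(self.linear1(src2))))
        src = src + self.drop_path(self.dropout2(src2))
        return src

# Quick shape check: (batch, sequence, embedding)
x = torch.randn(2, 16, 256)
print(PreNormEncoderLayer()(x).shape)  # torch.Size([2, 16, 256])
```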
However, after running a couple of experiments, we did not find that this change impacts the performance of the network to a noticeable degree. Most runs were well within the margin of error.
Changing this now would break our current checkpoints, and we want to maintain reproducibility. We will continue to investigate the issue.
Again, thank you for bringing this to our attention!
Closing due to inactivity.
Just to note: the pre-norm vs. post-norm choice is mostly about training stability. At initialization, pre-norm keeps gradient magnitudes smaller and better behaved than post-norm, which is part of why pre-norm models tend to train more easily.
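As a rough way to see this empirically, here is a small sketch (my own, not from this repo, and assuming a recent PyTorch that has the `norm_first` flag) that builds a deep encoder stack in both orderings and prints the total gradient norm at initialization:

```python
import torch
import torch.nn as nn

# Rough, illustrative probe of the stability claim: compare total gradient norms
# at initialization for a deep pre-norm vs. post-norm encoder stack.
def total_grad_norm(norm_first: bool, depth: int = 12, d_model: int = 128) -> float:
    torch.manual_seed(0)
    layer = nn.TransformerEncoderLayer(
        d_model=d_model, nhead=4, dim_feedforward=256,
        batch_first=True, norm_first=norm_first,  # True -> pre-norm ordering
    )
    model = nn.TransformerEncoder(layer, num_layers=depth)
    x = torch.randn(8, 32, d_model)
    model(x).sum().backward()  # dummy loss, just to populate gradients
    return torch.norm(torch.stack(
        [p.grad.norm() for p in model.parameters() if p.grad is not None])).item()

print("pre-norm  grad norm:", total_grad_norm(norm_first=True))
print("post-norm grad norm:", total_grad_norm(norm_first=False))
```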