
Why is AddNorm() in the Transformer (Section 9.3.3, line: `self.norm(self.dropout(Y) + X)`) different from the original one (line: `x + self.dropout(sublayer(self.norm(x)))`)?

Open • kaharjan opened this issue 4 years ago • 1 comment

In The Annotated Transformer, the residual connection is implemented as `x + self.dropout(sublayer(self.norm(x)))`. However, in Section 9.3.3 it is implemented as `self.norm(self.dropout(Y) + X)`. I think it should be `X + self.norm(self.dropout(Y))`. Am I correct?

kaharjan • Mar 24 '20

Hey @kaharjan! If we read the paper carefully (https://arxiv.org/abs/1706.03762), Section 3.1 states: "We employ a residual connection around each of the two sub-layers, followed by layer normalization." Hence, it should be an 'Add' followed by a 'Norm', i.e.,

  • Add: `self.dropout(Y) + X`
  • Norm: `self.norm()`

There is no extra `X +`. Please ask further questions on our forum (https://discuss.d2l.ai/).
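
For concreteness, here is a minimal PyTorch sketch of both layouts side by side. The class names, the `normalized_shape`/`dropout` constructor arguments, and the toy shapes are illustrative rather than the exact code from either source:

```python
import torch
from torch import nn


class AddNorm(nn.Module):
    """Post-norm residual connection, as in Section 9.3.3:
    residual add first, layer normalization second."""
    def __init__(self, normalized_shape, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(normalized_shape)

    def forward(self, X, Y):
        # Add (with dropout applied to the sublayer output Y), then Norm,
        # matching "residual connection ... followed by layer normalization".
        return self.norm(self.dropout(Y) + X)


class PreNormResidual(nn.Module):
    """Pre-norm variant, as in The Annotated Transformer:
    normalize the input before the sublayer; the residual path
    skips the normalization entirely."""
    def __init__(self, normalized_shape, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(normalized_shape)

    def forward(self, x, sublayer):
        # Norm first, then the sublayer, then dropout and the residual add.
        return x + self.dropout(sublayer(self.norm(x)))


# Toy check: both preserve the input shape.
X = torch.ones((2, 3, 4))
Y = torch.rand((2, 3, 4))
add_norm = AddNorm(normalized_shape=4, dropout=0.1)
pre_norm = PreNormResidual(normalized_shape=4, dropout=0.1)
print(add_norm(X, Y).shape)                        # torch.Size([2, 3, 4])
print(pre_norm(X, lambda z: torch.relu(z)).shape)  # torch.Size([2, 3, 4])
```

Note that these are genuinely different architectures, not a bug in either codebase: the book's post-norm layout follows the original paper verbatim, while the Annotated Transformer's pre-norm layout is a common variant that tends to be easier to optimize in deep stacks. Neither reduces to the suggested `X + self.norm(self.dropout(Y))`.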

goldmermaid • Jun 08 '20