annotated-transformer
Some doubts about SublayerConnection
According to what you wrote:
“That is, the output of each sub-layer is $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$, where $\mathrm{Sublayer}(x)$ is the function implemented by the sub-layer itself. We apply dropout (cite) to the output of each sub-layer, before it is added to the sub-layer input and normalized.”
I think the return value should be `self.norm(x + self.dropout(sublayer(x)))` rather than `x + self.dropout(sublayer(self.norm(x)))`.
Looking forward to your reply.
Where do we write `x + self.dropout(sublayer(self.norm(x)))`? That's not what the passage you quote says.
In `the_annotated_transformer.py`, on line 357. The function documentation even says that the norm was moved.
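To make the difference concrete, here is a minimal sketch of the two variants being discussed. The class names and the use of `torch.nn.LayerNorm` are illustrative only (the notebook defines its own `LayerNorm` module); this is not the repo's exact code.

```python
import torch
import torch.nn as nn


class PostNormSublayerConnection(nn.Module):
    """Post-norm, as the paper's text describes: LayerNorm(x + Sublayer(x))."""

    def __init__(self, size, dropout):
        super().__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Dropout on the sub-layer output, add the residual, then normalize.
        return self.norm(x + self.dropout(sublayer(x)))


class PreNormSublayerConnection(nn.Module):
    """Pre-norm, as in the notebook's code: x + Dropout(Sublayer(LayerNorm(x)))."""

    def __init__(self, size, dropout):
        super().__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Normalize first, apply the sub-layer and dropout, then add the residual.
        return x + self.dropout(sublayer(self.norm(x)))


if __name__ == "__main__":
    x = torch.randn(2, 5, 512)
    ff = nn.Linear(512, 512)  # stand-in for a real sub-layer (attention or feed-forward)
    print(PostNormSublayerConnection(512, 0.1)(x, ff).shape)
    print(PreNormSublayerConnection(512, 0.1)(x, ff).shape)
```

The quoted passage describes the first (post-norm) form, while `SublayerConnection.forward` in the notebook returns the second (pre-norm) form, with the docstring noting that the norm is applied first for code simplicity.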
I have the same question as you. The explanation can be found in #92.
Maybe it's best to mention this issue in the notebook, because it causes confusion for many.
> Where do we write `x + self.dropout(sublayer(self.norm(x)))`? That's not what the passage you quote says.
https://github.com/harvardnlp/annotated-transformer/blob/debc9fd747bb2123160a98046ad1c2d4da44a567/the_annotated_transformer.py#L357