transformer
[Bug] Dropout should come before the residual connection and layer norm
Section 5.4 of the original paper says:
"We apply dropout to the output of each sub-layer, before it is added to the sub-layer input and normalized."
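
To make the expected order concrete, here is a minimal sketch in plain Python of what the quoted sentence describes: LayerNorm(x + Dropout(Sublayer(x))). It is not this repository's code. The names `dropout`, `layer_norm`, and `residual_block` are hypothetical, the layer norm has no learned gain or bias, and the dropout is the common "inverted" variant, which is an assumption on my part.

```python
import math
import random


def dropout(x, p, rng, training=True):
    """Inverted dropout: zero each element with probability p and
    scale the survivors by 1 / (1 - p). Identity when not training."""
    if not training or p == 0.0:
        return list(x)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in x]


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance
    (no learned gain/bias, to keep the sketch short)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_block(x, sublayer, p, rng, training=True):
    """Order from section 5.4: sub-layer -> dropout -> add input -> normalize."""
    out = sublayer(x)                          # 1. sub-layer output
    out = dropout(out, p, rng, training)       # 2. dropout on that output only
    summed = [a + b for a, b in zip(x, out)]   # 3. residual connection
    return layer_norm(summed)                  # 4. layer norm last


# Hypothetical usage: a toy "sub-layer" that doubles its input.
x = [1.0, 2.0, 3.0, 4.0]
y = residual_block(x, lambda v: [2.0 * a for a in v], p=0.1, rng=random.Random(0))
```

The point is that dropout touches only the sub-layer output, so the residual path `x` is never zeroed, and normalization sees the sum after dropout has been applied.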