gpt-2-simple
Why does the GPT-2 block apply the norm before the MLP, while the original Transformer applies the norm after the MLP?
Code: https://github.com/minimaxir/gpt-2-simple/blob/master/gpt_2_simple/src/model.py#L158
Original paper: https://arxiv.org/pdf/1706.03762.pdf
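
To make the two orderings concrete, here is a minimal sketch in plain Python, not the code from model.py. It assumes a layer norm without learned gain or bias, and `toy_mlp` is a hypothetical stand-in for the real MLP sublayer. The post-norm form follows the paper's LayerNorm(x + Sublayer(x)); the pre-norm form follows the x + mlp(norm(x)) pattern in the linked block.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_mlp(x):
    """Hypothetical stand-in for the MLP sublayer: doubles each element."""
    return [2.0 * v for v in x]

def post_ln_sublayer(x, sublayer):
    """Original Transformer ordering: LayerNorm(x + Sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln_sublayer(x, sublayer):
    """GPT-2 style ordering: x + Sublayer(LayerNorm(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

x = [1.0, 2.0, 3.0, 4.0]
post = post_ln_sublayer(x, toy_mlp)  # output is normalized
pre = pre_ln_sublayer(x, toy_mlp)    # residual path carries x through unnormalized
```

In the post-norm form the residual sum itself is normalized, so the output always has zero mean. In the pre-norm form only the input to the sublayer is normalized, and `x` passes through the residual connection unchanged.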