minGPT
Output of CausalSelfAttention
It seems that the output of this block is simply the multiple heads reshaped back into a single tensor. In the original "Attention is all you need" paper, however, the concatenated heads are passed through an additional linear layer, W^O. May I ask whether omitting it is intentional or an error? Thank you.
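For reference, a minimal sketch of what the paper's W^O projection would look like at the tail of the forward pass. This is illustrative only, not a quote of the repo's code; the names `c_proj` and `n_embd` are assumptions in the style of minGPT:

```python
import torch
import torch.nn as nn

class AttentionOutputProjection(nn.Module):
    """Illustrative tail of a multi-head attention block: after the per-head
    outputs are computed, the paper applies a learned output projection W^O."""

    def __init__(self, n_embd: int):
        super().__init__()
        # W^O from the paper: a linear map over the concatenated heads
        # (hypothetical name c_proj, following minGPT conventions)
        self.c_proj = nn.Linear(n_embd, n_embd)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (batch, seq_len, n_head, head_dim) -- per-head attention outputs
        B, T, nh, hd = y.size()
        # concatenate the heads back into a single embedding dimension
        y = y.contiguous().view(B, T, nh * hd)
        # apply W^O; without this line the heads are merely re-stacked
        # and information never mixes across head boundaries
        return self.c_proj(y)
```

The point of W^O is that a plain reshape keeps each head's output confined to its own slice of the embedding; the projection lets the model recombine the heads before the residual connection.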