NTK4A
About transformer attention scaling
Thank you very much for this great work!
Regarding the calculation here: https://github.com/thegregyang/NTK4A/blob/master/Transformer-NTK.ipynb
May I ask why the attention uses key-query scaling $1/d_{head} = 1/n$, instead of the more common $1/\sqrt{d_{head}}$?
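Just to make the question concrete, here is a minimal NumPy sketch of the two scalings I mean (this is my own illustration, not the notebook's code; the function name `attention_logits` is made up for this example):

```python
import numpy as np

def attention_logits(q, k, scaling="sqrt"):
    """Key-query dot products with either 1/sqrt(d_head) or 1/d_head scaling."""
    d_head = q.shape[-1]
    scale = d_head ** -0.5 if scaling == "sqrt" else 1.0 / d_head
    return scale * q @ k.T  # (seq_len, seq_len) attention logits

rng = np.random.default_rng(0)
seq_len, d_head = 4, 1024
q = rng.standard_normal((seq_len, d_head))
k = rng.standard_normal((seq_len, d_head))

# With i.i.d. unit-variance entries, 1/sqrt(d_head) keeps the logits at O(1),
# while 1/d_head shrinks them like 1/sqrt(d_head) as d_head grows.
print(attention_logits(q, k, scaling="sqrt").std())
print(attention_logits(q, k, scaling="linear").std())
```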