NTK4A
About transformer attention scaling
Thank you very much for this great work!
Regarding the calculation here: https://github.com/thegregyang/NTK4A/blob/master/Transformer-NTK.ipynb
May I ask why the attention uses key-query scaling $1/d_{head} = 1/n$, instead of the more common $1/\sqrt{d_{head}}$?
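Just to make the question concrete, here is a minimal NumPy sketch of the two scalings I mean (this is my own illustration, not the notebook's code; the function name `attention_logits` is made up for this example):

```python
import numpy as np

def attention_logits(q, k, scaling="sqrt"):
    """Key-query dot products with either 1/sqrt(d_head) or 1/d_head scaling."""
    d_head = q.shape[-1]
    scale = d_head ** -0.5 if scaling == "sqrt" else 1.0 / d_head
    return scale * q @ k.T  # (seq_len, seq_len) attention logits

rng = np.random.default_rng(0)
seq_len, d_head = 4, 1024
q = rng.standard_normal((seq_len, d_head))
k = rng.standard_normal((seq_len, d_head))

# With i.i.d. unit-variance entries, 1/sqrt(d_head) keeps the logits at O(1),
# while 1/d_head shrinks them like 1/sqrt(d_head) as d_head grows.
print(attention_logits(q, k, scaling="sqrt").std())
print(attention_logits(q, k, scaling="linear").std())
```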