Transformer
Misinterpreted multi-head attention
Hi, I think you misinterpreted the multi-head attention in Vaswani's "Attention Is All You Need" paper.
What you do is project the query, keys, and values once (assume only one query for simplicity), separate the results into sections (heads), and apply attention to each head separately.
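To make sure I'm describing it right, here's a minimal PyTorch sketch of how I read your version (the sizes and weight names like W_q are placeholders I made up, not taken from your post, and I use a short sequence instead of a single query so the shapes are clearer):

```python
import torch
import torch.nn.functional as F

# How I read your version (placeholder sizes): one projection each for
# Q, K, V, then the projected result is separated into heads.
d_model, nr_heads, seq_len = 512, 8, 10
d_head = d_model // nr_heads

W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
x = torch.randn(seq_len, d_model)                    # self-attention: q = k = v = x

q = (x @ W_q).view(seq_len, nr_heads, d_head)        # project once ...
k = (x @ W_k).view(seq_len, nr_heads, d_head)        # ... then split into heads
v = (x @ W_v).view(seq_len, nr_heads, d_head)

scores = torch.einsum('qhd,khd->hqk', q, k) / d_head ** 0.5
out = torch.einsum('hqk,khd->qhd', F.softmax(scores, dim=-1), v).reshape(seq_len, d_model)
print(out.shape)  # torch.Size([10, 512])
```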
However, imo the paper says that you have nr_heads * 3 separate projections (so 3 sets of weights per head): you do the projections, apply the attention nr_heads times, then concatenate the results and project them back to the appropriate size.
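And here's a sketch of my reading of the paper, with the same made-up sizes, so the difference is easier to see (W_o is my name for the final output projection):

```python
import torch
import torch.nn.functional as F

# My reading of the paper (placeholder sizes): nr_heads * 3 separate
# projections, attention runs once per head, then the head outputs are
# concatenated and projected back to d_model with W_o.
d_model, nr_heads, d_head, seq_len = 512, 8, 64, 10

W_q = [torch.randn(d_model, d_head) for _ in range(nr_heads)]
W_k = [torch.randn(d_model, d_head) for _ in range(nr_heads)]
W_v = [torch.randn(d_model, d_head) for _ in range(nr_heads)]
W_o = torch.randn(nr_heads * d_head, d_model)

x = torch.randn(seq_len, d_model)                    # self-attention: q = k = v = x

heads = []
for h in range(nr_heads):
    q, k, v = x @ W_q[h], x @ W_k[h], x @ W_v[h]     # 3 weight matrices per head
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5 # scaled dot-product attention
    heads.append(F.softmax(scores, dim=-1) @ v)

out = torch.cat(heads, dim=-1) @ W_o                 # concatenate, project back
print(out.shape)  # torch.Size([10, 512])
```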
Let me know what you think. Otherwise, your post on Towards Data Science is very helpful for me to learn PyTorch. Best regards, Zoltán