Transformer
Misinterpreted multi-head attention
Hi, I think you misinterpreted the multi-head attention in Vaswani's "Attention Is All You Need" paper.
What you do is project the query, keys, and values once (assume only one query for simplicity), separate the results into sections (heads), and apply attention to each head separately.
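To make sure I'm describing it right, here's a minimal PyTorch sketch of how I read your version (the sizes and weight names like W_q are placeholders I made up, not taken from your post, and I use a short sequence instead of a single query so the shapes are clearer):

```python
import torch
import torch.nn.functional as F

# How I read your version (placeholder sizes): one projection each for
# Q, K, V, then the projected result is separated into heads.
d_model, nr_heads, seq_len = 512, 8, 10
d_head = d_model // nr_heads

W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
x = torch.randn(seq_len, d_model)                    # self-attention: q = k = v = x

q = (x @ W_q).view(seq_len, nr_heads, d_head)        # project once ...
k = (x @ W_k).view(seq_len, nr_heads, d_head)        # ... then split into heads
v = (x @ W_v).view(seq_len, nr_heads, d_head)

scores = torch.einsum('qhd,khd->hqk', q, k) / d_head ** 0.5
out = torch.einsum('hqk,khd->qhd', F.softmax(scores, dim=-1), v).reshape(seq_len, d_model)
print(out.shape)  # torch.Size([10, 512])
```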
However, imo the paper says that you have nr_heads * 3 separate projections (so 3 sets of weights per head): you do the projections, apply the attention nr_heads times, then concatenate the results and project them back to the appropriate size.
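And here's a sketch of my reading of the paper, with the same made-up sizes, so the difference is easier to see (W_o is my name for the final output projection):

```python
import torch
import torch.nn.functional as F

# My reading of the paper (placeholder sizes): nr_heads * 3 separate
# projections, attention runs once per head, then the head outputs are
# concatenated and projected back to d_model with W_o.
d_model, nr_heads, d_head, seq_len = 512, 8, 64, 10

W_q = [torch.randn(d_model, d_head) for _ in range(nr_heads)]
W_k = [torch.randn(d_model, d_head) for _ in range(nr_heads)]
W_v = [torch.randn(d_model, d_head) for _ in range(nr_heads)]
W_o = torch.randn(nr_heads * d_head, d_model)

x = torch.randn(seq_len, d_model)                    # self-attention: q = k = v = x

heads = []
for h in range(nr_heads):
    q, k, v = x @ W_q[h], x @ W_k[h], x @ W_v[h]     # 3 weight matrices per head
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5 # scaled dot-product attention
    heads.append(F.softmax(scores, dim=-1) @ v)

out = torch.cat(heads, dim=-1) @ W_o                 # concatenate, project back
print(out.shape)  # torch.Size([10, 512])
```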
Let me know what you think. Otherwise, your post on Towards Data Science is very helpful for me to learn PyTorch. Best regards, Zoltán