TUPE
How to calculate correlation in Figure 2?
Hi, thanks for your wonderful work. I am unsure how you derived the correlation matrices shown in Figure 2, both in terms of the variables used in the calculation and the derivation itself.
For instance, does the word-to-word correlation matrix use the correlation of w_i W^{Q,1} and (w_j W^{K,1})^T as the variables for the calculation? Also, how do you reduce the dimensionality of the correlation matrix, given that the standard correlation calculation only deals with scalar variables?
Thanks!
Hi, @Redaimao
- we use the first self-attention layer for the calculation, as the later layers have residual connections.
- then, there is `Dropout(LayerNorm(x))` applied to `word_emb + pos_emb` before the transformer. Since `LayerNorm(a + b) != LayerNorm(a) + LayerNorm(b)`, you need to calculate the `word_emb` and `pos_emb` contributions correctly.
- then, in the first self-attention layer, you can calculate the four correlation terms for word and position (see the expansion and sketch after this list).
- for the final results, we randomly pick a batch (size=32) and average the correlation matrices along the batch dimension. Then, since there are multiple heads, we pick one head for demonstration (as in the sketch below).
- I think you misunderstand the term `correlation` in our paper; it actually refers to the attention scores (the logits before softmax).
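For reference, the four terms come from expanding the first-layer attention logits, up to the 1/√d scaling (this is the standard decomposition behind Figure 2; per the LayerNorm caveat above, w_i and p_j denote the word/position contributions as they actually enter the first layer, not the raw embeddings):

```latex
((w_i + p_i) W^{Q,1}) \big( (w_j + p_j) W^{K,1} \big)^\top
  = \underbrace{w_i W^{Q,1} (W^{K,1})^\top w_j^\top}_{\text{word-to-word}}
  + \underbrace{w_i W^{Q,1} (W^{K,1})^\top p_j^\top}_{\text{word-to-pos}}
  + \underbrace{p_i W^{Q,1} (W^{K,1})^\top w_j^\top}_{\text{pos-to-word}}
  + \underbrace{p_i W^{Q,1} (W^{K,1})^\top p_j^\top}_{\text{pos-to-pos}}
```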
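And a minimal PyTorch sketch of the whole recipe, assuming `x_word` and `x_pos` are those per-layer contributions and `W_Q`/`W_K` are one head's first-layer query/key weights; all names here are hypothetical stand-ins, not identifiers from this repo:

```python
import torch

def four_score_terms(x_word, x_pos, W_Q, W_K):
    """Decompose one head's first-layer attention logits into four terms.

    x_word, x_pos: (batch, seq_len, d_model) word/position contributions
    W_Q, W_K:      (d_model, d_head) first-layer projections for one head
    Returns four (seq_len, seq_len) matrices, averaged over the batch.
    """
    q_w, q_p = x_word @ W_Q, x_pos @ W_Q   # queries from each contribution
    k_w, k_p = x_word @ W_K, x_pos @ W_K   # keys from each contribution
    scale = W_Q.shape[-1] ** 0.5
    # ((x_word + x_pos) W_Q) ((x_word + x_pos) W_K)^T expands into:
    terms = {
        "word-to-word": q_w @ k_w.transpose(-1, -2),
        "word-to-pos":  q_w @ k_p.transpose(-1, -2),
        "pos-to-word":  q_p @ k_w.transpose(-1, -2),
        "pos-to-pos":   q_p @ k_p.transpose(-1, -2),
    }
    # Average each (batch, seq, seq) score matrix over the random batch.
    return {name: (t / scale).mean(dim=0) for name, t in terms.items()}

# Usage with random stand-ins (batch of 32, as in the paper):
b, n, d, dh = 32, 128, 768, 64
mats = four_score_terms(torch.randn(b, n, d), torch.randn(b, n, d),
                        torch.randn(d, dh), torch.randn(d, dh))
print({k: v.shape for k, v in mats.items()})  # four (128, 128) matrices
```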