
Question about Eq. (4) in the paper.

Open · xiyiyia opened this issue 1 year ago · 1 comment

Question: Why use dot product for Q and K in Eq. (4)?

Describe: I read the paper "GraphiT: Encoding Graph Structure in Transformers". Eq. (5) in that paper clearly describes QQ^T, i.e., a matrix multiplication, and in their code they use torch.bmm() to compute the product of Q and K. I am confused because you cite this paper, yet the formulation in Section 3.1 is not the same as theirs.
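
For reference, the QQ^T operation I mean can be written with torch.bmm like this (a minimal sketch; the tensor shapes are illustrative assumptions, not taken from the GraphiT code):

```python
import torch

# Hypothetical shapes: a batch of 2 graphs, 5 nodes each, hidden dimension 16.
Q = torch.randn(2, 5, 16)

# Batched Q @ Q^T: (2, 5, 16) x (2, 16, 5) -> (2, 5, 5) score matrices.
scores = torch.bmm(Q, Q.transpose(1, 2))
print(scores.shape)  # torch.Size([2, 5, 5])
```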

@claying @lobray Thanks a lot!

xiyiyia · Sep 20 '22 06:09

Hi @xiyiyia. In "GraphiT: Encoding Graph Structure in Transformers", my co-authors and I used the kernel smoothing formulation introduced in https://arxiv.org/pdf/1908.11775.pdf, originally proposed for NLP, and generalized it to graph-structured data. To simplify the formulation, we considered only a symmetric positive definite kernel, which requires the query and key weight matrices (W_q and W_k) to be the same. In any case, you can always write Eq. (4) as a sum of dot products by looking at the output of each node individually.
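
To make the shared-weight case concrete, here is a minimal sketch of the score computation (my own simplified illustration: one head, no scaling, and none of the kernel or structural terms from the papers). With W_q = W_k = W, the score matrix is QQ^T, and each row can equivalently be written node by node as dot products:

```python
import torch

n, d_in, d_out = 5, 8, 16
X = torch.randn(n, d_in)       # node features
W = torch.randn(d_in, d_out)   # shared projection, i.e. W_q = W_k = W

Q = X @ W                      # queries and keys coincide
scores = Q @ Q.T               # symmetric positive semi-definite score matrix
attn = torch.softmax(scores, dim=-1)

# The same scores for node i, written as individual dot products:
i = 0
row_i = torch.stack([Q[i] @ Q[j] for j in range(n)])
assert torch.allclose(attn[i], torch.softmax(row_i, dim=0), atol=1e-6)
```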

In our SAT paper, Leslie (@lobray) and I proposed a new class of attention mechanisms to account for structural interactions between nodes. As a generalization of dot-product self-attention, we do not restrict W_q and W_k to be the same, which leads to similar or better performance than setting W_q = W_k.
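
And, for comparison, the untied case (again a simplified sketch that leaves out SAT's structure-aware part; the names and shapes are illustrative only):

```python
import torch

n, d_in, d_out = 5, 8, 16
X = torch.randn(n, d_in)
W_q = torch.randn(d_in, d_out)   # query weights
W_k = torch.randn(d_in, d_out)   # key weights, no longer tied to W_q

Q, K = X @ W_q, X @ W_k
scores = Q @ K.T                 # generally non-symmetric when W_q != W_k
attn = torch.softmax(scores / d_out ** 0.5, dim=-1)
print(torch.allclose(scores, scores.T))  # typically False
```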

Hope this clears up your confusion.

Dexiong

claying · Sep 20 '22 08:09