SAT
Question about Eq. (4) in the paper.
Question: Why use dot product for Q and K in Eq. (4)?
Description:
I read the paper "GraphiT: Encoding Graph Structure in Transformers". Eq. (5) in that paper clearly describes Q Q^T, i.e., a matrix multiplication, and in their code they use torch.bmm()
to compute it from Q and K. I am confused: you cite this paper, but the attention function in Section 3.1 is not the same as the one in that paper.
@claying @lobray Thanks a lot!
Hi @xiyiyia In "GraphiT: Encoding Graph Structure in Transformers", my co-authors and I used the kernel smoothing formulation introduced in https://arxiv.org/pdf/1908.11775.pdf, originally proposed for NLP. We generalized the formulation to deal with graph-structured data. In order to simplify the formulation, we only considered a symmetric positive definite kernel, which requires the query and key matrix weights (W_q and W_k) to be the same. In any case, you can always write Eq.(4) as a sum of dot products by looking at the output of each node individually.
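To see why the two views coincide, here is a minimal NumPy sketch (not the actual GraphiT code; the sizes and weight matrix are made up for illustration) showing that each entry of Q Q^T is exactly a dot product between two nodes' query vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 5, 8, 4              # toy sizes, for illustration only
X = rng.standard_normal((n, d_in))    # node features
W_q = rng.standard_normal((d_in, d_out))

Q = X @ W_q        # queries; with W_k = W_q the keys are identical to Q
scores = Q @ Q.T   # the Q Q^T term from GraphiT's Eq. (5)

# Entry (i, j) is just the dot product of node i's and node j's queries:
i, j = 1, 3
assert np.allclose(scores[i, j], Q[i] @ Q[j])
# And the score matrix is symmetric, as expected for a symmetric kernel:
assert np.allclose(scores, scores.T)
```

So the matrix multiplication in the code and the per-node dot products in Eq. (4) are the same computation written at different granularities.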
In our SAT paper, Leslie (@lobray) and I proposed a new class of attentions to account for structural interaction between nodes. As a generalization of dot-product self-attention, we did not restrict W_q and W_k to be the same, which leads to similar or better performance than when W_q=W_k.
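The difference between the two settings can be sketched in a few lines of NumPy (again illustrative, with made-up toy shapes, not the SAT implementation): with independent W_q and W_k the score matrix is a general Q K^T and need not be symmetric, while tying W_k = W_q recovers the symmetric Q Q^T special case used in GraphiT:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 5, 8, 4
X = rng.standard_normal((n, d_in))
W_q = rng.standard_normal((d_in, d_out))
W_k = rng.standard_normal((d_in, d_out))  # independent key weights

Q, K = X @ W_q, X @ W_k
scores = Q @ K.T                  # standard dot-product attention scores

# With distinct W_q and W_k the scores are not symmetric in general...
assert not np.allclose(scores, scores.T)
# ...but tying W_k = W_q recovers the symmetric Q Q^T case:
sym = (X @ W_q) @ (X @ W_q).T
assert np.allclose(sym, sym.T)
```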
Hope this clears up your confusion.
Dexiong