Why padding in attention?
https://github.com/mit-han-lab/efficientvit/blob/f47de541358135d646779c90c64cd698cddd5394/efficientvit/models/nn/ops.py#L425-L428
Just wondering why one element is padded onto `v` and the last element of `out` is then dropped.
Have you figured it out? To me, line 428 seems to be another form of equation (3) in the paper: `out[..., :-1]` is equal to $QK^\top V$, and `out[..., -1:]` is $QK^\top \mathbf{1}$, since in line 425 the last channel of `v` is padded with all ones, so the last column of $K^\top V$ is exactly $K^\top \mathbf{1}$. I don't have a strict mathematical proof, but I think this is the basic idea of the implementation trick.
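For anyone else puzzling over this, here is a minimal self-contained sketch of the trick (my own reproduction with made-up shapes and names, not the repo's exact code) that checks both slices against the unpadded matmuls:

```python
import torch
import torch.nn.functional as F

B, heads, N, d = 2, 4, 16, 8  # batch, heads, tokens, head dim (all made up)
q = torch.relu(torch.randn(B, heads, N, d))  # ReLU kernel, as in the paper
k = torch.relu(torch.randn(B, heads, N, d))
v = torch.randn(B, heads, N, d)

# The trick: pad v with one channel of ones so that a single matmul chain
# produces both Q(K^T V) and the normalizer Q(K^T 1).
v_pad = F.pad(v, (0, 1), mode="constant", value=1.0)  # (B, heads, N, d+1)
trans_k = k.transpose(-1, -2)                         # (B, heads, d, N)
out = q @ (trans_k @ v_pad)                           # (B, heads, N, d+1)

# First d channels are Q K^T V; the padded channel is Q K^T 1.
ones = torch.ones(B, heads, N, 1)
assert torch.allclose(out[..., :-1], q @ (trans_k @ v), atol=1e-5)
assert torch.allclose(out[..., -1:], q @ (trans_k @ ones), atol=1e-5)
```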
I have the same confusion.
> To me, line 428 seems to be another form of equation (3) in the paper […]
It makes sense, but at first I had no idea what the trick was for. `out = out[..., :-1] / (out[..., -1:] + self.eps)` is the normalization step of linear attention: it computes $\mathrm{out} = \frac{QK^\top V}{QK^\top \mathbf{1} + \epsilon}$, where eps is a small number for numerical stability. As far as I can tell, the point of the padding is that the numerator $QK^\top V$ and the denominator $QK^\top \mathbf{1}$ then come out of a single matmul chain instead of two separate ones.
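Continuing the sketch above (same tensors; the eps value here is an assumption, the repo uses its own `self.eps`), the division reproduces the explicit softmax-free normalization:

```python
eps = 1e-15  # assumed value; stands in for self.eps

# Padded-trick result: Q K^T V / (Q K^T 1 + eps) from one matmul chain.
attn = out[..., :-1] / (out[..., -1:] + eps)

# Explicit form: weight every v row by relu(q) @ relu(k)^T, then normalize
# each query's weights by their sum (that sum is exactly Q K^T 1).
weights = q @ trans_k                                   # (B, heads, N, N)
expected = (weights @ v) / (weights.sum(dim=-1, keepdim=True) + eps)
assert torch.allclose(attn, expected, atol=1e-4)
```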
> To me, line 428 seems to be another form of equation (3) in the paper […]
I agree!