Why padding in attention?
https://github.com/mit-han-lab/efficientvit/blob/f47de541358135d646779c90c64cd698cddd5394/efficientvit/models/nn/ops.py#L425-L428
Just wondering why one element is padded onto `v` and the last element of `out` is then dropped.
Have you figured it out? To me, line 428 seems to be another form of equation (3) in the paper: `out[..., :-1]` is equal to $QK^\top V$, and `out[..., -1:]` is $QK^\top \mathbf{1}$, since in line 425 the last channel of `v` is padded with all ones, so the last column of $K^\top V$ is exactly $K^\top \mathbf{1}$. I don't have a strict mathematical proof, but I think this is the basic idea of the implementation trick.
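For anyone else puzzling over this, here is a minimal self-contained sketch of the trick (my own reproduction with made-up shapes and names, not the repo's exact code) that checks both slices against the unpadded matmuls:

```python
import torch
import torch.nn.functional as F

B, heads, N, d = 2, 4, 16, 8  # batch, heads, tokens, head dim (all made up)
q = torch.relu(torch.randn(B, heads, N, d))  # ReLU kernel, as in the paper
k = torch.relu(torch.randn(B, heads, N, d))
v = torch.randn(B, heads, N, d)

# The trick: pad v with one channel of ones so that a single matmul chain
# produces both Q(K^T V) and the normalizer Q(K^T 1).
v_pad = F.pad(v, (0, 1), mode="constant", value=1.0)  # (B, heads, N, d+1)
trans_k = k.transpose(-1, -2)                         # (B, heads, d, N)
out = q @ (trans_k @ v_pad)                           # (B, heads, N, d+1)

# First d channels are Q K^T V; the padded channel is Q K^T 1.
ones = torch.ones(B, heads, N, 1)
assert torch.allclose(out[..., :-1], q @ (trans_k @ v), atol=1e-5)
assert torch.allclose(out[..., -1:], q @ (trans_k @ ones), atol=1e-5)
```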
I have the same confusion.
> To me, line 428 seems to be another form of equation (3) in the paper […]
It makes sense, but at first I had no idea what the trick was for. `out = out[..., :-1] / (out[..., -1:] + self.eps)` is the normalization step of linear attention: it computes $\mathrm{out} = \frac{QK^\top V}{QK^\top \mathbf{1} + \epsilon}$, where eps is a small number for numerical stability. As far as I can tell, the point of the padding is that the numerator $QK^\top V$ and the denominator $QK^\top \mathbf{1}$ then come out of a single matmul chain instead of two separate ones.
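Continuing the sketch above (same tensors; the eps value here is an assumption, the repo uses its own `self.eps`), the division reproduces the explicit softmax-free normalization:

```python
eps = 1e-15  # assumed value; stands in for self.eps

# Padded-trick result: Q K^T V / (Q K^T 1 + eps) from one matmul chain.
attn = out[..., :-1] / (out[..., -1:] + eps)

# Explicit form: weight every v row by relu(q) @ relu(k)^T, then normalize
# each query's weights by their sum (that sum is exactly Q K^T 1).
weights = q @ trans_k                                   # (B, heads, N, N)
expected = (weights @ v) / (weights.sum(dim=-1, keepdim=True) + eps)
assert torch.allclose(attn, expected, atol=1e-4)
```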
> To me, line 428 seems to be another form of equation (3) in the paper […]
I agree!