[Enhancement] Better Relative Positional Encoding (no padding needed)
Is your feature request related to a problem? Please describe. The implementations of Relative Positional Encoding all include zero padding, which turns out to be unnecessary.
Describe the solution you'd like Relative Positional Encoding without padding
Describe alternatives you've considered I tested this idea myself, and it can be adopted for Halo Attention in HaloNet as well. When padding is not used, the code and slicing are actually simpler and easier to understand.
Additional context
Here is my implementation without padding for halo attention when h=1. Could you please check whether h != 1 is needed, and whether the implementation needs to be changed for h != 1?
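(Simplified sketch of the idea only, for the 1D case where queries and keys cover the same W positions; the helper name rel_logits_1d_no_pad is just for illustration, not the actual code.)

```python
import torch

def rel_logits_1d_no_pad(q, rel_k):
    """Relative logits along one dimension using a gather instead of the
    pad-and-reshape trick.

    q:     (B, H, W, dim)    queries
    rel_k: (2 * W - 1, dim)  one embedding per relative offset in [-(W-1), W-1]
    returns: (B, H, W, W) where out[..., i, j] uses the embedding for offset (j - i)
    """
    B, H, W, dim = q.shape
    # dot product of each query with every relative embedding: (B, H, W, 2W-1)
    x = torch.einsum('bhwd,rd->bhwr', q, rel_k)
    # rel_idx[i, j] = (j - i) + (W - 1) maps each (query, key) pair to its offset bucket
    pos = torch.arange(W, device=q.device)
    rel_idx = pos[None, :] - pos[:, None] + (W - 1)   # (W, W)
    # gather the needed columns directly; no padding, flattening, or slicing
    return x.gather(-1, rel_idx.expand(B, H, W, W))
```

For the halo case the key axis is longer than the query axis (block plus halo on each side), so rel_k has more rows, the index matrix becomes rectangular, and the offsets get an extra constant shift; that is the part I would like checked for h != 1.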
I also have a question about the correctness of line 90. https://github.com/rwightman/pytorch-image-models/blob/07eb2f12005f75be3ed6c2394f3512e7a8ac640a/timm/models/layers/halo_attn.py#L84-L90
Should the permute_mask for the height dimension be (0, 1, 3, 4, 2) instead of (0, 3, 1, 4, 2), since the roles of H and W are already transposed in q (line 89)? If I'm wrong, would you care to explain it to me? I couldn't wrap my head around it T_T.
Thanks!