ESRT
What is the difference between Feature Split (FS) in EMHA and window-attention?
Thank you for your work.
After reading your paper, I have a question.
Regarding Feature Split (FS) in Sec. 3.2.2 (Efficient Transformer), I am confused about the difference between FS and window-attention (from Swin Transformer).
Your FS splits the features into s segments, so each attention map is (N/s) x (N/s), while window-attention in Swin Transformer partitions the feature map into non-overlapping M x M windows, where M is the window size.
In both cases, self-attention is computed locally: within each N/s-token segment (FS) and within each M x M window (window partitioning), respectively.
Of course s (in FS) and M (in window-attention) can take different values, but I don't understand the difference in the mechanisms themselves.
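To make my question concrete, here is how I currently picture the two splitting schemes. This is only a shape-level sketch with made-up sizes (B, H, W, C, s, M are my own example values, not from your code), not your actual implementation:

```python
import numpy as np

B, H, W, C = 1, 8, 8, 4
N = H * W   # number of tokens after flattening the feature map
s = 4       # FS split factor (example value)
M = 4       # Swin window size (example value)

# Feature Split, as I understand Sec. 3.2.2: split the flattened token
# sequence (B, N, C) into s segments along the token axis, then compute
# self-attention within each (N/s)-token segment.
tokens = np.zeros((B, N, C))
fs_segments = np.split(tokens, s, axis=1)   # s arrays of shape (B, N/s, C)
fs_attn_size = (N // s, N // s)             # attention map per segment

# Window partition (Swin): split the 2-D map (B, H, W, C) into
# non-overlapping M x M windows, then compute self-attention per window.
fmap = np.zeros((B, H, W, C))
windows = (fmap.reshape(B, H // M, M, W // M, M, C)
               .transpose(0, 1, 3, 2, 4, 5)
               .reshape(-1, M * M, C))      # (B * num_windows, M*M, C)
win_attn_size = (M * M, M * M)              # attention map per window
```

With these example values both schemes end up attending over groups of 16 tokens, which is exactly why I can't see the mechanism difference: FS seems to cut the 1-D token sequence into segments, while window partitioning cuts the 2-D map into spatial windows. Is that the only distinction?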
Once more, thank you for your hard work.
Did you get an answer?