ESRT
What is the difference between Feature Split (FS) in EMHA and window-attention?
Thank you for your work.
After reading your paper, I have a question.
Regarding Feature Split (FS) in Sec. 3.2.2 (Efficient Transformer), I am confused about the difference between FS and window-attention (from Swin Transformer).
Your FS splits the features into s segments, so each attention map is (N/s) x (N/s), while window-attention in Swin Transformer partitions the feature map into non-overlapping M x M windows, where M is the window size.
In both cases, self-attention is computed locally: within each N/s-token segment (FS) and within each M x M window (window partitioning), respectively.
Of course s (in FS) and M (in window-attention) can take different values, but I don't understand the difference in the mechanisms themselves.
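To make my question concrete, here is how I currently picture the two splitting schemes. This is only a shape-level sketch with made-up sizes (B, H, W, C, s, M are my own example values, not from your code), not your actual implementation:

```python
import numpy as np

B, H, W, C = 1, 8, 8, 4
N = H * W   # number of tokens after flattening the feature map
s = 4       # FS split factor (example value)
M = 4       # Swin window size (example value)

# Feature Split, as I understand Sec. 3.2.2: split the flattened token
# sequence (B, N, C) into s segments along the token axis, then compute
# self-attention within each (N/s)-token segment.
tokens = np.zeros((B, N, C))
fs_segments = np.split(tokens, s, axis=1)   # s arrays of shape (B, N/s, C)
fs_attn_size = (N // s, N // s)             # attention map per segment

# Window partition (Swin): split the 2-D map (B, H, W, C) into
# non-overlapping M x M windows, then compute self-attention per window.
fmap = np.zeros((B, H, W, C))
windows = (fmap.reshape(B, H // M, M, W // M, M, C)
               .transpose(0, 1, 3, 2, 4, 5)
               .reshape(-1, M * M, C))      # (B * num_windows, M*M, C)
win_attn_size = (M * M, M * M)              # attention map per window
```

With these example values both schemes end up attending over groups of 16 tokens, which is exactly why I can't see the mechanism difference: FS seems to cut the 1-D token sequence into segments, while window partitioning cuts the 2-D map into spatial windows. Is that the only distinction?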
Once more, thank you for your hard work.
Did you get an answer?