do-you-even-need-attention
Interaction between patches through a transpose may have a stronger role to play?
Hi, I was going through your experiment report. You made the point that, since you were able to get good performance without using attention layers, the strong performance of ViT may have more to do with its patch embedding layer than with attention.
But I believe it may also have to do with how you establish interaction between patches through a transpose, very similar to what was done in MLP-Mixer; see the rough sketch below for what I mean.
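To make the idea concrete, here is a minimal sketch (not your actual code, just an illustrative assumption with made-up names and dimensions) of a feed-forward layer applied across the patch dimension via a transpose, in the spirit of MLP-Mixer's token-mixing MLP:

```python
import torch
import torch.nn as nn


class PatchMixingFF(nn.Module):
    """Feed-forward applied over the patch (token) dimension rather than the channel dimension."""

    def __init__(self, num_patches: int, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_patches, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, num_patches),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, channels)
        x = x.transpose(1, 2)      # -> (batch, channels, num_patches)
        x = self.net(x)            # mixes information across patches
        return x.transpose(1, 2)   # -> (batch, num_patches, channels)


# Example: 196 patches (14x14 grid), 384 channels (illustrative sizes)
x = torch.randn(2, 196, 384)
out = PatchMixingFF(num_patches=196, hidden_dim=384)(x)
print(out.shape)  # torch.Size([2, 196, 384])
```

Because the linear layers here act along the patch axis, every patch can exchange information with every other patch, so some cross-patch mixing still happens even without attention.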
Would love to know your thoughts on this?