acnn
Attention first or convolution first?
Hi, my implementation is similar to yours. In the input attention layer, I applied a convolution with kernel size 3 first and then multiplied the result by the attention weights. I didn't see a mathematical difference between this version and the sliding-window version. What's your opinion on it?
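
For concreteness, here is a minimal PyTorch sketch of the two orderings being compared. The tensor shapes, the form of the attention weights, and all variable names are assumptions for illustration only, not the actual code from either implementation:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Assumed setup: a batch of 1-D sequences with C channels.
B, C, L = 2, 4, 16
x = torch.randn(B, C, L)                             # input features
attn = torch.softmax(torch.randn(B, 1, L), dim=-1)   # per-position attention weights

conv = nn.Conv1d(C, C, kernel_size=3, padding=1, bias=False)

# Ordering A (this comment's version): convolve first, then multiply by attention.
out_conv_first = conv(x) * attn

# Ordering B: weight the input by attention first, then convolve.
out_attn_first = conv(x * attn)

# Check numerically whether the two orderings agree. Since the convolution
# mixes neighboring positions, they coincide only when the attention weights
# are constant within each kernel window.
print(torch.allclose(out_conv_first, out_attn_first))
print((out_conv_first - out_attn_first).abs().max())
```

Running a comparison like this on random inputs might be a quick way to settle whether the two versions are actually equivalent for the attention weights used here.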