coordinate-descent-attention
Interest in TopK Attention
Hi, I see that you mentioned "I'll keep playing around with topk attention though, because it bothers me that softmax becomes a bottleneck for the tokens far in the future, especially as sequence lengths go above 8k".
I strongly feel the same way!! Do you mind sharing more thoughts on this? Like any recent cool tricks that work well?
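Just so we're on the same page, here is a minimal sketch of what I mean by topk attention, in plain PyTorch: keep only the k largest logits per query before the softmax and mask everything else out. The function name and defaults are just placeholders, not your implementation.

```python
import torch

def topk_attention(q, k, v, topk=64):
    # q, k, v: (batch, heads, seq_len, dim_head)
    scale = q.shape[-1] ** -0.5
    sim = torch.einsum('b h i d, b h j d -> b h i j', q, k) * scale

    # causal mask so each token only attends to the past
    i, j = sim.shape[-2:]
    causal_mask = torch.ones(i, j, dtype=torch.bool, device=sim.device).triu(j - i + 1)
    sim = sim.masked_fill(causal_mask, float('-inf'))

    # keep only the top-k logits per query, set the rest to -inf
    if topk < j:
        kth_val = sim.topk(topk, dim=-1).values[..., -1:]  # smallest kept logit per query
        sim = sim.masked_fill(sim < kth_val, float('-inf'))

    attn = sim.softmax(dim=-1)
    return torch.einsum('b h i j, b h j d -> b h i d', attn, v)

# usage
q = k = v = torch.randn(1, 8, 1024, 64)
out = topk_attention(q, k, v, topk=32)
```

The part that interests me is exactly the masking step: once sequence length grows past ~8k, the softmax spreads mass over thousands of mostly irrelevant keys, and hard topk seems like the obvious fix, but I haven't seen it clearly win in practice.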
I have been reading all those saliency papers in CV (mostly from before 2020), some robustness papers in CV (more recent), and all those efficient transformer papers for NLP. All of them study how attention contributes or what attention patterns look like. I haven't found anything that actually works evidently better than the n^2 vanilla transformer, though intuitively there is a (high, I believe) chance that some tricks can make it even better than the vanilla one.
Thank you!