flash-attention
I/O Analysis of BlockSparse FlashAttention
Why is there an extra $Nd$ term in the I/O analysis of block-sparse FA (Section D1, page 25)?
The paper says that you have to write the output $O$ back to HBM when $s$ is small. I don't understand: isn't this true for dense FA too? And when $s$ is small, most of the blocks are empty, so there is less information to write. Surely this $Nd$ term would then become insignificant, since the output is already initialized to all zeros?
At the very least, my understanding is that dense FA needs the extra $Nd$ term too, unless it has been omitted because the $N^2 d^2 M^{-1}$ term dominates the I/O complexity.
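To make the comparison concrete, here is a rough numeric sketch of the two bounds as I read them: $N^2 d^2 M^{-1}$ for dense FA and $Nd + N^2 d^2 M^{-1} s$ for block-sparse FA. Constants are dropped, the function names are made up for illustration and are not from the repo, and I am assuming $M$ is the SRAM size in elements with $d \le M \le Nd$. The example values of `N`, `d`, `M` and `s` are arbitrary.

```python
def dense_fa_io(N, d, M):
    """Leading-order HBM accesses for dense FA: N^2 d^2 / M (constants dropped)."""
    return N * N * d * d / M


def block_sparse_fa_io(N, d, M, s):
    """Leading-order HBM accesses for block-sparse FA: N d + s N^2 d^2 / M."""
    return N * d + s * dense_fa_io(N, d, M)


# Arbitrary example values satisfying the assumed d <= M <= N d.
N, d, M = 4096, 64, 100_000
output_term = N * d  # cost of writing O (N x d) back to HBM once

# Dense case: since M <= N d, we get N^2 d^2 / M >= N d,
# so an extra N d term would be absorbed by the leading term.
dense = dense_fa_io(N, d, M)
print("dense term / Nd:", dense / output_term)

# Sparse case: for small s the scaled term can fall below N d,
# so N d is no longer dominated and has to be written explicitly.
for s in (1.0, 0.1, 0.01):
    sparse_part = s * dense
    total = block_sparse_fa_io(N, d, M, s)
    print(f"s={s}: s*N^2d^2/M={sparse_part:.0f}, Nd={output_term}, total={total:.0f}")
```

If this reading is right, the $Nd$ term is present in the dense case too and is simply hidden by the asymptotic notation, whereas for small $s$ it becomes the larger of the two terms.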
@tridao or anyone else, if you could help clear up my confusion, I would greatly appreciate it.