dilated-attention-pytorch
Q: Attention Calculation
Hi @fkodom,
I really like your implementation, and I wanted to use dilated attention in a vanilla transformer model to see how things work.
Right now, I am facing a problem during the attention calculation, where you use flash attention, because it does not provide a way to pass a padding mask. For the scaled dot product path, I am not sure whether the masks should also be segmented and sparsified. Do you have an idea how to calculate the attention with scaled dot product attention while taking the padding mask into account for the encoder, and the padding and causal masks together for the decoder? I put a rough sketch of what I have in mind below.
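For context, this is a minimal sketch of what I am currently trying (all shapes, `segment_len`, and `dilation` are placeholder values, not taken from your code): build an additive mask for `torch.nn.functional.scaled_dot_product_attention`, combine the padding and causal parts for the decoder, and then gather the mask with the same indices used to segment/dilate the sequence.

```python
import torch
import torch.nn.functional as F

# Placeholder shapes for illustration only.
batch, heads, seq_len, head_dim = 2, 8, 16, 64
segment_len, dilation = 8, 2

q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# key_padding_mask: True where a token is padding (shape: batch x seq_len).
key_padding_mask = torch.zeros(batch, seq_len, dtype=torch.bool)
key_padding_mask[:, -3:] = True  # pretend the last 3 tokens are padding

# Additive attention bias: 0 where attention is allowed, -inf where masked.
# The padding bias broadcasts over the query dimension.
padding_bias = torch.zeros(batch, 1, 1, seq_len)
padding_bias.masked_fill_(key_padding_mask[:, None, None, :], float("-inf"))

# Causal bias: -inf strictly above the diagonal.
causal_bias = torch.full((seq_len, seq_len), float("-inf")).triu(diagonal=1)

# Encoder: padding only.  Decoder: padding + causal, combined by addition
# (broadcasts to batch x 1 x seq_len x seq_len).
encoder_mask = padding_bias
decoder_mask = padding_bias + causal_bias

out = F.scaled_dot_product_attention(q, k, v, attn_mask=decoder_mask)

# My guess: if the queries/keys of one segment are gathered with an index
# like this, the mask would need the same gather on both attention axes.
idx = torch.arange(0, segment_len * dilation, dilation)
seg_mask = decoder_mask[..., idx, :][..., :, idx]
```

Is that roughly the right direction, or does the mask need to be handled differently inside the dilated/segmented attention?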
Thanks for the help!