dilated-attention-pytorch
Q: Attention Calculation
Hi @fkodom,
I really like your implementation, and I wanted to use dilated attention in a vanilla transformer model to see how things work.
Right now, I am facing a problem during the attention calculation, where you use flash attention, because it does not provide a way to pass a padding mask. For the scaled dot product path, I am not sure whether the masks should also be segmented and sparsified. Do you have an idea how to calculate the attention with scaled dot product attention while taking the padding mask into account for the encoder, and the padding and causal masks together for the decoder? I put a rough sketch of what I have in mind below.
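For context, this is a minimal sketch of what I am currently trying (all shapes, `segment_len`, and `dilation` are placeholder values, not taken from your code): build an additive mask for `torch.nn.functional.scaled_dot_product_attention`, combine the padding and causal parts for the decoder, and then gather the mask with the same indices used to segment/dilate the sequence.

```python
import torch
import torch.nn.functional as F

# Placeholder shapes for illustration only.
batch, heads, seq_len, head_dim = 2, 8, 16, 64
segment_len, dilation = 8, 2

q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# key_padding_mask: True where a token is padding (shape: batch x seq_len).
key_padding_mask = torch.zeros(batch, seq_len, dtype=torch.bool)
key_padding_mask[:, -3:] = True  # pretend the last 3 tokens are padding

# Additive attention bias: 0 where attention is allowed, -inf where masked.
# The padding bias broadcasts over the query dimension.
padding_bias = torch.zeros(batch, 1, 1, seq_len)
padding_bias.masked_fill_(key_padding_mask[:, None, None, :], float("-inf"))

# Causal bias: -inf strictly above the diagonal.
causal_bias = torch.full((seq_len, seq_len), float("-inf")).triu(diagonal=1)

# Encoder: padding only.  Decoder: padding + causal, combined by addition
# (broadcasts to batch x 1 x seq_len x seq_len).
encoder_mask = padding_bias
decoder_mask = padding_bias + causal_bias

out = F.scaled_dot_product_attention(q, k, v, attn_mask=decoder_mask)

# My guess: if the queries/keys of one segment are gathered with an index
# like this, the mask would need the same gather on both attention axes.
idx = torch.arange(0, segment_len * dilation, dilation)
seg_mask = decoder_mask[..., idx, :][..., :, idx]
```

Is that roughly the right direction, or does the mask need to be handled differently inside the dilated/segmented attention?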
Thanks for the help!