outperformer
How to properly implement attention mask?
Here again 😄! I'm having a hard time figuring out how to apply the attention mask to fast attention, could you please shed some light on that?
I think I should fill some of the entries of `Q'` and `K'` with 0 according to the `attention_mask`, since `Q' @ K'.T` equals the matrix `A`, but is that correct?
Here's what my code looks like:
```python
query = self.apply_feature_map(query, self.orf)
key = self.apply_feature_map(key, self.orf)
# query/key is now of shape (b * num_attn_heads, L, r)
if attention_mask is not None:
    # transformers convention: attention_mask is 1 for real tokens, 0 for padding,
    # so the boolean mask below is True at padding positions
    attention_mask = attention_mask == 0  # (b, L)
    attention_mask = attention_mask.repeat(1, self.num_attention_heads)  # (b, L * num_attn_heads)
    attention_mask = attention_mask.view(-1, seq_len)[:, :, None]  # (b * num_attn_heads, L, 1)
    # zero out the feature-map rows of padded positions in both Q' and K'
    query.masked_fill_(attention_mask, 0)
    key.masked_fill_(attention_mask, 0)
outputs = (self.fast_attention(query, key, value),)
```
Do you think it's correct?
Hello again :) Sorry I couldn't answer yesterday, I was quite busy between Xmas, work and stuff ^^'
You're right, I hadn't taken the time to integrate what's known as the `attention_mask` in the transformers library, although I don't really like the name since it can be confused with e.g. the masking that occurs in non-MLM transformers. I prefer the PyTorch idea of calling it a `padding_mask`, since that's what it actually is.
In any case you're right: since `A = Q' @ K'.T`, we simply need to nullify the appropriate elements in both matrices to get the same result as with conventional attention. In the transformers library they do it this way in each layer, with an `attention_mask` equal to -10000 for padding and zero otherwise, so that the softmax operation takes care of it directly. In our case, although we should define the mask only once like they do, we'll apply it to `Q'` and `K'` as you inferred.
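To make the equivalence concrete, here is a small self-contained sketch (not the repository code, just an illustration with random nonnegative features standing in for `Q'` and `K'`): zeroing the rows of `K'` that belong to padded keys gives the same normalized attention weights as zeroing the corresponding columns of `A = Q' @ K'.T` directly, which is the Performer-side analogue of the -10000 additive mask before the softmax.

```python
import torch

b, L, r = 2, 5, 16
q_prime = torch.rand(b, L, r)            # stand-in for Q' (nonnegative random features)
k_prime = torch.rand(b, L, r)            # stand-in for K'
pad = torch.tensor([[0, 0, 0, 1, 1],     # illustrative padding mask, True = padded key
                    [0, 0, 0, 0, 1]], dtype=torch.bool)

# Route 1: build A explicitly and zero the columns of the padded keys.
A = q_prime @ k_prime.transpose(1, 2)                 # (b, L, L)
A = A.masked_fill(pad[:, None, :], 0)
w1 = A / A.sum(-1, keepdim=True)

# Route 2: zero the padded rows of K' first, then normalize as usual.
k_masked = k_prime.masked_fill(pad[:, :, None], 0)
A2 = q_prime @ k_masked.transpose(1, 2)
w2 = A2 / A2.sum(-1, keepdim=True)

print(torch.allclose(w1, w2))  # True: padded keys get zero weight either way
```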
Regarding your code, it looks correct at a glance, but I want to check a couple of things to improve it before integration. I'll be working on it this evening after work, so if you need something working right now, go ahead and use your version.
Thanks for the input in any case, I'd completely overlooked this :100: I'll close the issue once it's done. If you have any more stuff you want to talk about, don't hesitate - and thanks for the comment on the blog post, it means a lot :)
Okay, so this may take a bit more time to implement than I thought; I need to think things through regarding integration with the rest of the code, specifically the reversible layers. It's a bit annoying to recreate the mask at each layer, but also annoying to pass multiple args. In any case, I double-checked your code and it's correct, although given the boolean mask of shape `(B, L)` you can use the built-in `repeat_interleave` to get to the proper mask of shape `(B*h, L)`:
```python
mask = mask.repeat_interleave(h, 0)[:, :, None]  # (B, L) -> (B*h, L, 1)
q.masked_fill_(mask, 0)
k.masked_fill_(mask, 0)
```
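Just to double-check that this agrees with your repeat + view formulation, a quick standalone sanity check (made-up shapes):

```python
import torch

B, h, L = 2, 4, 7
mask = torch.rand(B, L) > 0.5                # arbitrary boolean padding mask

m1 = mask.repeat(1, h).view(-1, L)           # your version: tile, then fold into the batch dim
m2 = mask.repeat_interleave(h, 0)            # repeat each batch row h times in place

print(torch.equal(m1, m2))  # True: both yield the (B*h, L) mask in the same order
```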
Cool! Yep, masking with a big negative number works for softmax but not in the case of the Performer. The only potential issue I can see with my code is that `masked_fill_` seems not to work on non-contiguous tensors, so I probably need to apply `.contiguous()` to the tensor before the mask fill?
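Something like this is what I have in mind (just a sketch of the defensive version, assuming the preceding reshapes can hand back non-contiguous views):

```python
# Sketch only: force contiguity before the in-place fill;
# .contiguous() is a no-op if the tensors are already contiguous.
query = query.contiguous()
key = key.contiguous()
query.masked_fill_(attention_mask, 0)
key.masked_fill_(attention_mask, 0)
```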
Regarding `padding_mask`, I'm not sure; sometimes I might want to use only the first token of a word and mask the others, and in that case `attention_mask` sounds like a better name.
Anyway, thanks for your code suggestion! I'm planning to run some experiments to verify the implementation.