outperformer
How to properly implement attention mask?
Here again 😄! I'm having a hard time figuring out how to apply the attention mask to fast attention, could you please shed some light on that?
I think I should fill some of the entries of `Q'` and `K'` with 0 according to the `attention_mask`, since `Q' @ K'.T` equals the matrix `A`, but is that correct?
Here's what my code looks like:
```python
query = self.apply_feature_map(query, self.orf)
key = self.apply_feature_map(key, self.orf)
# query/key is now of shape (b * num_attn_heads, L, r)
if attention_mask is not None:
    # transformers convention: attention_mask is 1 for real tokens, 0 for padding,
    # so the boolean mask below is True at padding positions
    attention_mask = attention_mask == 0  # (b, L)
    attention_mask = attention_mask.repeat(1, self.num_attention_heads)  # (b, L * num_attn_heads)
    attention_mask = attention_mask.view(-1, seq_len)[:, :, None]  # (b * num_attn_heads, L, 1)
    # zero out the feature-map rows of padded positions in both Q' and K'
    query.masked_fill_(attention_mask, 0)
    key.masked_fill_(attention_mask, 0)
outputs = (self.fast_attention(query, key, value),)
```
Do you think it's correct?
Hello again :) Sorry I couldn't answer yesterday, I was quite busy between Xmas, work and stuff ^^'
You're right, I hadn't taken the time to integrate what's known as the `attention_mask` in the transformers library, although I don't really like the name since it can be confused with e.g. the masking that occurs in non-MLM transformers. I prefer the PyTorch idea of calling it a `padding_mask`, since that's what it actually is.
In any case you're right: since `A = Q' @ K'.T`, we simply need to nullify the appropriate elements in both matrices to get the same result as with conventional attention. In the transformers library they do it this way in each layer, with an `attention_mask` equal to -10000 for padding and zero otherwise, so that the softmax operation takes care of it directly. In our case, although we should define the mask only once like they do, we'll apply it to `Q'` and `K'` as you inferred.
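To make the equivalence concrete, here is a small self-contained sketch (not the repository code, just an illustration with random nonnegative features standing in for `Q'` and `K'`): zeroing the rows of `K'` that belong to padded keys gives the same normalized attention weights as zeroing the corresponding columns of `A = Q' @ K'.T` directly, which is the Performer-side analogue of the -10000 additive mask before the softmax.

```python
import torch

b, L, r = 2, 5, 16
q_prime = torch.rand(b, L, r)            # stand-in for Q' (nonnegative random features)
k_prime = torch.rand(b, L, r)            # stand-in for K'
pad = torch.tensor([[0, 0, 0, 1, 1],     # illustrative padding mask, True = padded key
                    [0, 0, 0, 0, 1]], dtype=torch.bool)

# Route 1: build A explicitly and zero the columns of the padded keys.
A = q_prime @ k_prime.transpose(1, 2)                 # (b, L, L)
A = A.masked_fill(pad[:, None, :], 0)
w1 = A / A.sum(-1, keepdim=True)

# Route 2: zero the padded rows of K' first, then normalize as usual.
k_masked = k_prime.masked_fill(pad[:, :, None], 0)
A2 = q_prime @ k_masked.transpose(1, 2)
w2 = A2 / A2.sum(-1, keepdim=True)

print(torch.allclose(w1, w2))  # True: padded keys get zero weight either way
```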
Regarding your code, it looks correct at a glance, but I want to check a couple of things to improve it before integration. I'll be working on it this evening after work, so if you need something working right now, go ahead and use your version.
Thanks for the input in any case, I'd completely overlooked this :100: I'll close the issue once it's done. If you have any more stuff you want to talk about, don't hesitate - and thanks for the comment on the blog post, it means a lot :)
Okay, so this may take a bit more time to implement than I thought; I need to think things through regarding integration with the rest of the code, specifically the reversible layers. It's a bit annoying to recreate the mask at each layer, but also annoying to pass multiple args. In any case, I double-checked your code and it's correct, although given the boolean mask of shape `(B, L)` you can use the built-in `repeat_interleave` to get to the proper mask of shape `(B*h, L)`:
```python
mask = mask.repeat_interleave(h, 0)[:, :, None]  # (B, L) -> (B*h, L, 1)
q.masked_fill_(mask, 0)
k.masked_fill_(mask, 0)
```
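Just to double-check that this agrees with your repeat + view formulation, a quick standalone sanity check (made-up shapes):

```python
import torch

B, h, L = 2, 4, 7
mask = torch.rand(B, L) > 0.5                # arbitrary boolean padding mask

m1 = mask.repeat(1, h).view(-1, L)           # your version: tile, then fold into the batch dim
m2 = mask.repeat_interleave(h, 0)            # repeat each batch row h times in place

print(torch.equal(m1, m2))  # True: both yield the (B*h, L) mask in the same order
```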
Cool! Yep, masking with a big negative number works for softmax but not in the case of the Performer. The only potential issue I can see with my code is that `masked_fill_` seems not to work on non-contiguous tensors, so I probably need to apply `.contiguous()` to the tensor before the mask fill?
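Something like this is what I have in mind (just a sketch of the defensive version, assuming the preceding reshapes can hand back non-contiguous views):

```python
# Sketch only: force contiguity before the in-place fill;
# .contiguous() is a no-op if the tensors are already contiguous.
query = query.contiguous()
key = key.contiguous()
query.masked_fill_(attention_mask, 0)
key.masked_fill_(attention_mask, 0)
```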
Regarding `padding_mask`, I'm not sure; sometimes I might want to use only the first token of a word and mask the others, and in that case `attention_mask` sounds like a better name.
Anyway, thanks for your code suggestion! I'm planning to run some experiments to verify the implementation.