Why do Key and Query Masking?
Nice job! But the paper doesn't describe key and query masking. Can you give me some hints about that? Thanks!
The encoder leaves artifacts, i.e. non-zero values, at the padding positions. It doesn't make sense for a query to attend to those, so before applying the softmax to get the attention scores, I overwrite them with very large negative numbers. As a result, they end up with a score of 0. Likewise, queries at padding positions should not carry any values, so they are masked with zeros.
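A minimal NumPy sketch of the key/query masking described above, assuming that padded positions are all-zero embedding vectors (the names `queries`, `keys`, `values` are illustrative, not the repo's exact variables):

```python
import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    # queries, keys, values: (batch, seq_len, d_model)
    d_k = keys.shape[-1]
    scores = queries @ keys.transpose(0, 2, 1) / np.sqrt(d_k)   # (batch, q_len, k_len)

    # Key masking: padded key positions (embeddings summing to zero) get a
    # very large negative score so their softmax weight becomes ~0.
    key_mask = np.sign(np.abs(keys).sum(axis=-1))                # (batch, k_len), 1 for real tokens
    scores = np.where(key_mask[:, None, :] == 0, -2.0**32 + 1, scores)

    # Softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    # Query masking: rows belonging to padded queries are zeroed out so they
    # carry no values into the next layer.
    query_mask = np.sign(np.abs(queries).sum(axis=-1))           # (batch, q_len)
    weights *= query_mask[:, :, None]

    return weights @ values                                      # (batch, q_len, d_model)
```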
In Figure 2 of the paper, there is an optional masking layer. It is important to mask if the source contains very different sequence lengths. This project handles masking very thoroughly: in keys, queries, and losses (almost everywhere possible). Great work, thanks!
Whether a position is masked is decided by checking whether the sum of the keys or queries over the last dimension is zero. However, the padding positions, which are originally embedded as zeros, have the positional embedding added to them, so their sums will never be zero. Thanks!
@darongliu You have raised a very important point. We have to create the mask before the positional encoding is added, otherwise the key and query padding masks will never work. Thanks!
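A minimal sketch of the fix discussed here, assuming integer token ids with id 0 used for padding (`token_ids`, `embedding_table`, and `sinusoidal_positional_encoding` are illustrative names, not the repo's exact API):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # Standard sin/cos positional encoding from the paper.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles[:, 0::2])
    enc[:, 1::2] = np.cos(angles[:, 1::2])
    return enc

def encode(token_ids, embedding_table):
    # Build the padding mask from the raw token ids BEFORE the positional
    # encoding is added; afterwards the padded rows are no longer all-zero,
    # so the "sum over the last dimension == 0" check would never fire.
    pad_mask = (token_ids != 0).astype(np.float32)                # (batch, seq_len)

    x = embedding_table[token_ids]                                # (batch, seq_len, d_model)
    x = x + sinusoidal_positional_encoding(x.shape[1], x.shape[2])
    x = x * pad_mask[:, :, None]                                  # re-zero padded positions
    return x, pad_mask
```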
@zhedongzheng Does this mean that this part of the code has a problem?
No, I don't think it will cause a big performance drop. It is still a good implementation.
@darongliu Doesn't that mean that the query mask and key mask are useless?
Is query masking unnecessary, because the padded queries will be masked out by the next block's key masking anyway?
It's not working. Check https://github.com/Kyubyong/transformer/issues/33