Why is the attention mask of `from_tensor` not used?
https://github.com/google-research/bert/blob/ffbda2a1aafe530525212d13194cc84d92ed0313/modeling.py#L524
In this function, the comment says:

> We don't assume that `from_tensor` is a mask (although it could be). We don't actually care if we attend *from* padding tokens (only *to* padding tokens), so we create a tensor of all ones.
I don't quite get the idea. The final attention output will contain non-zero values for the padding positions of the query. That is, padding tokens in the query sequence also receive attention outputs, which does not seem to make sense. Is there any postprocessing that ignores them?
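For reference, here is a minimal NumPy sketch (shapes simplified, function names mine) of what the linked `create_attention_mask_from_input_mask` and the additive `-10000.0` bias inside `attention_layer` effectively do. The mask is built only from the *to* (key) side; every query row, including padding rows, gets the same key-side mask:

```python
import numpy as np

def create_attention_mask(from_seq_len, to_mask):
    # to_mask: [batch, to_seq_len], 1.0 for real tokens, 0.0 for padding.
    # Broadcast to [batch, from_seq_len, to_seq_len]: the from/query side is
    # implicitly all ones, exactly as the code comment says.
    batch, to_seq_len = to_mask.shape
    return np.broadcast_to(to_mask[:, None, :].astype(np.float32),
                           (batch, from_seq_len, to_seq_len))

def attention_probs(q, k, attention_mask):
    # q: [batch, from_seq, d], k: [batch, to_seq, d]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    # Additive mask, mirroring `adder = (1.0 - attention_mask) * -10000.0`
    # in modeling.py: padded *keys* get a huge negative bias, padded *queries*
    # are left alone.
    scores = scores + (1.0 - attention_mask) * -10000.0
    # Softmax over the key axis; padded keys end up with ~0 probability,
    # while rows belonging to padded queries are still computed normally.
    scores -= scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    return probs / probs.sum(axis=-1, keepdims=True)

# Tiny check: 1 sequence of 4 tokens, last two are padding.
to_mask = np.array([[1, 1, 0, 0]], dtype=np.float32)
rng = np.random.default_rng(0)
q, k = rng.normal(size=(1, 4, 8)), rng.normal(size=(1, 4, 8))
probs = attention_probs(q, k, create_attention_mask(4, to_mask))
# Every row sums to 1 (even the two padding-query rows), but the padded key
# columns 2 and 3 get ~0 probability in every row.
```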
For classification, only the [CLS] embedding is used, so it does not matter whether the padding query positions are masked. My understanding is: if you picture the self-attention layer as a black box whose input and output have the same shape, then the only thing the computation inside the box needs to mask out is the padding tokens on the key/value (KV) side.
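Concretely, a small sketch of why the padding-query rows are never consumed downstream (`masked_token_loss` is a hypothetical helper, not from the repo; BERT's own pooler reads only the first token, and token-level objectives typically weight the loss by the same input mask):

```python
import numpy as np

def pooled_cls_output(sequence_output):
    # sequence_output: [batch, seq_len, hidden]. Classification heads read
    # only the first ([CLS]) row, so values computed at padding-query
    # positions are simply never looked at.
    return sequence_output[:, 0, :]

def masked_token_loss(per_token_loss, input_mask):
    # Hypothetical token-level objective: per_token_loss and input_mask are
    # [batch, seq_len]; losses at padded positions are zeroed by the same
    # mask, so their outputs are inert as well.
    weights = input_mask.astype(np.float32)
    return (per_token_loss * weights).sum() / np.maximum(weights.sum(), 1e-5)
```

Either way, nothing that depends on a padded query row ever reaches the loss, which is why the code only bothers to mask attention *to* padding.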