Why is the attention mask of `from_tensor` not used?
https://github.com/google-research/bert/blob/ffbda2a1aafe530525212d13194cc84d92ed0313/modeling.py#L524
In this function, the comment says:

> We don't assume that `from_tensor` is a mask (although it could be). We don't actually care if we attend *from* padding tokens (only *to* padding tokens), so we create a tensor of all ones.
I don't quite get the idea. The final attention output will contain non-zero values for the padding positions of the query. That is, padding tokens in the query sequence also receive attention outputs, which does not seem to make sense. Is there any postprocessing that ignores them?
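For reference, here is a minimal NumPy sketch (shapes simplified, function names mine) of what the linked `create_attention_mask_from_input_mask` and the additive `-10000.0` bias inside `attention_layer` effectively do. The mask is built only from the *to* (key) side; every query row, including padding rows, gets the same key-side mask:

```python
import numpy as np

def create_attention_mask(from_seq_len, to_mask):
    # to_mask: [batch, to_seq_len], 1.0 for real tokens, 0.0 for padding.
    # Broadcast to [batch, from_seq_len, to_seq_len]: the from/query side is
    # implicitly all ones, exactly as the code comment says.
    batch, to_seq_len = to_mask.shape
    return np.broadcast_to(to_mask[:, None, :].astype(np.float32),
                           (batch, from_seq_len, to_seq_len))

def attention_probs(q, k, attention_mask):
    # q: [batch, from_seq, d], k: [batch, to_seq, d]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    # Additive mask, mirroring `adder = (1.0 - attention_mask) * -10000.0`
    # in modeling.py: padded *keys* get a huge negative bias, padded *queries*
    # are left alone.
    scores = scores + (1.0 - attention_mask) * -10000.0
    # Softmax over the key axis; padded keys end up with ~0 probability,
    # while rows belonging to padded queries are still computed normally.
    scores -= scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    return probs / probs.sum(axis=-1, keepdims=True)

# Tiny check: 1 sequence of 4 tokens, last two are padding.
to_mask = np.array([[1, 1, 0, 0]], dtype=np.float32)
rng = np.random.default_rng(0)
q, k = rng.normal(size=(1, 4, 8)), rng.normal(size=(1, 4, 8))
probs = attention_probs(q, k, create_attention_mask(4, to_mask))
# Every row sums to 1 (even the two padding-query rows), but the padded key
# columns 2 and 3 get ~0 probability in every row.
```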
For classification, only the [CLS] embedding is used, so it does not matter whether the padding query positions are masked. My understanding is: if you picture the self-attention layer as a black box whose input and output have the same shape, then the only thing the computation inside the box needs to mask out is the padding tokens on the key/value (KV) side.
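Concretely, a small sketch of why the padding-query rows are never consumed downstream (`masked_token_loss` is a hypothetical helper, not from the repo; BERT's own pooler reads only the first token, and token-level objectives typically weight the loss by the same input mask):

```python
import numpy as np

def pooled_cls_output(sequence_output):
    # sequence_output: [batch, seq_len, hidden]. Classification heads read
    # only the first ([CLS]) row, so values computed at padding-query
    # positions are simply never looked at.
    return sequence_output[:, 0, :]

def masked_token_loss(per_token_loss, input_mask):
    # Hypothetical token-level objective: per_token_loss and input_mask are
    # [batch, seq_len]; losses at padded positions are zeroed by the same
    # mask, so their outputs are inert as well.
    weights = input_mask.astype(np.float32)
    return (per_token_loss * weights).sum() / np.maximum(weights.sum(), 1e-5)
```

Either way, nothing that depends on a padded query row ever reaches the loss, which is why the code only bothers to mask attention *to* padding.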