multigen: Hello, I have a question that is not about this model; it is the attention mask question you once asked in google-bert.

Open Yesgo1220 opened this issue 3 years ago • 1 comments

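In the BERT source code, the function create_attention_mask_from_input_mask carries this comment: "We don't assume that from_tensor is a mask (although it could be). We don't actually care if we attend from padding tokens (only to padding) tokens so we create a tensor of all ones." This means the padding positions on the Query side also receive meaningless attention scores. Are they handled somewhere later on? This has puzzled me for a long time. Thanks.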

Apr 15 '21 03:04 Yesgo1220
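
For readers following the question, here is a simplified single-sequence sketch in plain Python of the mask shape that comment describes. It is not BERT's TensorFlow code: the function name `create_attention_mask` and the example `input_mask` are hypothetical, and the batch dimension is omitted. It only illustrates that the key (to) side is masked while the query (from) side is left as all ones.

```python
def create_attention_mask(input_mask):
    """Illustrative only: build a [from_len, to_len] mask for one sequence.

    Every query row, including rows that belong to padding tokens,
    receives the same key-side mask, so padding is hidden as a key
    but not as a query.
    """
    seq_len = len(input_mask)
    # Each row is a copy of the key-side mask; the query side is all ones.
    return [list(input_mask) for _ in range(seq_len)]


input_mask = [1, 1, 1, 0]  # three real tokens followed by one padding token
mask = create_attention_mask(input_mask)
for row in mask:
    print(row)
```

The last row belongs to the padding query. It still attends to the three real tokens, which is the behaviour the question asks about.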

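Unlike BERT, GPT is a decoder, so its attention mask is a lower-triangular matrix rather than all ones. As for the padding at the end of the sequence, no loss is applied at the corresponding output positions, so it does not affect the meaningful tokens that come before it.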

Apr 23 '21 03:04 haozheji
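
The reply above can be illustrated with a minimal plain-Python sketch. It is not taken from the multigen code: the helper names `causal_mask` and `masked_mean_loss` and the per-token loss values are assumptions made for the example. It shows a lower-triangular mask and a loss that skips padded positions.

```python
def causal_mask(seq_len):
    # Row i (query) may attend only to columns 0..i (keys).
    return [[1 if j <= i else 0 for j in range(seq_len)]
            for i in range(seq_len)]


def masked_mean_loss(token_losses, loss_mask):
    # Average the per-token losses over positions where loss_mask is 1.
    total = sum(l * m for l, m in zip(token_losses, loss_mask))
    count = sum(loss_mask)
    return total / count if count else 0.0


mask = causal_mask(4)
token_losses = [0.5, 1.5, 1.0, 9.0]  # hypothetical values; last position is padding
loss_mask = [1, 1, 1, 0]             # no loss on the trailing padding position
loss = masked_mean_loss(token_losses, loss_mask)
```

With padding at the end, rows 0 to 2 of `mask` have a zero in the last column, so the real tokens never attend to the padding. The padding row does produce an output, but `loss_mask` excludes it from the loss.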
