attention-is-all-you-need-pytorch
Why is non_pad_mask needed?
Hello, and thanks for your code. I have two questions:

- Why is non_pad_mask needed? Take the encoder as an example: because of the attention mask, every attention step already masks out the padding positions (their attention weights are 0). So why does the output still need to be processed with non_pad_mask after attention? Even considering the linear layer and LayerNorm, it should not be necessary, since both operate on each word position independently, so padding cannot affect the real words. It seems enough to apply non_pad_mask once at the final prediction. (A sketch of the two masks follows this question.)
- Why is the output multiplied by a scaling factor when the embedding is shared? That is: `seq_logit = self.tgt_word_prj(dec_output) * self.x_logit_scale`
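A minimal sketch of the two masks being discussed, in plain PyTorch (the helper names and `PAD = 0` are chosen for illustration; they loosely follow this repo's older helpers but are not guaranteed to match its exact API):

```python
import torch

PAD = 0  # assumed index of <pad>

def get_attn_key_pad_mask(seq_k, seq_q):
    """Mask for the attention weights: True where the key position is <pad>.
    Returned shape: (batch, len_q, len_k); applied as masked_fill(-inf) before softmax."""
    len_q = seq_q.size(1)
    padding_mask = seq_k.eq(PAD)                       # (batch, len_k)
    return padding_mask.unsqueeze(1).expand(-1, len_q, -1)

def get_non_pad_mask(seq):
    """Mask for the layer output: 1.0 at real tokens, 0.0 at <pad>.
    Returned shape: (batch, len, 1); multiplied elementwise onto the layer output."""
    return seq.ne(PAD).type(torch.float).unsqueeze(-1)

# toy batch: "I love you . <eos>" followed by four <pad> tokens
seq = torch.tensor([[5, 6, 7, 8, 2, 0, 0, 0, 0]])
attn_mask = get_attn_key_pad_mask(seq, seq)   # (1, 9, 9)
non_pad_mask = get_non_pad_mask(seq)          # (1, 9, 1)
```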
- Question 1: non_pad_mask matters when counting how many words were predicted correctly, so that an output `<pad>` is not counted as a correct word. For example, with trg: `I love you . <eos> <pad> <pad> <pad> <pad>` and pred: `I love you . <eos> <pad> <pad> <pad> <pad>`, the number of correct words is 5, not 9.
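A hedged sketch of that counting step: accuracy is evaluated only over non-pad gold positions, so a predicted `<pad>` at a padded position can never be counted as correct (the `PAD = 0` index and tensor shapes are assumptions for illustration):

```python
import torch

PAD = 0  # assumed index of <pad>

def count_correct(pred_logits, gold):
    """pred_logits: (batch * len, vocab); gold: (batch * len,).
    Count matches only where the gold token is a real word, not <pad>."""
    pred = pred_logits.argmax(dim=-1)
    non_pad_mask = gold.ne(PAD)
    n_correct = pred.eq(gold).masked_select(non_pad_mask).sum().item()
    n_word = non_pad_mask.sum().item()
    return n_correct, n_word   # for the example above: 5 correct out of 5 real words, not 9 of 9
```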
- Question 2: this is a small trick explicitly mentioned in the paper; it is generally understood as a way to weaken the relative influence of the positional encoding.
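And a minimal sketch of the weight sharing plus scaling the second answer refers to, assuming `x_logit_scale = d_model ** -0.5` when the target embedding and output projection share weights (the sizes and tensors below are illustrative, not this repo's exact configuration):

```python
import torch
import torch.nn as nn

d_model, n_tgt_vocab = 512, 32000   # illustrative sizes

tgt_word_emb = nn.Embedding(n_tgt_vocab, d_model, padding_idx=0)
tgt_word_prj = nn.Linear(d_model, n_tgt_vocab, bias=False)

# Share the weight matrix between the target word embedding and the final
# logit projection, then rescale the logits to compensate.
tgt_word_prj.weight = tgt_word_emb.weight
x_logit_scale = d_model ** -0.5

dec_output = torch.randn(2, 9, d_model)            # fake decoder output: (batch, len, d_model)
seq_logit = tgt_word_prj(dec_output) * x_logit_scale
```

The related detail in the paper itself is that the shared embedding weights are multiplied by sqrt(d_model) in the embedding layers, which is where this scaling trick comes from.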