pytorch-seq2seq

Masked attention

Open lethienhoa opened this issue 6 years ago • 4 comments

Hi, I see that this implementation is lacking masked attention over the encoder outputs. input_lengths should be passed to the decoder (not just the encoder) in order to compute this. OpenNMT already provides this in its sequence_mask function. Best,

lethienhoa avatar May 09 '18 19:05 lethienhoa
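
For reference, a minimal PyTorch sketch of a sequence_mask-style helper (an illustration of the idea, not OpenNMT's actual implementation): it turns per-example lengths into a boolean mask over encoder positions.

```python
import torch

def sequence_mask(lengths, max_len=None):
    """Build a boolean mask of shape (batch, max_len) that is True for
    valid positions and False for padding, given per-example lengths."""
    if max_len is None:
        max_len = int(lengths.max())
    # Compare (1, max_len) position indices against (batch, 1) lengths
    positions = torch.arange(max_len, device=lengths.device).unsqueeze(0)
    return positions < lengths.unsqueeze(1)

# Example: lengths 23, 12, 7 -> rows with 23, 12, 7 True entries
mask = sequence_mask(torch.tensor([23, 12, 7]))
print(mask.shape)   # torch.Size([3, 23])
print(mask.sum(1))  # tensor([23, 12,  7])
```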

@lethienhoa why do you need masked attention if you mask the loss?

erogol avatar Jul 10 '18 10:07 erogol

I just noticed the same thing and landed here. The attention mechanism should only include in the weighted sum those encoder outputs that correspond to valid tokens in the input sequences. For example, if the input lengths in your batch are 23, 12, and 7, then for the third element in the batch the attention should compute the weighted sum over only its 7 encoder outputs, rather than all 23.

Normally your attention would learn to ignore the extra encoder outputs anyway, but this might pose a problem if you train and test with different maximum sentence sizes.

valtsblukis avatar Jul 10 '18 16:07 valtsblukis
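
A minimal sketch of how such a mask could be applied to the attention scores before the softmax, assuming raw scores of shape (batch, dec_len, enc_len) and the sequence_mask helper sketched above; the masked_attention function here is hypothetical, not this repository's Attention module.

```python
import torch
import torch.nn.functional as F

def masked_attention(scores, lengths):
    """Prevent attention to padded encoder positions.

    scores:  (batch, dec_len, enc_len) raw attention scores
    lengths: (batch,) valid lengths of the encoder inputs
    """
    mask = sequence_mask(lengths, max_len=scores.size(-1))  # (batch, enc_len)
    # Set padded positions to -inf so softmax assigns them zero weight
    scores = scores.masked_fill(~mask.unsqueeze(1), float('-inf'))
    return F.softmax(scores, dim=-1)

# Example: batch of 3, decoder length 5, padded encoder length 23
scores = torch.randn(3, 5, 23)
attn = masked_attention(scores, torch.tensor([23, 12, 7]))
print(attn[2, :, 7:].sum())  # 0: no weight on padded positions of the 7-token input
```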

@valtsblukis thanks for explaining it. Yes, that was my understanding too, but I'd also assume the model would learn it anyway. I'm also running an experiment with my model with and without masking to see the difference.

erogol avatar Jul 12 '18 12:07 erogol

@lethienhoa I'll see to it. Thanks for pointing this out.

pskrunner14 avatar Sep 01 '18 11:09 pskrunner14