
Questions about MonotonicAttention.soft

Open tugstugi opened this issue 7 years ago • 1 comments

Is the returned attention by MonotonicAttention.soft() a probability distribution?

It does not seem to be; the following code:

import torch
from attention import MonotonicAttention

monotonic = MonotonicAttention().cuda()

batch_size = 1
sequence_length = 5
enc_dim, dec_dim = 10, 10
prev_attention = None
for t in range(5):
    encoder_outputs = torch.randn(batch_size, sequence_length, enc_dim).cuda()
    decoder_h = torch.randn(batch_size, dec_dim).cuda()
    attention = monotonic.soft(encoder_outputs, decoder_h, previous_alpha=prev_attention)
    prev_attention = attention
    # probability distribution ?
    print(torch.sum(attention, dim=-1).detach().cpu().numpy())

returns:

[1.]
[0.0550258]
[0.00664481]
[0.00043618]
[4.0174375e-05]

If it were a probability distribution like softmax, every row should sum to 1, shouldn't it? As a consequence, my alignments look like this: alignment_img

So my questions are:

  • Is the returned attention by MonotonicAttention.soft() a probability distribution?
  • If not, is it possible to convert it to one?

tugstugi avatar Aug 26 '18 14:08 tugstugi

Hi, monotonic attention (and MoChA) produces the probability of attending to each of the encoder states or of skipping all of the encoder states. As a result, the sum of the probabilities of attending to the encoder states need not be one; the missing probability mass corresponds to the probability of skipping all of the encoder states. This is discussed in Section 2.3 of the monotonic attention paper.

However, note that α_i may not be a valid probability distribution because ∑_j α_{i,j} ≤ 1. Using α_i as-is, without normalization, effectively associates any additional probability not allocated to memory entries with an additional all-zero memory location.
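To make this concrete, here is a minimal sketch (not part of this repository) of two ways to turn such an α into a valid probability distribution: renormalizing each row, or explicitly appending the leftover mass as an extra "skip" slot, in the spirit of the all-zero memory location described above. The tensor values are made up for illustration.

```python
import torch

# Hypothetical alpha row from monotonic attention; it sums to 0.9, not 1.
alpha = torch.tensor([[0.5, 0.3, 0.1]])

# Option 1: renormalize so each row sums to 1
# (assumes every row has nonzero total mass).
alpha_norm = alpha / alpha.sum(dim=-1, keepdim=True)

# Option 2: keep alpha as-is and route the leftover probability mass
# to an extra slot representing "skipped all encoder states".
residual = (1.0 - alpha.sum(dim=-1, keepdim=True)).clamp(min=0.0)
alpha_ext = torch.cat([alpha, residual], dim=-1)  # each row now sums to 1
```

Option 1 changes the relative weighting interpretation (the skip probability is redistributed), while Option 2 preserves the original semantics at the cost of one extra attention slot.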

craffel avatar Feb 07 '19 00:02 craffel