MoChA-pytorch
Questions about MonotonicAttention.soft
Is the returned attention by MonotonicAttention.soft() a probability distribution?
It doesn't seem to be; the following code:
import torch
from attention import MonotonicAttention

monotonic = MonotonicAttention().cuda()
batch_size = 1
sequence_length = 5
enc_dim, dec_dim = 10, 10
prev_attention = None
for t in range(5):
    encoder_outputs = torch.randn(batch_size, sequence_length, enc_dim).cuda()
    decoder_h = torch.randn(batch_size, dec_dim).cuda()
    attention = monotonic.soft(encoder_outputs, decoder_h, previous_alpha=prev_attention)
    prev_attention = attention
    # probability distribution?
    print(torch.sum(attention, dim=-1).detach().cpu().numpy())
returns:
[1.]
[0.0550258]
[0.00664481]
[0.00043618]
[4.0174375e-05]
If it were a probability distribution like softmax, every row should sum to 1, right? As a consequence, my alignments look like this image:

So my questions are:
- Is the returned attention by MonotonicAttention.soft() a probability distribution?
- If not, is it possible to convert it to one?
Hi, monotonic attention (and MoChA) produces the probability of attending to each of the encoder states or of skipping all of them. As a result, the sum of the probabilities of attending to the encoder states need not be one; the missing probability mass corresponds to the probability of skipping all the encoder states. This is discussed in section 2.3 of the monotonic attention paper:
However, note that α_i may not be a valid probability distribution because \sum_j α_{i,j} ≤ 1. Using α_i as-is, without normalization, effectively associates any additional probability not allocated to memory entries to an additional all-zero memory location.
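To make this concrete, here is a minimal sketch of the two options. The tensor `alpha` below is a hypothetical stand-in for the output of `MonotonicAttention.soft()` (its entries sum to at most 1); `p_skip` and `with_null` are names introduced here for illustration, not part of the repository's API.

```python
import torch

# Stand-in for soft monotonic attention weights over 3 encoder states
# for a batch of 1; the row sums to 0.9, not 1.
alpha = torch.tensor([[0.5, 0.3, 0.1]])

# The leftover mass is the probability of skipping all encoder states.
p_skip = 1.0 - alpha.sum(dim=-1)

# Option 1: renormalize so each row sums to 1. This yields a valid
# distribution over encoder states, but discards the skip probability
# and so changes the model's semantics.
normalized = alpha / alpha.sum(dim=-1, keepdim=True)
print(normalized.sum(dim=-1))

# Option 2: append an explicit "null" slot carrying the skip mass,
# mirroring the paper's all-zero memory location interpretation.
with_null = torch.cat([alpha, p_skip.unsqueeze(-1)], dim=-1)
print(with_null.sum(dim=-1))
```

Either way each row sums to 1; which one is appropriate depends on whether your downstream use (e.g. alignment plots) should show the skip probability explicitly.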