MoChA-pytorch
PyTorch Implementation of "Monotonic Chunkwise Attention" (ICLR 2018)
Excuse me, are there any trained weights or training code available?
I tried this MonotonicAttention in my seq2seq model, which works well with vanilla attention, but after training for a while it still ran into the NaN-gradient issue. I checked the...
I think `energy = self.tanh(self.W(encoder_outputs) + self.V(decoder_h).repeat(sequence_length, 1) + self.b)` should be written as `energy = self.tanh(self.W(encoder_outputs) + self.V(decoder_h).repeat(1,sequence_length).reshape(batch_size*sequence_length,-1) + self.b)`
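A small sketch of why this matters (hypothetical shapes, not taken from the repo): on a `(batch, dim)` tensor, `.repeat(sequence_length, 1)` tiles the whole batch, interleaving rows from different batch items, while `.repeat(1, sequence_length).reshape(...)` keeps each batch item's copies contiguous, matching an encoder output flattened batch-major to `(batch*seq, dim)`.

```python
import torch

batch_size, sequence_length, dim = 2, 3, 4
decoder_h = torch.arange(batch_size * dim, dtype=torch.float).reshape(batch_size, dim)

# Original code: tiles the entire batch seq times,
# so rows come out interleaved as b0, b1, b0, b1, ...
wrong = decoder_h.repeat(sequence_length, 1)          # (batch*seq, dim)

# Proposed fix: repeat along the feature dim, then reshape,
# so each batch item's copies are contiguous: b0, b0, b0, b1, b1, b1.
right = (decoder_h.repeat(1, sequence_length)          # (batch, dim*seq)
                  .reshape(batch_size * sequence_length, -1))
```

With the original form, row 1 of the repeated tensor belongs to batch item 1 and gets added to batch item 0's encoder state, silently mixing batch elements.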
cumprod in the MoChA paper is defined to be exclusive, while `safe_cumprod` in this repo is not. Shouldn't it be: ```python def safe_cumprod(self, x, exclusive=False): """Numerically stable cumulative product...
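For reference, a minimal standalone sketch of what an exclusive, numerically stable cumprod could look like (this is an assumed implementation, not the repo's code): an exclusive cumprod shifts the result right by one, with `output[i] = x[0] * ... * x[i-1]` and `output[0] = 1`, matching `tf.math.cumprod(..., exclusive=True)` used by the original MoChA implementation.

```python
import torch

def safe_cumprod(x, exclusive=False, eps=1e-10):
    """Numerically stable cumulative product via exp(cumsum(log(x))).

    Assumes x holds probabilities in [0, 1]. With exclusive=True,
    output[i] = x[0] * ... * x[i-1] and output[0] = 1.
    """
    logs = torch.log(torch.clamp(x, min=eps, max=1.0))
    if exclusive:
        # Shift right along the last dim: prepend log(1) = 0, drop the last term.
        logs = torch.cat([torch.zeros_like(logs[..., :1]), logs[..., :-1]], dim=-1)
    return torch.exp(torch.cumsum(logs, dim=-1))
```

The log-space trick avoids underflow when many small probabilities are multiplied; clamping below by `eps` keeps `log` finite and the gradient well-defined, which is also relevant to the NaN-gradient report above.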
Is the attention returned by MonotonicAttention.soft() a probability distribution? It seems not; the following code: ``` from attention import MonotonicAttention monotonic = MonotonicAttention().cuda() batch_size = 1 sequence_length= 5 enc_dim,...
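One way to frame this question: a hypothetical helper (not from the repo) that checks the two defining properties of a distribution over the sequence dimension. Note that, per the monotonic attention formulation, the weights need not sum to exactly 1: probability mass can remain unallocated if the model never "attends", so a sum strictly below 1 is not necessarily a bug.

```python
import torch

def is_prob_dist(alpha, dim=-1, atol=1e-5):
    """Check that alpha is non-negative and sums to 1 along `dim`.

    Hypothetical diagnostic helper; monotonic attention weights may
    legitimately sum to less than 1 (residual mass = "never attend").
    """
    nonneg = bool((alpha >= 0).all())
    target = torch.ones_like(alpha.sum(dim=dim))
    sums_to_one = bool(torch.allclose(alpha.sum(dim=dim), target, atol=atol))
    return nonneg and sums_to_one
```

Running such a check on the output of `soft()` distinguishes "sums to less than 1 by design" from an actual normalization bug.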