g2-lstm What is B?

I am trying to follow your code but here is where I get lost:

            self.B = input_.data.new(input_.size()).bernoulli_(self.p)
            self.noise = self.U * self.B

What is the purpose of B? To simulate some kind of dropout for the noise? Is it mentioned in the paper somewhere?

Thanks in advance.

source: https://github.com/zhuohan123/g2-lstm/blob/master/language-modeling/g2_lstm.py#L42

Aug 08 '18 13:08 felixhao28

I think his code is totally different from the paper.

Aug 15 '18 00:08 wenhuchen

It is dropout applied to the Gumbel noise. Please check the README for details.
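
For readers following along, here is a rough sketch of what those two lines appear to do. It is not the repository's code: it uses plain Python instead of PyTorch, it assumes `self.U` holds standard Gumbel(0, 1) noise, and the helper names `sample_gumbel` and `masked_noise` are made up for illustration. Since `bernoulli_(p)` sets each entry to 1 with probability p, p here is the probability of keeping the noise, and the two quoted lines show no 1/p rescaling as in standard dropout.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel(0, 1) sample by inverse transform (assumed form of U)."""
    u = rng.random()
    # rng.random() can return 0.0, where log(u) is undefined, so redraw in that case.
    while u == 0.0:
        u = rng.random()
    return -math.log(-math.log(u))


def masked_noise(n, p, rng):
    """Return (noise, mask) where each Gumbel draw is kept with probability p, else zeroed."""
    # Mirrors B = bernoulli_(p): 1.0 with probability p, 0.0 otherwise.
    mask = [1.0 if rng.random() < p else 0.0 for _ in range(n)]
    gumbel = [sample_gumbel(rng) for _ in range(n)]
    # Mirrors noise = U * B: no rescaling is applied to the surviving entries.
    noise = [u * b for u, b in zip(gumbel, mask)]
    return noise, mask
```

Calling `masked_noise(8, 0.5, random.Random(0))` returns a list where roughly half the entries are zero and the rest are raw Gumbel draws.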

Aug 15 '18 02:08 zhuohan123

Thanks. Somehow I missed that part in the README.

In our experiment, we arbitrarily set p=0.5, but the loss stopped decreasing after a few epochs. We then removed self.B entirely and training continued as normal. In the end, the outputs of the LSTM gates were more skewed towards a Bernoulli distribution (0 and 1) than before, but the end-to-end accuracy was just a little lower than with a plain LSTM. So my conclusion is that G2-LSTM is not a universal drop-in improvement for every task. The idea is very profound though.

Mathematically, does it even make sense to apply such dropout to the Gumbel noise? Randomly zeroing the noise for part of the population will just create a mixture of two distributions: a point mass at zero and the original Gumbel.
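
To make the concern concrete, here is a small simulation sketch. It assumes the noise is standard Gumbel(0, 1) and is kept with probability p; the function names are mine, not from the repo. If the masked noise really is a mixture, about 1 - p of the samples should be exactly zero and the mean should shrink to about p times the Gumbel mean (the Euler-Mascheroni constant).

```python
import math
import random

# Mean of a standard Gumbel(0, 1) distribution (Euler-Mascheroni constant).
EULER_GAMMA = 0.5772156649015329


def gumbel(rng):
    """Draw one standard Gumbel(0, 1) sample by inverse transform."""
    u = rng.random()
    # Redraw on 0.0 so that log(u) is always defined.
    while u == 0.0:
        u = rng.random()
    return -math.log(-math.log(u))


def masked_gumbel_stats(n, p, seed=0):
    """Return (fraction of exact zeros, sample mean) for Gumbel noise kept with probability p."""
    rng = random.Random(seed)
    # Each sample is either a Gumbel draw (probability p) or exactly 0.0 (probability 1 - p).
    samples = [gumbel(rng) if rng.random() < p else 0.0 for _ in range(n)]
    zero_frac = sum(1 for s in samples if s == 0.0) / n
    mean = sum(samples) / n
    return zero_frac, mean
```

Running `masked_gumbel_stats(200000, 0.5)` and comparing the two returned values with `0.5` and `0.5 * EULER_GAMMA` is a quick way to check the mixture picture.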

And just out of curiosity, have you tried applying the same trick to GRU gates?

Aug 15 '18 03:08 felixhao28
