
How to do backpropagation if I only use a specific hidden state?

Open chingyaoc opened this issue 9 years ago • 5 comments

Normally the output of an LSTM has shape [N, T, H], where N is the batch size, T the sequence length, and H the hidden state size. But I only use a specific hidden state (e.g. the last hidden state, for sentence encoding). I'm wondering how to do backpropagation, since the gradient fed back into the LSTM should have the same shape as its output ([N, T, H]), but I only have [N, H] (or [N, 1, H]) for one hidden state.

chingyaoc avatar Jun 29 '16 07:06 chingyaoc

To backprop through only a single hidden state, you will need to construct a gradient tensor of shape N x T x H that is all zeros except in the slot corresponding to the timestep you want to backprop through.
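A minimal sketch of that gradient tensor in NumPy (the shapes and the timestep index `t` here are illustrative, not from the thread):

```python
import numpy as np

N, T, H = 4, 10, 32   # batch size, sequence length, hidden size (example values)
t = T - 1             # timestep to backprop through, e.g. the last one

# Gradient w.r.t. the single selected hidden state, shape (N, H)
d_selected = np.random.randn(N, H)

# Gradient w.r.t. the full LSTM output: all zeros except at timestep t
grad_output = np.zeros((N, T, H))
grad_output[:, t, :] = d_selected
```

This `grad_output` is what you would pass to the LSTM's backward call in place of a full [N, T, H] gradient.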

Depending on exactly what you want to do, one option is to feed the output of the LSTM layer to an nn.Index module; on the forward pass the Index module will pluck out the desired hidden state, and on the backward pass it will produce a gradient tensor that is mostly zeros, as described above.
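The select-on-forward / scatter-zeros-on-backward behavior of such an Index-style module can be sketched like this (NumPy for illustration only; the real `nn.Index` operates on Torch tensors, and these function names are made up):

```python
import numpy as np

def index_forward(output, t):
    """Forward: pluck the hidden state at timestep t from an (N, T, H) output."""
    return output[:, t, :]                      # shape (N, H)

def index_backward(grad_selected, T, t):
    """Backward: scatter the (N, H) gradient into an all-zero (N, T, H) tensor."""
    N, H = grad_selected.shape
    grad_output = np.zeros((N, T, H))
    grad_output[:, t, :] = grad_selected
    return grad_output

# Tiny usage example
out = np.random.randn(2, 5, 3)                  # pretend LSTM output, N=2, T=5, H=3
h_last = index_forward(out, 4)                  # the last hidden state of each sample
g = index_backward(np.ones_like(h_last), 5, 4)  # zeros everywhere except timestep 4
```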

jcjohnson avatar Jun 30 '16 00:06 jcjohnson

Thanks for the reply. I use an LSTM to encode a sentence, plucking out the hidden state at the position given by the sentence length. The first method you mentioned sounds reasonable; can I use something like a "mask" to achieve the same result?

chingyaoc avatar Jun 30 '16 03:06 chingyaoc

Hi all, I tested masking versus padding the gradient with zeros on a visual question answering model, and found that masking performs much better than zero padding. Is that reasonable?

chingyaoc avatar Jul 05 '16 09:07 chingyaoc

I don't quite understand exactly what you mean by masking vs zero padding; can you explain the difference?

jcjohnson avatar Jul 05 '16 16:07 jcjohnson

Sorry for the late reply. I construct a mask like

0 0 1 ... 0
0 1 0 ... 0
...
0 0 0 ... 1

where each row has exactly one "1". Then I multiply it with the output sequence of the LSTM (using nn.MM) to get the state I want. I was quite surprised to find that the performance differs between masking (where nn backpropagates automatically) and padding the gradient with zeros by hand during backpropagation. Btw, how can I get the hidden states from every layer if I have multiple LSTM layers? Should I construct something like a gModule to collect multiple hidden states?
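For what it's worth, the one-hot mask multiplied in via nn.MM and the hand-built zero-padded gradient should be mathematically equivalent: backpropagating through the matrix product routes the incoming gradient back into an otherwise-zero tensor. A NumPy sketch for a single sample (shapes and the index `t` are illustrative):

```python
import numpy as np

T, H = 10, 32
t = 6                                   # position of the state to select

mask = np.zeros((1, T))
mask[0, t] = 1.0                        # one "1" per row, as in the mask above

output = np.random.randn(T, H)          # LSTM output sequence for one sample
selected = mask @ output                # forward: picks row t, shape (1, H)

grad_selected = np.random.randn(1, H)   # gradient arriving at the selected state
grad_via_mask = mask.T @ grad_selected  # backprop through the matmul

# Hand-built zero-padded gradient, as in the other approach
grad_padded = np.zeros((T, H))
grad_padded[t] = grad_selected[0]

print(np.allclose(grad_via_mask, grad_padded))  # → True
```

Since the two gradients coincide, a large performance gap between the two approaches usually points at an implementation difference (e.g. where exactly the zeros are inserted, or which timestep is selected) rather than at the math.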

chingyaoc avatar Jul 11 '16 13:07 chingyaoc