
question of 'context vector' in seq2seq-translation/seq2seq-translation-batched.ipynb

Open seabay opened this issue 7 years ago • 7 comments

Hi all, I have some confusion about this:

decoder_hidden = encoder_hidden[:decoder_test.n_layers] # Use last (forward) hidden state from encoder,

Should this be

decoder_hidden = encoder_hidden[decoder_test.n_layers:] instead? That slice would be the hidden state of the second layer.
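(For concreteness, a minimal sketch of what encoder_hidden looks like for a 2-layer bidirectional GRU and what the two slices select; the sizes and variable names here are made up, not the notebook's:)

```python
import torch
import torch.nn as nn

n_layers, hidden_size, batch_size = 2, 16, 4  # illustrative sizes

encoder = nn.GRU(10, hidden_size, num_layers=n_layers, bidirectional=True)
_, encoder_hidden = encoder(torch.randn(7, batch_size, 10))

# encoder_hidden has num_layers * num_directions = 2 * 2 = 4 rows:
print(encoder_hidden.shape)              # torch.Size([4, 4, 16])

first_half = encoder_hidden[:n_layers]   # rows 0 and 1
second_half = encoder_hidden[n_layers:]  # rows 2 and 3
# The question is which layer and which direction each of those rows holds.
```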

seabay avatar Dec 20 '17 21:12 seabay

Hi @seabay

You might be misunderstanding what the hidden states of the 1st layer and the 2nd layer are.

With encoder_hidden[:decoder_test.n_layers] we extract the normal-time-order hidden state (--->), while encoder_hidden[decoder_test.n_layers:] gives us the reverse-time-order hidden state (<---).

In my opinion it might not really matter which one you use, though it is more common to use the normal-time-order hidden state of a Bi-RNN.

Hope it helps.

iamkissg avatar Jan 05 '18 17:01 iamkissg

Hi @Engine-Treasure, I think the number of layers has nothing to do with bidirectionality. For example, if the encoder is a 2-layer Bi-RNN, the hidden state has 2 * 2 = 4 parts: the first two are the forward and backward states of layer 1, and the last two are those of layer 2.

So the question is: do we use the hidden state of layer 1 or of layer 2?
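(For reference, a minimal sketch of that splitting using the view described in the PyTorch RNN docs; sizes are made up:)

```python
import torch
import torch.nn as nn

n_layers, hidden_size, batch_size = 2, 16, 4  # illustrative sizes

encoder = nn.GRU(10, hidden_size, num_layers=n_layers, bidirectional=True)
_, h_n = encoder(torch.randn(7, batch_size, 10))  # h_n: (n_layers * 2, batch, hidden)

# Per the docs, the layer and direction axes can be separated like this:
separated = h_n.view(n_layers, 2, batch_size, hidden_size)
# separated[0, 0] / separated[0, 1] -> forward / backward state of the first layer
# separated[1, 0] / separated[1, 1] -> forward / backward state of the second layer
```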

seabay avatar Jan 05 '18 18:01 seabay

Hi @seabay

So sorry, I mixed up num_layers and hidden_size.

Then another question comes up: do the forward and backward hidden states alternate within each layer, or do all the forward hidden states come first?

[
  layer0_forward,
  layer0_backward,
  layer1_forward,
  layer1_backward,
]

or

[
  layer0_forward,
  layer1_forward,
  layer0_backward,
  layer1_backward,
]

You can find some answers here

@spro's answer there is that they alternate within layers. However, the code we're talking about doesn't seem to match that answer.

I just got more confused :(

iamkissg avatar Jan 06 '18 05:01 iamkissg

Hi @Engine-Treasure, based on my experiments the code matches the first layout, the one that alternates within layers. But then why does @spro choose the first layer as the context vector for the decoder?
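(A small check along the lines of those experiments, with made-up sizes: the top layer's final forward state must equal the forward half of the last output step, and its final backward state the backward half of the first output step, so which rows of h_n they match reveals the layout.)

```python
import torch
import torch.nn as nn

n_layers, hidden_size = 2, 8  # illustrative sizes
encoder = nn.GRU(4, hidden_size, num_layers=n_layers, bidirectional=True)

x = torch.randn(5, 1, 4)      # (seq_len, batch, input_size)
outputs, h_n = encoder(x)     # h_n: (n_layers * 2, batch, hidden)

print(torch.allclose(h_n[2], outputs[-1, :, :hidden_size]))  # True -> row 2 is layer 1, forward
print(torch.allclose(h_n[3], outputs[0, :, hidden_size:]))   # True -> row 3 is layer 1, backward
# Hence the flat layout is [layer0_fwd, layer0_bwd, layer1_fwd, layer1_bwd].
```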

seabay avatar Jan 06 '18 15:01 seabay

This won't be a very satisfying answer, but I believe the reason is just that this is left over from a non-bidirectional encoder, and this slicing was a workaround to make it fit the decoder. The batched version is still very much a work in progress (despite the lack of recent progress).

Two better solutions would be:

  • Doubling the size of the decoder's hidden units so it can accept the whole set of encoder hidden states.
  • Summing the forward and backward encoder hidden states before feeding them to the decoder (similar to what is done with the encoder outputs as the last step of the Encoder model); see the sketch after this list.
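(A rough sketch of the second option under the layer-major layout discussed above; names and sizes are illustrative, not the notebook's actual code:)

```python
import torch
import torch.nn as nn

n_layers, hidden_size, batch_size = 2, 16, 4  # illustrative sizes

encoder = nn.GRU(10, hidden_size, num_layers=n_layers, bidirectional=True)
decoder_rnn = nn.GRU(10, hidden_size, num_layers=n_layers)  # unidirectional decoder

src = torch.randn(7, batch_size, 10)
encoder_outputs, encoder_hidden = encoder(src)  # encoder_hidden: (n_layers * 2, batch, hidden)

# Sum the forward and backward final hidden states of each layer,
# giving an (n_layers, batch, hidden) tensor the decoder can start from.
encoder_hidden = encoder_hidden.view(n_layers, 2, batch_size, hidden_size)
decoder_hidden = encoder_hidden[:, 0] + encoder_hidden[:, 1]

decoder_input = torch.randn(1, batch_size, 10)  # one decoder input step
decoder_output, decoder_hidden = decoder_rnn(decoder_input, decoder_hidden)
```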

spro avatar Jan 07 '18 19:01 spro

I have the same question, and intuitively I agree with pattern 1. But through my experiments I found the following:

https://discuss.pytorch.org/t/gru-output-and-h-n-relationship/12720

zhongpeixiang avatar Jan 24 '18 07:01 zhongpeixiang

@spro I think summing the forward and backward hidden states at the last position of the encoder would not be a good idea, because the backward hidden state at the last position contains little information about the sentence.

zhongpeixiang avatar Jan 24 '18 07:01 zhongpeixiang