pytorch-seq2seq
tut 6 - when slicing the <eos> token off from trg before feeding it into the model
Hi, I would like to ask about slicing off the <eos> token. According to your explanation, we slice the last token off the target before feeding it into the model (former: trg, latter: trg[:, :-1]). But when the sequences in a batch are padded, the token sliced off is a <pad> token rather than <eos>, so the <eos> token is still fed into the decoder. Is this a problem?
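(For reference, this is roughly the relevant part of the training step; a minimal sketch with model, criterion, src and trg as in the notebook, ignoring the returned attention.)

```python
# trg = [batch size, trg len], e.g. each row looks like [<sos>, a, b, c, d, <eos>, <pad>, ...]
output, _ = model(src, trg[:, :-1])              # decoder input: drop the final token
output = output.contiguous().view(-1, output.shape[-1])
trg_flat = trg[:, 1:].contiguous().view(-1)      # loss targets: drop the <sos> token
loss = criterion(output, trg_flat)
```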
Not the author, but if this occurs after padding, and you know for a fact that your data has multiple lengths and that the <eos> token only appears once per trg sentence, you could simply iterate over all samples in the batch and replace every 3 (<eos>) token with a 1 (padding token). It won't be pretty and there is probably a better way to do this with PyTorch's built-in functions, but it should work.
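(For what it's worth, the replacement can be done without an explicit Python loop; a sketch, assuming the same made-up indices: <pad> = 1, <sos> = 2, <eos> = 3.)

```python
import torch

EOS_IDX, PAD_IDX = 3, 1

# hypothetical padded batch of target indices
trg = torch.tensor([[2, 4, 5, 6, 7, 3, 1, 1],
                    [2, 8, 9, 3, 1, 1, 1, 1]])

# replace every <eos> with <pad> in one vectorised call
trg_no_eos = trg.masked_fill(trg == EOS_IDX, PAD_IDX)
```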
This is not a problem.
Let's say we have the target sequence: ['<sos>', 'a', 'b', 'c', 'd', '<eos>', '<pad>', '<pad>'].

Ideally, we should have no padding and have the target sequence input into the decoder be ['<sos>', 'a', 'b', 'c', 'd'], with desired decoder outputs of ['a', 'b', 'c', 'd', '<eos>'].

However, because of the padding we input the target sequence ['<sos>', 'a', 'b', 'c', 'd', '<eos>', '<pad>'] and now want the predicted target sequence ['a', 'b', 'c', 'd', '<eos>', '<pad>', '<pad>'].
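Written out at the token level, the slicing from the notebook (trg[:, :-1] for the decoder input, trg[:, 1:] for the loss targets) gives exactly those two sequences:

```python
trg = ['<sos>', 'a', 'b', 'c', 'd', '<eos>', '<pad>', '<pad>']

decoder_input = trg[:-1]  # ['<sos>', 'a', 'b', 'c', 'd', '<eos>', '<pad>']  -- <eos> is still fed in
loss_targets  = trg[1:]   # ['a', 'b', 'c', 'd', '<eos>', '<pad>', '<pad>']
```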
This looks bad, right? Because the tutorials say we shouldn't input the <eos> token, but we are. However, when we define the criterion we set the ignore_index equal to the padding token index. This means whenever the target is a <pad> token, we ignore the loss at that time-step and pretend as if it never happened.

The last two tokens of our desired predicted target sequence are <pad> tokens, which means we can ignore the last two tokens input into the decoder, thus the input sequence we actually train our model on is: ['<sos>', 'a', 'b', 'c', 'd']. This means we have trained our model exactly the same as if we didn't have the <pad> tokens at all! The only difference is that we do two extra time-steps of computation and ignore the result. This seems wasteful, but the alternative is to do weird batching stuff where you ensure every sequence in a batch is the same length, which means you get wildly different batch sizes during the epoch, which I don't think is good for training.
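A quick way to see this is to compare the loss with and without the padded time-steps (a small self-contained sketch; the token indices here are made up):

```python
import torch
import torch.nn as nn

PAD_IDX = 1  # hypothetical padding index
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

logits = torch.randn(7, 10)                    # 7 time-steps, vocabulary of 10
targets = torch.tensor([4, 5, 6, 7, 3, 1, 1])  # a, b, c, d, <eos>, <pad>, <pad>

loss_with_pads = criterion(logits, targets)          # the two <pad> steps are ignored
loss_trimmed   = criterion(logits[:5], targets[:5])  # same sequence with the pads cut off

print(torch.allclose(loss_with_pads, loss_trimmed))  # True
```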
@jpanaro
> Not the author, but if this occurs after padding, and you know for a fact that your data has multiple lengths and that the <eos> token only appears once per trg sentence, you could simply iterate over all samples in the batch and replace every 3 (<eos>) token with a 1 (padding token). It won't be pretty and there is probably a better way to do this with PyTorch's built-in functions, but it should work.
This is not a good idea. Due to the ignore_index mentioned above, if you replace the <eos> token with a pad token, then you will not be updating your model at the time-step following the last non-eos, non-pad token within a sequence, i.e. the model will never learn to predict <eos>. The model trains fine as it is.
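To see why, using the same made-up indices as above: once <eos> is replaced by <pad> in the targets, the time-step that should teach the model to emit <eos> is masked out by ignore_index, so no gradient ever pushes the model towards predicting it:

```python
import torch
import torch.nn as nn

PAD_IDX, EOS_IDX = 1, 3
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

logits = torch.randn(5, 10)
targets  = torch.tensor([4, 5, 6, 7, EOS_IDX])              # final step should predict <eos>
replaced = targets.masked_fill(targets == EOS_IDX, PAD_IDX)

# with the replacement, the final time-step no longer contributes to the loss
print(criterion(logits, targets))   # includes the <eos> step
print(criterion(logits, replaced))  # ignores it entirely
```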
@bentrevett ~~Ah, sorry, I should have clarified. The replacement would only occur prior to inputting the trg sequence into the model. When it comes time to calculate the loss I would use the original trg sequence (with no replacements), with only the BOS token removed as specified in the notebook.~~

EDIT: I dug into the notebook a little more and I see where I made a lapse of judgment. Unfortunately, the dataloader for my project pads all sequences to a length of 30, regardless of their original length. This was causing issues during training: cutting the <eos> token off every sequence (and reducing the length by one) resulted in very poor training performance, while keeping the logic from the original notebook only cut off a padding token 99% of the time, since most sequences were not of max length.

Sorry for the misunderstanding; I did not realize the slice only cuts off the <eos> token for the longest sequence in the batch, not for all sequences. Is there a specific reason it is only done for the longest sequence?
Thanks for your explanation! I understand how it works now. I will read the PyTorch docs about the criterion :)