pytorch-seq2seq

tut 6 - when slicing the <eos> token off from trg before feeding it into the model

Open JaeyoonChun opened this issue 4 years ago • 4 comments

Hi, I would like to ask about this slicing. According to your explanation, we slice the <eos> token off the end of the sequence, hoping the model will learn to predict it. But this happens after trg has already been padded, so except for the longest sentence in the batch we only remove the last padding token. I checked this with my data, where the batch size was 4, and found that the shorter sentences still had the <eos> token. Could you please tell me how to correctly remove the <eos> token? Thanks.

[screenshot: KakaoTalk_20200825_180920845 — top: trg / bottom: trg[:, :-1]]

JaeyoonChun avatar Aug 26 '20 14:08 JaeyoonChun

Not the author, but if this occurs after padding, and you know for a fact that your data contains sequences of multiple lengths and that the <eos> token appears exactly once per trg sentence, you could simply iterate over all samples in the batch and replace every 3 (<eos>) token with a 1 (padding token). It won't be pretty and there is probably a better way to do this with PyTorch's built-in functions, but it should work.
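(For illustration only, a minimal sketch of that replacement done with PyTorch's built-in masking instead of an explicit loop, assuming trg is a LongTensor of shape [batch size, trg len] and that <eos> = 3 and <pad> = 1 as above; as the later replies explain, this replacement is not actually needed.)

```python
import torch

EOS_IDX, PAD_IDX = 3, 1  # assumed vocabulary indices for <eos> and <pad>

def replace_eos_with_pad(trg: torch.Tensor) -> torch.Tensor:
    """Return a copy of trg where every <eos> token has been replaced by <pad>."""
    trg = trg.clone()              # don't modify the original batch in place
    trg[trg == EOS_IDX] = PAD_IDX  # vectorised replacement, no per-sample loop
    return trg

# Example (2 = <sos>):
# trg = torch.tensor([[2, 10, 11, 3, 1], [2, 12, 3, 1, 1]])
# replace_eos_with_pad(trg) -> tensor([[2, 10, 11, 1, 1], [2, 12, 1, 1, 1]])
```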

jpanaro avatar Aug 26 '20 21:08 jpanaro

This is not a problem.

Let's say we have the target sequence: ['<sos>', 'a', 'b', 'c', 'd', '<eos>', '<pad>', '<pad>'].

Ideally, we should have no padding and have the target sequence input into the decoder be ['<sos>', 'a', 'b', 'c', 'd'] with desired decoder outputs of ['a', 'b', 'c', 'd', '<eos>'].

However, because of the padding we input a target sequence of ['<sos>', 'a', 'b', 'c', 'd', '<eos>', '<pad>'] and now want a predicted target sequence of ['a', 'b', 'c', 'd', '<eos>', '<pad>', '<pad>'].

This looks bad, right? The tutorials say we shouldn't input the <eos> token, yet here we are. However, when we define the criterion we set ignore_index equal to the padding token's index, which means that whenever the target is a <pad> we ignore the loss at that time-step and pretend it never happened.
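(A minimal sketch of how ignore_index behaves, assuming the <pad> token has index 1 and <eos> has index 3; in the tutorials the actual indices are looked up from the vocabulary.)

```python
import torch
import torch.nn as nn

PAD_IDX = 1  # assumed index of <pad>

criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

# Fake decoder outputs for one sequence: [trg len, output dim] and targets: [trg len]
logits = torch.randn(7, 10)
targets = torch.tensor([4, 5, 6, 7, 3, 1, 1])  # 3 = <eos>, trailing 1s = <pad>

# Time-steps whose target is <pad> contribute nothing to the loss,
# so the two trailing time-steps are effectively ignored.
loss = criterion(logits, targets)
```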

The last two tokens of our desired predicted target sequence are <pad> tokens, which means we can ignore the last two tokens input into the decoder, so the input sequence we actually train our model on is: ['<sos>', 'a', 'b', 'c', 'd']. This means we have trained our model exactly as if we didn't have the <pad> tokens at all! The only difference is that we do two extra time-steps of computation and throw away the result. This seems wasteful, but the alternative is to do awkward batching where you ensure every sequence in a batch has the same length, which gives you wildly different batch sizes during the epoch, and I don't think that is good for training.
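(Putting it together, a sketch of the slicing being discussed, paraphrasing the pattern used in the training loop of the tutorial; model, src, trg and criterion are placeholders here and the exact details may differ from the notebook.)

```python
# trg: [batch size, trg len], e.g. ['<sos>', 'a', 'b', 'c', 'd', '<eos>', '<pad>', '<pad>']
output = model(src, trg[:, :-1])  # decoder input: everything except the last token

output_dim = output.shape[-1]
output = output.contiguous().view(-1, output_dim)
trg_y = trg[:, 1:].contiguous().view(-1)  # desired outputs: everything except <sos>

# With ignore_index=PAD_IDX, the loss at the trailing <pad> positions is ignored,
# so the extra time-steps change nothing except a little wasted computation.
loss = criterion(output, trg_y)
```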

@jpanaro

> Not the author, but if this occurs after padding, and you know for a fact that your data contains sequences of multiple lengths and that the <eos> token appears exactly once per trg sentence, you could simply iterate over all samples in the batch and replace every 3 (<eos>) token with a 1 (padding token). It won't be pretty and there is probably a better way to do this with PyTorch's built-in functions, but it should work.

This is not a good idea. Because of the ignore_index mentioned above, if you replace the <eos> token with a <pad> token then you will not be updating your model to learn the last non-<eos>, non-<pad> token within a sequence. The model trains fine as it is.

bentrevett avatar Aug 27 '20 16:08 bentrevett

@bentrevett ~~Ah, sorry, I should have clarified. The replacement would only occur prior to inputting the trg sequence to the model. When it comes time to calculate the loss I would use the original trg sequence (with no replacements) with only the BOS token removed, as specified in the notebook.~~

EDIT: I dug into the notebook a little more and I see where my lapse of judgment was. Unfortunately, the dataloader for my project pads all sequences to a length of 30 regardless of their original length. This caused issues during training: cutting off all <eos> tokens and reducing the length by one resulted in very poor training performance, while keeping the logic from the original notebook only cut off a padding token 99% of the time, since most sequences were not of max length.

Sorry for the misunderstanding; I did not realize you only cut off the <eos> token for the longest sequence in the batch, not for all sequences. Is there a specific reason it is only done for the longest sequence?

jpanaro avatar Aug 27 '20 16:08 jpanaro

Thanks for your explanation! I understand how it works now. I will read the PyTorch docs on the criterion :)

JaeyoonChun avatar Aug 28 '20 12:08 JaeyoonChun