pytorch-seq2seq
Using trg[:,:-1] during training
Thank you for this awesome repo you have made public. I had one question: during the training loop, you perform the following step
output, _ = model(src, trg[:,:-1])
I was wondering why we are doing the trg[:,:-1] step?

Kind regards,
Wajih
This is because we have a target sequence, trg, of something like [<sos>, A, B, C, <eos>]. We want our decoder to predict what the next item in the target sequence should be, given the previously predicted target tokens. So, we input a sequence of [<sos>, A, B, C] (which is trg[:,:-1]) and want our decoder to predict [A, B, C, <eos>] (which is trg[:,1:]).

Thus, we input trg[:,:-1] and compare the predictions against trg[:,1:] to calculate our losses.

Let me know if this needs clarifying.
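A minimal sketch of that input/target shift, with toy token indices that are just assumptions for illustration (the tutorial builds these from its vocabulary), and with the model/criterion calls only hinted at in comments:

```python
import torch

# Assumed toy vocab indices, for illustration only: <sos>=0, <eos>=1, A=2, B=3, C=4
trg = torch.tensor([[0, 2, 3, 4, 1]])    # [<sos>, A, B, C, <eos>], shape [1, 5]

decoder_input  = trg[:, :-1]             # [[0, 2, 3, 4]] -> [<sos>, A, B, C]
decoder_target = trg[:, 1:]              # [[2, 3, 4, 1]] -> [A, B, C, <eos>]

print(decoder_input, decoder_target)

# In the training loop these pair up roughly as (hypothetical names):
# output, _ = model(src, decoder_input)   # output shape: [batch, trg_len - 1, vocab_size]
# loss = criterion(output.reshape(-1, output.shape[-1]), decoder_target.reshape(-1))
```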
Oh, I understand now. Thanks indeed for the detailed reply.

Wajih
Hi, how does this work when the trg sentence is padded? In that case I imagine the <eos> token would no longer be in the last position, right? Or am I missing something?

EDIT: never mind, I figured it out. In case anyone else is wondering: it works with padded inputs anyway because of ignore_index in the loss function.
Sorry for the late reply. It seems like you've figured it out now, but just in case someone else is reading this, I'll explain.

When we have padding, our trg sequence will be something like [<sos>, A, B, C, <eos>, <pad>, <pad>]. So the sequence input into the decoder is [<sos>, A, B, C, <eos>, <pad>] (trg[:,:-1]) and our decoder will be trying to predict the sequence [A, B, C, <eos>, <pad>, <pad>] (trg[:,1:]).

This means that yes, the <eos> token is input into the model even though it shouldn't be - because why should you predict something after the end of the sequence? - but there is no way to avoid this when padding sequences. However, because we set the ignore_index of our CrossEntropyLoss to the index of the padding token, whenever the decoder's target token is a <pad> token we don't calculate a loss over that position.

So in the above example, we only calculate losses where the decoder's input is [<sos>, A, B, C], because the <eos> and <pad> inputs both have a target token of <pad>. This means we calculate our losses (and thus update our parameters) as if the padding tokens didn't exist. (Sort of - we still waste some computation on them, but this is offset by the fact that we can use batches instead of feeding in examples one at a time or only making batches where every sequence is exactly the same length.)
I have a question. The sentences are padded after <eos>, so they look like:

<sos> y1 y2 <eos> <pad> <pad> <pad>
<sos> y1 y2 y3 y4 y5 <eos>
<sos> y1 y2 y3 y4 <eos> <pad>

The size of trg is [3, 7]. If we take trg[:,:-1], the sentences are cut like:

<sos> y1 y2 <eos> <pad> <pad>
<sos> y1 y2 y3 y4 y5
<sos> y1 y2 y3 y4 <eos>

so it does not cut off all the <eos> tokens.

I checked torchtext: the sentence is concatenated as <sos> sentence <eos> <pad>, so trg[:,:-1] will not cut off all the <eos> tokens. If the sentence were concatenated as <sos> sentence <pad> <eos>, then it would cut off all the <eos> tokens.
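One quick way to see this behaviour is to build that [3, 7] batch directly; the token indices below are made up for illustration:

```python
import torch

# Assumed token indices: <sos>=0, <eos>=1, <pad>=2, y1..y5 = 3..7
trg = torch.tensor([
    [0, 3, 4, 1, 2, 2, 2],   # <sos> y1 y2 <eos> <pad> <pad> <pad>
    [0, 3, 4, 5, 6, 7, 1],   # <sos> y1 y2 y3 y4 y5 <eos>
    [0, 3, 4, 5, 6, 1, 2],   # <sos> y1 y2 y3 y4 <eos> <pad>
])                           # shape [3, 7]

print(trg[:, :-1])  # only the last column is dropped, so <eos> survives in the shorter rows
print(trg[:, 1:])   # the matching targets; the <pad> positions are ignored by the loss
```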
For anyone who finds this in the future: output, _ = model(src, trg[:,:-1]) seems to no longer be there; instead, the decoder loop in the Seq2Seq class only ever feeds in tokens 0 to trg_len - 2. It's currently written as for t in range(1, trg_len):, where input is always token t - 1 at the start of each iteration (it is updated at the end of the loop body). Took me a minute to figure out where the [:,:-1] went.

https://github.com/bentrevett/pytorch-seq2seq/issues/182 # more in-depth explanation of trg[:,:-1] and how it interacts with padding
https://github.com/bentrevett/pytorch-seq2seq/issues/43#issuecomment-554986488 # impact of <sos> and <eos> tokens on src -> the model learns to ignore them
You are correct. Seems to have been updated now.