witwicky
Fix bug (?) in length penalty computation
I think there was an off-by-one error in the length computation. Assuming that the length penalty is computed for real words only (not BOS or EOS), the length of a partial translation should be time_step+1 if the current output word is a real word and time_step if the current output word is EOS.
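For concreteness, here's a minimal sketch of the rule I'm proposing (function and variable names are mine, not the repo's; time_step is assumed to be 0-based as in the decoder loop):

```python
def proposed_length(time_step, output_word, eos_id):
    """Length of a partial hypothesis, counting real words only (sketch).

    If the word produced at this (0-based) step is a real word, the
    hypothesis now contains time_step + 1 real words; if it is EOS,
    nothing is added, so the length stays at time_step.
    """
    return time_step if output_word == eos_id else time_step + 1
```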
Yes, I used time_step + 1 https://github.com/tnq177/witwicky/commit/527d4dba97e685a4a0ae7ec81aeb8e52448d5448#diff-dd753ec970ab85b1299cb6413899b16cR238
Oh I see. But still, shouldn't it depend on whether the current output word is EOS?
What do you mean? We have to apply the length penalty before calculating the top output words, right?
If you mean the previous step's output word is EOS, then for any finished hypothesis, we keep its finished score and use that for this step's ranking https://github.com/tnq177/witwicky/commit/527d4dba97e685a4a0ae7ec81aeb8e52448d5448#diff-dd753ec970ab85b1299cb6413899b16cR248
beam_scores[i][j] contains the score resulting from extending hypothesis i with word j. If j is EOS, then the length of the hypothesis is time_step, so that's the length that should be used for computing beam_scores[i][j]. But if j is not EOS, then the length of the new hypothesis is time_step+1, so that length should be used for computing beam_scores[i][j].
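In tensor form, something like the sketch below (the names are made up, and the GNMT-style penalty ((5 + len) / 6) ** alpha is only an assumption for illustration, not necessarily what the repo uses):

```python
import torch

def penalized_beam_scores(beam_scores, time_step, eos_id, alpha):
    """Sketch: length-normalize beam_scores[i][j] by the length of the
    hypothesis obtained by extending hypothesis i with word j.

    beam_scores: [beam_size, vocab_size] summed log-probs.
    Extending with EOS keeps length time_step; any other word j gives
    length time_step + 1.
    """
    vocab_size = beam_scores.size(1)
    lengths = beam_scores.new_full((vocab_size,), float(time_step + 1))
    lengths[eos_id] = float(time_step)
    penalty = ((5.0 + lengths) / 6.0) ** alpha
    return beam_scores / penalty  # penalty broadcasts over the beam dimension
```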
I think EOS is part of the hypothesis so time_step+1 applies to EOS too.
OK, that makes sense. I think there's a slight disadvantage this way -- imagine that the length penalty is very large (i.e., strongly prefers short sentences), and there are two hypotheses "the cat sat on the mat today" and "the cat sat on the mat EOS". The latter should be preferred, but if EOS is part of the hypothesis length, then both hypotheses have the same length penalty. I'm sure this makes very little practical difference, but for random sampling I think it might. Would you be very opposed to excluding EOS from the hypothesis length?
But that's not the way the model is trained, is it? A complete sentence, the way we train it, must end with an EOS?
I mean that although "the cat sat on the mat today" is still an incomplete hypothesis, we know it must have at least 8 words including EOS. With a very large length penalty, then, we know that "the cat sat on the mat today", even though we don't know the rest of it, should be worse than "the cat sat on the mat EOS" which has 7 words.
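To make the counting concrete (just a toy illustration, not code from the repo):

```python
# Token counts seen by the length penalty at the step where both candidate
# hypotheses have produced 7 output tokens.
unfinished = "the cat sat on the mat today".split()       # 7 real words, no EOS yet
finished = "the cat sat on the mat".split() + ["<eos>"]   # 6 real words + EOS

# Counting EOS as part of the hypothesis (the current behaviour):
print(len(unfinished), len(finished))        # 7 7 -> identical length penalty
# Excluding EOS from the length (what I'm proposing):
print(len(unfinished), len(finished) - 1)    # 7 6 -> the finished hypothesis counts as
                                             # shorter, so a penalty favouring short
                                             # outputs prefers it
```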
Oh I see your point. But then what do we do about the first step and the second step? Both of them have length 1?
Also, I think it's only fair that too strong a length penalty will hurt short sentences... it's just a trade-off. I don't think practically it'll make much change.
> But then what do we do about the first step and the second step? Both of them have length 1?
The first step and second step shouldn't be any different...if the first step is a real word, it has length 1, else 0, and if the second step is a real word, it has length 2, else 1.
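A quick check of what I mean (hypothetical eos_id, 0-based steps):

```python
eos_id = 2        # hypothetical EOS index
real_word = 17    # any non-EOS word

def length(time_step, word):
    return time_step if word == eos_id else time_step + 1

print(length(0, real_word), length(0, eos_id))  # 1 0  (first step)
print(length(1, real_word), length(1, eos_id))  # 2 1  (second step)
```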
> I think it's only fair that too strong a length penalty will hurt short sentences
I think that's exactly what I'm saying. If the length penalty is strong, we want short sentences to be disfavored. In the above example, under greedy decoding, it's currently possible that the shorter translation will get pruned, but under the proposed change, the longer translation will get pruned.
> I don't think practically it'll make much change.
I don't think so either (on the included en2vi dataset, it helps BLEU by 0.01%). But in random sampling, which at times behaves somewhat like greedy decoding, it might make more of a difference.
Then I think it should be included as a strategy for random sampling. For greedy/beam search, I think the current approach is still the right way. Coming back to your example: while "the cat sat on the mat today" and "the cat sat on the mat EOS" have the same length penalty, I think the latter still has a higher probability (EOS comes to it first), so the former can't win.