
Fix bug (?) in length penalty computation

Open · davidweichiang opened this issue 6 years ago · 12 comments

I think there was an off-by-one error in the length computation. Assuming that the length penalty is computed for real words only (not BOS or EOS), the length of a partial translation should be time_step+1 if the current output word is a real word and time_step if the current output word is EOS.
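For concreteness, the rule I have in mind is something like this (just a sketch, not the code in the repo; `eos_id` and `current_word` are placeholder names):

```python
def partial_length(time_step, current_word, eos_id):
    """Length of a partial translation at 0-indexed `time_step`,
    counting real words only (not BOS or EOS)."""
    if current_word == eos_id:
        return time_step      # EOS adds nothing to the length
    return time_step + 1      # a real word was produced at this step
```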

davidweichiang · Jul 03 '19 21:07

Yes, I used time_step + 1 https://github.com/tnq177/witwicky/commit/527d4dba97e685a4a0ae7ec81aeb8e52448d5448#diff-dd753ec970ab85b1299cb6413899b16cR238

tnq177 · Jul 03 '19 21:07

Oh I see. But still, shouldn't it depend on whether the current output word is EOS?

davidweichiang · Jul 03 '19 21:07

What do you mean? We have to apply length penalty before calculating the top output words, right?

If you mean the previous step's output word is EOS, then for any finished hypothesis, we keep its finished score and use that for this step's ranking https://github.com/tnq177/witwicky/commit/527d4dba97e685a4a0ae7ec81aeb8e52448d5448#diff-dd753ec970ab85b1299cb6413899b16cR248

tnq177 · Jul 03 '19 21:07

beam_scores[i][j] contains the score resulting from extending hypothesis i with word j. If j is EOS, then the length of the hypothesis is time_step, so that's the length that should be used for computing beam_scores[i][j]. But if j is not EOS, then the length of the new hypothesis is time_step+1, so that length should be used for computing beam_scores[i][j].
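In tensor terms, roughly (again just a sketch with made-up names and shapes, not the actual witwicky code):

```python
import torch

def candidate_lengths(beam_size, vocab_size, time_step, eos_id):
    # Length of each candidate formed by extending beam item i with word j:
    # time_step + 1 for real words, time_step when j is EOS.
    lengths = torch.full((beam_size, vocab_size), time_step + 1, dtype=torch.float)
    lengths[:, eos_id] = time_step
    return lengths  # same shape as beam_scores, usable in the length penalty
```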

davidweichiang · Jul 04 '19 00:07

I think EOS is part of the hypothesis so time_step+1 applies to EOS too.

tnq177 · Jul 04 '19 01:07

OK, that makes sense. I think there's a slight disadvantage this way -- imagine that the length penalty is very large (i.e., strongly prefers short sentences), and there are two hypotheses "the cat sat on the mat today" and "the cat sat on the mat EOS". The latter should be preferred, but if EOS is part of the hypothesis length, then both hypotheses have the same length penalty. I'm sure this makes very little practical difference, but for random sampling I think it might. Would you be very opposed to excluding EOS from the hypothesis length?
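To spell the example out with numbers (illustrative token lists only):

```python
hyp_a = ["the", "cat", "sat", "on", "the", "mat", "today"]   # unfinished
hyp_b = ["the", "cat", "sat", "on", "the", "mat", "<eos>"]   # finished

def length_with_eos(hyp):
    return len(hyp)

def length_without_eos(hyp):
    return sum(w != "<eos>" for w in hyp)

print(length_with_eos(hyp_a), length_with_eos(hyp_b))        # 7 7 -> identical length penalty
print(length_without_eos(hyp_a), length_without_eos(hyp_b))  # 7 6 -> penalty can prefer hyp_b
```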

davidweichiang · Jul 04 '19 01:07

But that's not the way the model is trained, is it? A complete sentence, the way we train it, must end with an EOS?

tnq177 · Jul 04 '19 01:07

I mean that although "the cat sat on the mat today" is still an incomplete hypothesis, we know it must have at least 8 words including EOS. With a very large length penalty, then, we know that "the cat sat on the mat today", even though we don't know the rest of it, should be worse than "the cat sat on the mat EOS" which has 7 words.

davidweichiang · Jul 04 '19 01:07

Oh I see your point. But then what do we do about the first step and the second step? Both of them have length 1?

tnq177 · Jul 04 '19 01:07

Also, I think it's only fair that too strong a length penalty will hurt short sentences... it's just a trade-off. I don't think it'll make much difference in practice.

tnq177 · Jul 04 '19 02:07

> But then what do we do about the first step and the second step? Both of them have length 1?

The first step and second step shouldn't be any different...if the first step is a real word, it has length 1, else 0, and if the second step is a real word, it has length 2, else 1.
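Spelled out for the first two steps (a toy snippet; the EOS index is arbitrary):

```python
EOS_ID = 2  # hypothetical vocabulary index for EOS

def length_excluding_eos(time_step, word_id):
    # 0-indexed time_step; EOS does not count toward the length
    return time_step if word_id == EOS_ID else time_step + 1

print(length_excluding_eos(0, word_id=7))       # first step, real word  -> 1
print(length_excluding_eos(0, word_id=EOS_ID))  # first step, EOS        -> 0
print(length_excluding_eos(1, word_id=7))       # second step, real word -> 2
print(length_excluding_eos(1, word_id=EOS_ID))  # second step, EOS       -> 1
```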

> I think it's only fair that too strong a length penalty will hurt short sentences

I think that's exactly what I'm saying. If the length penalty is strong, we want short sentences to be disfavored. In the above example, under greedy decoding, it's currently possible that the shorter translation will get pruned, but under the proposed change, the longer translation will get pruned.

> I don't think it'll make much difference in practice.

I don't think so either (on the included en2vi dataset, it helps BLEU by 0.01%). But in random sampling, which in some ways resembles greedy decoding, it might make more of a difference.

davidweichiang · Jul 04 '19 11:07

Then I think it should be included as a strategy for random sampling. For greedy/beam search, I think the current approach is still the right way. Coming back to your example: while "the cat sat on the mat today" and "the cat sat on the mat EOS" have the same length penalty, I think the latter still has a higher probability (it reaches EOS first), so the former can't win.

tnq177 · Jul 04 '19 20:07