OpenNMT-py
Doubt about scores when length_penalty = avg
@guillaumekln since you wrote the "new" beam_search I have a question. https://github.com/OpenNMT/OpenNMT-py/blob/c8081afa3fab4e169c34a84f9304cf37184f97d7/onmt/translate/beam_search.py#L268-L279 Here we apply the length_penalty and then recover log_probs. However, when we translate with lp = avg, the PRED SCORES seem to be normalized twice. I can't find where the issue is. Any clue?
The score is normalized again when reported in the log:
https://github.com/OpenNMT/OpenNMT-py/blob/2.2.0/onmt/translate/translator.py#L547-L559
This I know. BUT it means that when lp = None vs lp = avg, the "scores" (from decode_strategy.scores) are not the same: in the first case they are not normalized, and in the second they are. Is this supposed to be that way?
In other words, here: https://github.com/OpenNMT/OpenNMT-py/blob/c8081afa3fab4e169c34a84f9304cf37184f97d7/onmt/translate/beam_search.py#L179 shouldn't we add the log-probs instead?
> BUT it means that when lp = None or lp = avg, the "scores" (from decode_strategy.scores) are not the same, in the first case they are not normalized and in the second they are normalized. Is this supposed to be that way?
Yes this is expected. The final score includes all penalties and "lp = avg" is supposed to normalize the score by the prediction length.
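To make it concrete, lp=avg turns the cumulative log-probability into a per-token average (a toy sketch, not the actual OpenNMT-py code; the function name is mine):

```python
def beam_score(cum_log_prob: float, length: int, lp: str = "none") -> float:
    """Toy illustration of the score used to rank beam hypotheses,
    with an optional 'avg' length penalty."""
    if lp == "avg":
        # lp=avg divides the cumulative log-probability by the length,
        # i.e. it returns the mean token log-probability.
        return cum_log_prob / length
    return cum_log_prob

# Two hypotheses: a short one and a longer, cumulatively less probable one.
short = beam_score(-2.0, length=2, lp="avg")   # mean log-prob -1.0
long_ = beam_score(-2.7, length=3, lp="avg")   # mean log-prob ≈ -0.9
# With lp=avg the longer hypothesis ranks first; with lp=none it would not.
```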
> shouldn't we add log-probs instead?
No, usually we want to return the final score with all penalties included. The hypotheses are also ordered based on this final score, so the returned score should be consistent with this order.
The issue is more related to the automatic normalization when reporting in the log. I think it should not be done, because the normalization can break the consistency between the hypotheses order and the score (after normalization, the 1-best may no longer have the best score).
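The double normalization can be shown numerically (a toy sketch, not the actual translator code):

```python
cum_log_prob = -6.0
length = 3

# What beam search returns when lp=avg: already length-normalized.
returned_score = cum_log_prob / length   # -2.0

# The reporting code then divides by the prediction length again,
# so the logged value is normalized twice.
reported_score = returned_score / length  # ≈ -0.667, not the beam score
```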
I understand, but in a scenario where we want to dump all scores to do some reranking, it means that we cannot handle things the same way whether the user sets lp=avg or not. Not very convenient. (By reranking I mean a weighted combination with other scores.)
Yes you would need to know whether the score is already normalized or not.
Otherwise the code can also be updated to not apply this second normalization when needed (either by delegating the normalization to the decoding code or by passing a flag to specify that the score is already normalized). But this does not solve the inconsistency with lp = None where the reported normalized score may not match the hypotheses order.
@guillaumekln something weird: if I dump scores along with predictions here: https://github.com/OpenNMT/OpenNMT-py/blob/master/onmt/translate/translator.py#L499
```python
n_best_scores = list(map(lambda x: x.item(), trans.pred_scores[: self.n_best]))
if self.report_align:
    align_pharaohs = [
        build_align_pharaoh(align)
        for align in trans.word_aligns[: self.n_best]
    ]
    n_best_preds_align = [
        " ".join(align) for align in align_pharaohs
    ]
    n_best_preds = [
        pred + DefaultTokens.ALIGNMENT_SEPARATOR + align
        for pred, align in zip(
            n_best_preds, n_best_preds_align
        )
    ]
if dynamic:
    n_best_preds = [transform.apply_reverse(x)
                    for x in n_best_preds]
all_predictions += [n_best_preds]
out_all = list(map(lambda x: x[0] + "\t" + str(x[1]), zip(n_best_preds, n_best_scores)))
# self.out_file.write("\n".join(n_best_preds) + "\n")
self.out_file.write("\n".join(out_all) + "\n")
self.out_file.flush()
```
With lp=none, if I take the max score in n_best=5 mode, I get the same result as n_best=1. With lp=avg, I see a difference, meaning that the n-best order does not match the n-best scores.
I believe this was discussed in #1320.
Yes, now I recall this long discussion. So, since the best pred is not necessarily the best score, we could de-normalize the lp=avg scores so that at reporting time we keep things the way they are, no?
Yes, this works to avoid the double normalization.
where would be the best place to do it ?
Actually I think the denormalization would be more confusing. There is the same issue with lp=wu where the scores should not be normalized again (the score is already divided by a term depending on the length).
There are probably 2 possible solutions:
- Remove the normalization when producing the log and let the user do the work to determine whether the scores should be further normalized.
or
- In the beam search result, include both the final beam score used to order the hypotheses and the normalized score (when lp=avg, both scores are equal). Then an option could be added to select which score should be written in the log.
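The second option could look something like this (a hypothetical sketch, not the actual OpenNMT-py API; all names are mine):

```python
from dataclasses import dataclass

@dataclass
class HypothesisScore:
    # Hypothetical container holding both values.
    beam_score: float        # final score used to order the hypotheses (penalties included)
    normalized_score: float  # length-normalized score for reporting

def make_scores(cum_log_prob: float, length: int, lp: str = "none") -> HypothesisScore:
    """Sketch: when lp=avg the two scores coincide; otherwise the
    normalized score is derived once, so the log never re-normalizes."""
    if lp == "avg":
        score = cum_log_prob / length
        return HypothesisScore(beam_score=score, normalized_score=score)
    return HypothesisScore(beam_score=cum_log_prob,
                           normalized_score=cum_log_prob / length)
```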
@francoishernandez are you ok with the first solution ? it looks ok to me.
Hmm the second one seems a bit more user-friendly to me.
Well, so far nobody has screamed about the reported scores... Also, I have a suggestion: let's take the average length_penalty and add an alpha to the "avg" mode using cur_len ** alpha. The reasons are:
- lp=avg is almost always better
- it matches fairseq's behavior
- tuning alpha enables optimization when using noisy channel reranking (cf. FB's paper: https://www.statmt.org/wmt19/pdf/53/WMT33.pdf)
for reference: https://github.com/facebookresearch/fairseq/issues/500
For reference, CTranslate2 with normalize_scores + length_penalty also uses cur_len ** alpha instead of the equation from Google's paper.
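Both conventions can be written out concretely (a sketch of the two common length-penalty formulations; function names are mine):

```python
def lp_simple(cur_len: int, alpha: float) -> float:
    # fairseq/CTranslate2-style penalty: cur_len ** alpha.
    return cur_len ** alpha

def lp_gnmt(cur_len: int, alpha: float) -> float:
    # Length penalty from Google's GNMT paper (Wu et al., 2016):
    # ((5 + cur_len) / 6) ** alpha.
    return ((5 + cur_len) / 6) ** alpha
```

With alpha=1 on a length-12 hypothesis, lp_simple gives 12.0 while lp_gnmt gives about 2.83, so the two conventions are not interchangeable when tuning alpha.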
So for now, my suggestion is:
- default length_penalty=avg
- default alpha=1
- change the lp with ** alpha
- remove the normalization in the _report_score() function

Later on we can always add a normalized score returned by the beam_search. Are you guys ok with this?
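A minimal sketch of what the proposal amounts to (function names are mine, not the actual OpenNMT-py API):

```python
def avg_length_penalty(cur_len: int, alpha: float = 1.0) -> float:
    # Proposed "avg" penalty with a tunable exponent;
    # alpha=1.0 reproduces the current plain average.
    return cur_len ** alpha

def final_score(cum_log_prob: float, cur_len: int, alpha: float = 1.0) -> float:
    # Score that would be written out as-is, with no second
    # normalization at reporting time.
    return cum_log_prob / avg_length_penalty(cur_len, alpha)
```

With alpha=1 this is exactly the current lp=avg behavior, and alpha=0 degrades to no length penalty, so the single formula covers both modes.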