Doubt about scores when length_penalty = avg

Open vince62s opened this issue 1 year ago • 10 comments

@guillaumekln since you wrote the "new" beam_search I have a question. Here: https://github.com/OpenNMT/OpenNMT-py/blob/c8081afa3fab4e169c34a84f9304cf37184f97d7/onmt/translate/beam_search.py#L268-L279 we apply the length_penalty and then recover log_probs. However, when we translate with lp = avg, the PRED SCORES seem to be normalized twice. I can't find out where the issue is. Any clue?

vince62s avatar Jul 13 '22 13:07 vince62s

The score is normalized again when reported in the log:

https://github.com/OpenNMT/OpenNMT-py/blob/2.2.0/onmt/translate/translator.py#L547-L559

guillaumekln avatar Jul 13 '22 14:07 guillaumekln
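
For readers following the thread, here is a minimal sketch of the two normalization points being discussed; the names are illustrative only, not the actual OpenNMT-py code:

    # Minimal sketch (illustrative names, not the actual OpenNMT-py code) of the
    # two normalization points discussed above.

    def beam_final_score(token_log_probs, length_penalty="none"):
        """Score attached to a finished hypothesis by the beam search."""
        total = sum(token_log_probs)              # sum of token log-probs
        if length_penalty == "avg":
            return total / len(token_log_probs)   # already divided by the length
        return total                              # lp = none: raw sum

    def reported_score(final_score, pred_len):
        """What the logging side does: divide by the prediction length again."""
        return final_score / pred_len

    hyp = [-0.5, -1.0, -0.25, -0.25]  # token log-probs of one hypothesis (sum = -2.0)

    # lp = none: the score is normalized once, at reporting time.
    print(reported_score(beam_final_score(hyp, "none"), len(hyp)))  # -0.5

    # lp = avg: normalized in the beam search AND at reporting time -> divided twice.
    print(reported_score(beam_final_score(hyp, "avg"), len(hyp)))   # -0.125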

This I know. BUT it means that with lp = None and lp = avg, the "scores" (from decode_strategy.scores) are not the same: in the first case they are not normalized and in the second they are. Is this supposed to be that way?

vince62s avatar Jul 13 '22 14:07 vince62s

In other words, here: https://github.com/OpenNMT/OpenNMT-py/blob/c8081afa3fab4e169c34a84f9304cf37184f97d7/onmt/translate/beam_search.py#L179 shouldn't we add log-probs instead?

vince62s avatar Jul 13 '22 14:07 vince62s

BUT it means that with lp = None and lp = avg, the "scores" (from decode_strategy.scores) are not the same: in the first case they are not normalized and in the second they are. Is this supposed to be that way?

Yes this is expected. The final score includes all penalties and "lp = avg" is supposed to normalize the score by the prediction length.

shouldn't we add log-probs instead?

No, usually we want to return the final score with all penalties included. The hypotheses are also ordered based on this final score, so the returned score should be consistent with this order.

The issue is more related to the automatic normalization when reporting in the log. I think it should not be done, because the normalization can break the consistency between the hypotheses order and the score (after normalization the 1-best may no longer have the best score).

guillaumekln avatar Jul 13 '22 14:07 guillaumekln
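
A toy illustration of that inconsistency, with made-up numbers and none of the actual OpenNMT-py code:

    # Toy example of the order/score inconsistency described above, assuming
    # lp = none so the beam orders hypotheses by the raw sum of log-probs.
    hyps = [
        {"length": 2, "beam_score": -2.0},  # 1-best according to the beam
        {"length": 4, "beam_score": -2.4},  # 2-best according to the beam
    ]

    # Re-normalizing at reporting time (score / length) flips the ranking:
    for h in hyps:
        h["reported"] = h["beam_score"] / h["length"]

    print([h["reported"] for h in hyps])  # [-1.0, -0.6]: the 2-best now has the best score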

I understand, but in a scenario where we want to dump all scores to do some reranking, it means that we cannot handle them the same way depending on whether the user sets lp=avg or not. Not very convenient. (By reranking I mean a weighted combination with other scores.)

vince62s avatar Jul 13 '22 14:07 vince62s

Yes you would need to know whether the score is already normalized or not.

Otherwise the code can also be updated to not apply this second normalization when needed (either by delegating the normalization to the decoding code or by passing a flag to specify that the score is already normalized). But this does not solve the inconsistency with lp = None where the reported normalized score may not match the hypotheses order.

guillaumekln avatar Jul 13 '22 15:07 guillaumekln

@guillaumekln something weird: If I dump scores along with predictions here: https://github.com/OpenNMT/OpenNMT-py/blob/master/onmt/translate/translator.py#L499

            # pair each n-best prediction with its (beam) score
            n_best_scores = list(map(lambda x: x.item(), trans.pred_scores[: self.n_best]))
            if self.report_align:
                align_pharaohs = [
                    build_align_pharaoh(align)
                    for align in trans.word_aligns[: self.n_best]
                ]
                n_best_preds_align = [
                    " ".join(align) for align in align_pharaohs
                ]
                n_best_preds = [
                    pred + DefaultTokens.ALIGNMENT_SEPARATOR + align
                    for pred, align in zip(
                        n_best_preds, n_best_preds_align
                    )
                ]

            if dynamic:
                n_best_preds = [transform.apply_reverse(x)
                                for x in n_best_preds]
            all_predictions += [n_best_preds]

            # write "prediction<TAB>score" instead of the prediction only
            out_all = list(map(lambda x: x[0] + "\t" + str(x[1]), zip(n_best_preds, n_best_scores)))

            # self.out_file.write("\n".join(n_best_preds) + "\n")
            self.out_file.write("\n".join(out_all) + "\n")
            self.out_file.flush()

With lp=none, if I take the max score in nbest=5 mode I get the same as nbest=1; with lp=avg, I get a difference, meaning that the n-best order does not match the n-best scores.

vince62s avatar Jul 17 '22 12:07 vince62s

I believe this was discussed in #1320.

guillaumekln avatar Jul 18 '22 07:07 guillaumekln

Yes, now I recall this long discussion. So, to some extent, since the best pred is not the best score, we could de-normalize the lp=avg scores so that at reporting time we keep things the way they are, no?

vince62s avatar Jul 18 '22 08:07 vince62s

Yes, this works to avoid the double normalization.

guillaumekln avatar Jul 18 '22 09:07 guillaumekln
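
To make the de-normalization idea concrete, a minimal sketch; the helper is hypothetical, not an existing function in the codebase:

    # Sketch of the de-normalization idea (hypothetical helper): multiply the
    # averaged score back by the hypothesis length, so that the existing
    # division at reporting time gives the averaged score back and nothing is
    # normalized twice.
    def denormalize(avg_score, pred_len):
        return avg_score * pred_len

    avg_score, pred_len = -0.5, 4
    stored = denormalize(avg_score, pred_len)  # -2.0, looks like a raw sum again
    print(stored / pred_len)                   # -0.5: only one normalization in the end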

Where would be the best place to do it?

vince62s avatar Aug 25 '22 09:08 vince62s

Actually I think the denormalization would be more confusing. There is the same issue with lp=wu where the scores should not be normalized again (the score is already divided by a term depending on the length).

There are probably 2 possible solutions:

  • Remove the normalization when producing the log and let the user do the work to determine whether the scores should be further normalized.

or

  • In the beam search result, include both the final beam score used to order the hypotheses and the normalized score (when lp=avg, both scores are equal). Then an option could be added to select which score should be written in the log.

guillaumekln avatar Aug 31 '22 08:08 guillaumekln
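
A rough sketch of what the second option could look like; all names (ScoredHypothesis, report_normalized, ...) are hypothetical, not part of the actual beam search result:

    # Keep both scores per hypothesis and let an option choose which one goes
    # to the log. All names are hypothetical.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class ScoredHypothesis:
        tokens: List[str]
        beam_score: float        # final score used to order the beam, penalties included
        normalized_score: float  # length-normalized score (equal to beam_score when lp = avg)

    def score_for_log(hyp: ScoredHypothesis, report_normalized: bool) -> float:
        return hyp.normalized_score if report_normalized else hyp.beam_score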

@francoishernandez are you OK with the first solution? It looks OK to me.

vince62s avatar Aug 31 '22 11:08 vince62s

Hmm the second one seems a bit more user-friendly to me.

francoishernandez avatar Sep 02 '22 13:09 francoishernandez

Well, so far nobody has screamed about the reported scores... Also I have a suggestion: let's take length_penalty average and add alpha to the "avg" mode using cur_len ** alpha (see the sketch after the list below). The reasons are:

  1. lp=avg is almost always better
  2. it matches fairseq's behavior
  3. tuning alpha enables further optimization when using noisy channel reranking (cf. FB's paper: https://www.statmt.org/wmt19/pdf/53/WMT33.pdf)

for reference: https://github.com/facebookresearch/fairseq/issues/500

vince62s avatar Sep 02 '22 14:09 vince62s
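
A sketch of the suggested penalty, assuming the fairseq-style cur_len ** alpha form rather than the ((5 + len) / 6) ** alpha formula from Google's (GNMT) paper:

    # Sketch of the proposed "avg with alpha" penalty (illustrative names).
    def length_penalty_avg(cur_len: int, alpha: float = 1.0) -> float:
        return cur_len ** alpha

    def penalized_score(sum_log_probs: float, cur_len: int, alpha: float = 1.0) -> float:
        return sum_log_probs / length_penalty_avg(cur_len, alpha)

    # alpha = 1.0 reproduces the current lp = avg behavior; alpha = 0.0 disables
    # the length normalization entirely.
    print(penalized_score(-2.0, 4, alpha=1.0))  # -0.5
    print(penalized_score(-2.0, 4, alpha=0.0))  # -2.0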

For reference, CTranslate2 with normalize_scores + length_penalty also uses cur_len ** alpha instead of the equation from Google's paper.

guillaumekln avatar Sep 02 '22 14:09 guillaumekln

So for now, my suggestion is:

  • default length_penalty=avg

  • default alpha=1

  • change the lp to use ** alpha

  • remove the normalization in the _report_score() function

Later on we can always add a normalized score returned by the beam_search. Are you guys OK with this?

vince62s avatar Sep 02 '22 14:09 vince62s
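
A rough sketch of the reporting-side part of this proposal; the signature and message format are assumptions, not the actual _report_score code:

    # Hypothetical reporting helper: print the scores as produced by the beam
    # search instead of dividing them by the total number of words again.
    def report_score(name: str, score_total: float, nb_sentences: int) -> str:
        # Before: something like score_total / words_total, i.e. a second
        # normalization when lp = avg (or wu) already normalized the scores.
        # After: only average over sentences for readability, no per-word division.
        return "%s SCORE: %.4f" % (name, score_total / nb_sentences)

    print(report_score("PRED", -12.8, 32))  # PRED SCORE: -0.4000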