
Question about the BLEU and METEOR scores

yeshenpy opened this issue on Jun 18 '19

def compute_score(self, gts, res):

    assert(gts.keys() == res.keys())
    imgIds = gts.keys()

    bleu_scorer = BleuScorer(n=self._n)
    for id in imgIds:
        hypo = res[id]
        ref = gts[id]

        # Sanity check.
        assert(type(hypo) is list)
        assert(len(hypo) == 1)
        assert(type(ref) is list)
        assert(len(ref) >= 1)

        bleu_scorer += (hypo[0], ref)

    #score, scores = bleu_scorer.compute_score(option='shortest')
    score, scores = bleu_scorer.compute_score(option='closest', verbose=1)
    # score, scores = bleu_scorer.compute_score(option='average', verbose=1)

    # return (bleu, bleu_info)
    return score, scores

I see that compute_score returns two values, score and scores, and I found that mean(scores) is not equal to score. I want to know what these two return values mean, and under what circumstances mean(scores) == score. The same question applies to CIDEr, and to METEOR (code below):

def compute_score(self, gts, res):
    assert(gts.keys() == res.keys())
    imgIds = gts.keys()
    scores = []

    eval_line = 'EVAL'
    self.lock.acquire()
    for i in imgIds:
        assert(len(res[i]) == 1)
        stat = self._stat(res[i][0], gts[i])
        eval_line += ' ||| {}'.format(stat)

    # Send one EVAL request for the whole batch; the wrapper then reads one
    # score per image, followed by one final line that it uses as the overall score.
    self.meteor_p.stdin.write('{}\n'.format(eval_line).encode())
    self.meteor_p.stdin.flush()
    for i in range(0, len(imgIds)):
        scores.append(float(self.meteor_p.stdout.readline().strip()))
    score = float(self.meteor_p.stdout.readline().strip())
    self.lock.release()

    return score, scores

And the following is my output:

{'testlen': 14006, 'reflen': 14927, 'guess': [14006, 12389, 10773, 9166], 'correct': [2367, 22, 1, 0]}
ratio: 0.9382997253298762
Bleu_1:  0.1582435446030457
Bleu_2:  0.016220982225013343
Bleu_3:  0.0028384843308123897
Bleu_4:  2.198519789887133e-07
METEOR:  0.04443493208767419
ROUGE_L: 0.16704389834453118
CIDEr:   0.028038780435183798
{'testlen': 14006, 'reflen': 14927, 'guess': [14006, 12389, 10773, 9166], 'correct': [2367, 22, 1, 0]}
ratio: 0.9382997253298762
     val_Bleu_1    val_Bleu_2    val_Bleu_3    val_Bleu_4  val_METEOR  val_ROUGE_L  val_CIDEr
0  1.312883e-01  2.181574e-03  1.214780e-04  1.884038e-08    0.046652     0.167044   0.028039

We find that the two values match only for CIDEr and ROUGE_L. I hope to get your help, thanks.

yeshenpy · Jun 18 '19

same question!

XinhaoMei · Jul 28 '21

(I had the same question not long ago, and as far as I understand:)

At least for BLEU, it is usual that mean(scores) != score over a corpus:

  • Consider the formula for modified precision p_n in the original paper (subsection 2.1.1):
    p_n = (sum over candidates C, sum over n-grams in C of Count_clip(ngram)) / (sum over candidates C', sum over n-grams in C' of Count(ngram'))
  • The outer sum over candidates, in both the numerator and the denominator, runs over all candidate sentences in the corpus
  • This implies that the average of per-sentence values and the corpus-level calculation may differ (see the short sketch after this list), for example:
    • Sentence 1 p_n = A/B and Sentence 2 p_n = C/D
    • Score on the corpus with both sentences: (A+C) / (B+D)
    • Mean of the individual scores: (A/B + C/D)/2 (not necessarily equal)
  • This can be spotted in the NLTK code for the corpus_bleu() function, or in this library's BleuScorer class by comparing the comps (individual scores) and totalcomps (corpus score) variables
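
To make the last example concrete, here is a tiny numeric sketch (plain Python with made-up clipped counts, not this library's code) comparing the corpus-level modified precision with the mean of the per-sentence precisions:

    # Toy counts for unigram (n=1) modified precision, following the A/B, C/D example above.
    # Sentence 1: A = 3 clipped matches out of B = 4 candidate unigrams.
    # Sentence 2: C = 1 clipped match  out of D = 10 candidate unigrams.
    A, B = 3, 4
    C, D = 1, 10

    corpus_p1 = (A + C) / (B + D)        # corpus-level p_1: 4 / 14 ≈ 0.286
    mean_p1 = ((A / B) + (C / D)) / 2    # mean of per-sentence p_1: 0.425

    print(corpus_p1, mean_p1)            # not equal

For the precision term alone, the corpus value is a denominator-weighted mean of the per-sentence values, so the two coincide only in special cases (e.g. when every sentence contributes the same denominator). On top of that, the brevity penalty and the geometric mean over n are also computed at the corpus level, so in general mean(scores) != score for BLEU.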

Hope this helps

pdpino · Oct 19 '21