coco-caption
Question about BLEU and METEOR
```python
def compute_score(self, gts, res):
    assert(gts.keys() == res.keys())
    imgIds = gts.keys()

    bleu_scorer = BleuScorer(n=self._n)
    for id in imgIds:
        hypo = res[id]
        ref = gts[id]

        # Sanity check.
        assert(type(hypo) is list)
        assert(len(hypo) == 1)
        assert(type(ref) is list)
        assert(len(ref) >= 1)

        bleu_scorer += (hypo[0], ref)

    # score, scores = bleu_scorer.compute_score(option='shortest')
    score, scores = bleu_scorer.compute_score(option='closest', verbose=1)
    # score, scores = bleu_scorer.compute_score(option='average', verbose=1)

    # return (bleu, bleu_info)
    return score, scores
```
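For reference, this is roughly how I call the scorer (a minimal sketch; the ids and captions are made up, and I'm assuming the standard `pycocoevalcap` package layout):

```python
from pycocoevalcap.bleu.bleu import Bleu

# gts: image id -> list of reference captions (pre-tokenized strings)
# res: image id -> single-element list with the candidate caption
gts = {0: ['a man rides a brown horse', 'someone is riding a horse'],
       1: ['two dogs play in the grass']}
res = {0: ['a man rides a horse'],
       1: ['dogs playing outside']}

score, scores = Bleu(n=4).compute_score(gts, res)
# score  -> 4 values, one per n-gram order (BLEU-1 .. BLEU-4)
# scores -> 4 lists, each with one value per image
```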
I find that the return has two values, `score` and `scores`, and I found that `mean(scores)` is not equal to `score`. I want to know what these two return values mean, and under what circumstances `mean(scores) == score`. The same problem occurs with CIDEr. Here is the METEOR `compute_score`:
```python
def compute_score(self, gts, res):
    assert(gts.keys() == res.keys())
    imgIds = gts.keys()
    scores = []

    eval_line = 'EVAL'
    self.lock.acquire()
    for i in imgIds:
        assert(len(res[i]) == 1)
        stat = self._stat(res[i][0], gts[i])
        eval_line += ' ||| {}'.format(stat)

    self.meteor_p.stdin.write('{}\n'.format(eval_line).encode())
    self.meteor_p.stdin.flush()
    for i in range(0, len(imgIds)):
        scores.append(float(self.meteor_p.stdout.readline().strip()))
    score = float(self.meteor_p.stdout.readline().strip())
    self.lock.release()

    return score, scores
```
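And this is the comparison I'm doing between the two return values (a sketch; it assumes the `Meteor` wrapper from `pycocoevalcap`, a working METEOR jar, and the `gts`/`res` dicts from the sketch above):

```python
import numpy as np
from pycocoevalcap.meteor.meteor import Meteor

score, scores = Meteor().compute_score(gts, res)
print(score, np.mean(scores))  # these two are usually not equal
```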
And the following is my result:
```
{'testlen': 14006, 'reflen': 14927, 'guess': [14006, 12389, 10773, 9166], 'correct': [2367, 22, 1, 0]}
ratio: 0.9382997253298762
Bleu_1: 0.1582435446030457
Bleu_2: 0.016220982225013343
Bleu_3: 0.0028384843308123897
Bleu_4: 2.198519789887133e-07
METEOR: 0.04443493208767419
ROUGE_L: 0.16704389834453118
CIDEr: 0.028038780435183798
{'testlen': 14006, 'reflen': 14927, 'guess': [14006, 12389, 10773, 9166], 'correct': [2367, 22, 1, 0]}
ratio: 0.9382997253298762
   val_Bleu_1    val_Bleu_2    val_Bleu_3    val_Bleu_4  val_METEOR  val_ROUGE_L  val_CIDEr
0  1.312883e-01  2.181574e-03  1.214780e-04  1.884038e-08    0.046652     0.167044   0.028039
```
We find that only CIDEr and ROUGE_L agree between the two outputs. I hope to get your help, thanks.
same question!
(I had the same question not long ago, and as far as I understand:)

At least for BLEU, it is usual that `mean(scores) != score` on a corpus:
- Consider the formula for modified precision `p_n` in the original paper (subsection 2.1.1).
- The first sum, over `c in candidates`, adds over all candidates in the corpus, both in the numerator and the denominator.
- This implies the average of individual sentences and the corpus calculation may differ. For example:
  - Sentence 1 has `p_n = A/B` and Sentence 2 has `p_n = C/D`
  - Score on the corpus with both sentences: `(A+C) / (B+D)`
  - Mean of individual scores: `(A/B + C/D) / 2` (not necessarily equal)
- This can be spotted in the NLTK code for the `corpus_bleu()` function, or in this library in the `BleuScorer` class, by comparing the `comps` (individual scores) and `totalcomps` (corpus score) variables; a small runnable demo follows below.
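To make this concrete, here is a small NLTK demo (toy sentences I made up; unigram weights only so the effect is easy to see; the exact numbers don't matter, only that the two values differ):

```python
from nltk.translate.bleu_score import corpus_bleu, sentence_bleu

# One list of references per hypothesis, tokenized.
refs = [[['the', 'cat', 'sat', 'on', 'the', 'mat']],
        [['a', 'dog', 'barked', 'at', 'the', 'mailman']]]
hyps = [['the', 'cat', 'sat', 'on', 'a', 'mat'],
        ['the', 'dog', 'barked']]

# Corpus score: clipped n-gram counts are pooled over all sentences
# before dividing, and the brevity penalty uses pooled lengths.
corpus = corpus_bleu(refs, hyps, weights=(1, 0, 0, 0))

# Mean of per-sentence scores: each sentence is scored on its own.
mean_of_sents = sum(sentence_bleu(r, h, weights=(1, 0, 0, 0))
                    for r, h in zip(refs, hyps)) / len(hyps)

print(corpus, mean_of_sents)  # ~0.637 vs ~0.601 here: pooled != averaged
```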
Hope this helps