Question about human ratings
Hello, do you happen to still have the COCO human ratings, just to see the setup? Were M1 and M2 binary, or were they on a more fine-grained scale?
What is the setup for reporting the Pearson correlation? Is the following correct:
from scipy.stats import pearsonr

# Example BLEU scores for different systems (averaged per system over a corpus, I suppose?)
bleu_scores = [0.4, 0.45, 0.5, 0.55, 0.6]

# Corresponding human judgment scores for the same systems (could be M1, M2, or any relevant metric),
# e.g., M1 = percentage of captions judged better than or equal to human captions
human_judgment_scores = [0.8, 0.75, 0.9, 0.85, 0.95]

pearson_correlation, p_value = pearsonr(bleu_scores, human_judgment_scores)
print(pearson_correlation, p_value)
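For concreteness, here is a minimal sketch of what I mean by "averaged per system": segment-level BLEU scores are averaged to one score per system, and the correlation is then computed across systems against the system-level human scores. All numbers and system names below are hypothetical.

import numpy as np
from scipy.stats import pearsonr

# Hypothetical segment-level BLEU scores for each system
system_bleu = {
    "sys1": [0.38, 0.42, 0.40],
    "sys2": [0.50, 0.47, 0.53],
    "sys3": [0.60, 0.58, 0.62],
}
# One averaged BLEU score per system
bleu_per_system = [np.mean(scores) for scores in system_bleu.values()]

# Hypothetical system-level human scores (e.g., M1), in the same system order
human_per_system = [0.75, 0.85, 0.95]

r, p = pearsonr(bleu_per_system, human_per_system)
print(f"Pearson r = {r:.3f} (p = {p:.3f})")

Is this per-system averaging the setup you used, or did you correlate at the caption level?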
Thank you, JB