Question about human ratings
Hello, do you happen to still have the COCO human ratings, just to see the setup? Were M1 and M2 binary, or were they on a more fine-grained scale?
What is the setup for reporting the Pearson correlation? Is the following correct:
from scipy.stats import pearsonr

# Example BLEU scores for different systems (averaged per system over a corpus, I suppose?)
bleu_scores = [0.4, 0.45, 0.5, 0.55, 0.6]

# Corresponding human judgment scores for the same systems (could be M1, M2, or any relevant metric),
# e.g., M1 = percentage of captions judged better than or equal to human captions
human_judgment_scores = [0.8, 0.75, 0.9, 0.85, 0.95]

pearson_correlation, p_value = pearsonr(bleu_scores, human_judgment_scores)
print(pearson_correlation, p_value)
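For concreteness, here is a minimal sketch of what I mean by "averaged per system": segment-level BLEU scores are averaged to one score per system, and the correlation is then computed across systems against the system-level human scores. All numbers and system names below are hypothetical.

import numpy as np
from scipy.stats import pearsonr

# Hypothetical segment-level BLEU scores for each system
system_bleu = {
    "sys1": [0.38, 0.42, 0.40],
    "sys2": [0.50, 0.47, 0.53],
    "sys3": [0.60, 0.58, 0.62],
}
# One averaged BLEU score per system
bleu_per_system = [np.mean(scores) for scores in system_bleu.values()]

# Hypothetical system-level human scores (e.g., M1), in the same system order
human_per_system = [0.75, 0.85, 0.95]

r, p = pearsonr(bleu_per_system, human_per_system)
print(f"Pearson r = {r:.3f} (p = {p:.3f})")

Is this per-system averaging the setup you used, or did you correlate at the caption level?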
Thank you, JB