BARTScore
Spearman Correlations for Table 4
In Table 4 of the paper, you report COH, FAC, FLU, and INFO on the SummEval dataset. I wanted to know which BARTScore variants you used for each.
From my understanding of the paper, for factuality (FAC) you must have used BARTScore(s->h), i.e. source -> hypothesis.
But I am not clear about FLU, COH, and INFO.
If you could elaborate, that would be really helpful.

On the SummEval dataset, for FLU, COH and INFO, we also used BARTScore(s->h).
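Since the same s->h score is compared against four different human ratings, the Table 4 style numbers come down to one Spearman correlation per dimension. The following is only a rough sketch of that computation in plain Python, not the evaluation script used for the paper. The scores and ratings are made-up placeholder values, and the names `average_ranks`, `spearman`, `s2h_scores`, and `human_ratings` are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Placeholder data (NOT from the paper): one s->h score per summary,
# plus one human rating per summary for each dimension.
s2h_scores = [-2.1, -1.4, -3.0, -0.9, -1.8]
human_ratings = {
    "COH": [3, 4, 2, 5, 3],
    "FAC": [4, 4, 1, 5, 3],
    "FLU": [5, 4, 3, 5, 4],
    "INFO": [2, 4, 2, 4, 3],
}

for dimension, ratings in human_ratings.items():
    print(dimension, round(spearman(s2h_scores, ratings), 3))
```

Whether the correlation is computed over individual summaries or over systems is a separate choice that this sketch does not address.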
So what was the reason for using a single variant (s->h)? Does BARTScore holistically measure the quality of generated text?
For example, can you report the s->h variant of BARTScore and conclude from that score alone that the summaries generated by Model A are better overall than those generated by Model B?
Also, how do you decide which BARTScore variant to use for a particular dataset to measure COH, FLU, INFO, and FAC?
Please let me know.
Here are some rules we followed when deciding which BARTScore variant to use:
- The definition of the evaluation perspective (for example, factuality must rely on the source document).
- The modalities/languages supported by PLMs (for example, for data-to-text we can only use h<->r, because the source and the hypothesis are in different modalities).
However, we agree that designing a metric with multiple interpretable dimensions would be promising future work.
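To make the variants discussed in this thread concrete, here is a minimal sketch of how each one pairs an input text with a target text. It assumes a generic `score_fn(inputs, targets)` that returns one score per pair. The dispatcher `bartscore_variant` and the stand-in `toy_score_fn` are hypothetical names introduced for illustration, and `toy_score_fn` is a simple token-overlap placeholder, not BARTScore. With the real model, the scorer provided in the repository would presumably take its place.

```python
from typing import Callable, List, Optional

ScoreFn = Callable[[List[str], List[str]], List[float]]


def bartscore_variant(
    variant: str,
    score_fn: ScoreFn,
    hyps: List[str],
    srcs: Optional[List[str]] = None,
    refs: Optional[List[str]] = None,
) -> List[float]:
    """Route (input, target) pairs to score_fn according to the variant."""
    if variant == "s->h":  # source -> hypothesis (needs the source document)
        if srcs is None:
            raise ValueError("s->h requires source texts")
        return score_fn(srcs, hyps)
    if refs is None:
        raise ValueError(f"{variant} requires reference texts")
    if variant == "r->h":  # reference -> hypothesis
        return score_fn(refs, hyps)
    if variant == "h->r":  # hypothesis -> reference
        return score_fn(hyps, refs)
    if variant == "h<->r":  # average of both directions
        forward = score_fn(refs, hyps)
        backward = score_fn(hyps, refs)
        return [(f + b) / 2 for f, b in zip(forward, backward)]
    raise ValueError(f"unknown variant: {variant}")


def toy_score_fn(inputs: List[str], targets: List[str]) -> List[float]:
    """Placeholder scorer (NOT BARTScore): minus the fraction of target
    tokens that do not appear in the input."""
    scores = []
    for inp, tgt in zip(inputs, targets):
        inp_tokens = set(inp.lower().split())
        tgt_tokens = tgt.lower().split()
        missing = sum(1 for tok in tgt_tokens if tok not in inp_tokens)
        scores.append(-missing / max(len(tgt_tokens), 1))
    return scores


hyps = ["the cat sat"]
srcs = ["the cat sat on the mat all day"]
refs = ["the cat sat on the mat"]

print(bartscore_variant("s->h", toy_score_fn, hyps, srcs=srcs))
print(bartscore_variant("h<->r", toy_score_fn, hyps, refs=refs))
```

The dispatcher makes the first rule visible in code: s->h cannot be computed without the source, while the reference-based variants never look at it, which is why they remain usable when the source is in a different modality.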