
[Feature] Add a new evaluation dataset

Open wyzh0912 opened this issue 1 year ago • 3 comments

Describe the feature

We want to contribute our benchmark to OpenCompass. Here is the repo: https://github.com/IAAR-Shanghai/UHGEval. However, there is an issue: in one of the tasks in this benchmark, multiple model calls are required to obtain the final evaluation result for a single data point. We have observed that most benchmarks involve only one model call per data point. Could you let us know whether OpenCompass supports multiple model calls within a single evaluation?

Will you implement it?

  • [X] I would like to implement this feature and create a PR!

wyzh0912 avatar Jan 13 '24 15:01 wyzh0912

`models` is a list of dicts. You can evaluate multiple models with one config.

tonysy avatar Jan 14 '24 12:01 tonysy

> `models` is a list of dicts. You can evaluate multiple models with one config.

My earlier question may not have been clear. What I mean is that evaluating a single data point requires constructing multiple prompts and calling the model multiple times. For example, in discriminative evaluation, a data point contains two sentences: one with hallucinated content and one without. I need to call the model twice. The first call checks whether the model correctly identifies that the first sentence contains hallucinated content; the second call checks whether it correctly identifies that the second sentence does not. The evaluation is considered successful only when both judgments are correct.

The code is as follows:

answer_hallu, reason_hallu = self.model.is_continuation_hallucinated(hallu, data_point, with_reason=True)
answer_unhallu, reason_unhallu = self.model.is_continuation_hallucinated(unhallu, data_point, with_reason=True)

Each execution of the is_continuation_hallucinated method will call the model once.

wyzh0912 avatar Jan 14 '24 13:01 wyzh0912

Do we currently have a benchmark dedicated to evaluating the translation ability of large models? How do I use it? Thanks.

White-Friday avatar Jan 15 '24 03:01 White-Friday

@White-Friday Please check Flores. Feel free to re-open if needed.

tonysy avatar Feb 28 '24 14:02 tonysy