[Feature Request] Benchmark communicative agents' generated dialogue
Required prerequisites
- [X] I have searched the Issue Tracker and Discussions that this hasn't already been reported. (+1 or comment there if it has.)
- [ ] Consider asking first in a Discussion.
Motivation
Changes such as modifying or adding agents in RolePlaying, or changing the agents' system prompts, can have a great influence on the generated dialogue. Unfortunately, we currently have no way to tell whether such a change is good or bad, so it would be helpful to have some scripts that benchmark the generated results.
Solution
In general, we can compute and compare metrics on the generated results before and after a change.
Traditional neural dialogue systems use several metrics to measure the quality of generated dialogue, including:
- Distinct-N (measures the diversity of a sentence by penalizing sentences with many repeated words)
- ROUGE (measures summary quality by the n-gram overlap between a generated summary and a reference summary; note that this requires reference text)
- BLEU (measures translation quality against reference translations)
- Topic Coherence, etc.

However, these metrics have restrictions: some require reference text and some are tied to a specific task. A reference-free metric like Distinct-N is sketched below.
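As a minimal sketch of the reference-free case, assuming the benchmark script already has the generated dialogue as a list of utterance strings (the `distinct_n` helper name is only illustrative, not an existing camel API):

```python
# Sketch: Distinct-N over generated utterances (no reference text needed).
from collections import Counter


def distinct_n(utterances, n=2):
    """Ratio of unique n-grams to total n-grams across all utterances."""
    ngrams = Counter()
    for utt in utterances:
        tokens = utt.lower().split()
        ngrams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(ngrams.values())
    return len(ngrams) / total if total > 0 else 0.0


# Compare the same metric before and after a change to RolePlaying / prompts.
dialogue_before = ["I think we should start with a plan.", "Yes, a plan sounds good."]
dialogue_after = ["Let us start with a plan.", "Agreed, and then assign the tasks."]
print(distinct_n(dialogue_before, n=2), distinct_n(dialogue_after, n=2))
```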
Another approach is simply to ask GPT-4 to score the generated results, e.g. with a prompt like the one sketched below.
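A rough sketch of that idea, assuming the `openai` Python client (v1 API) and an `OPENAI_API_KEY` in the environment; the scoring prompt and the `score_dialogue` helper are placeholders, not part of camel:

```python
# Sketch: ask GPT-4 to rate a generated dialogue (assumes openai>=1.0 and
# OPENAI_API_KEY set in the environment; the prompt wording is a placeholder).
from openai import OpenAI

client = OpenAI()


def score_dialogue(dialogue):
    """Ask GPT-4 for a 1-10 quality score with a short justification."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You are a strict judge of multi-agent dialogues."},
            {"role": "user",
             "content": "Rate the following dialogue from 1 (poor) to 10 "
                        "(excellent) for coherence, task completion, and "
                        "diversity, then explain briefly.\n\n" + dialogue},
        ],
        temperature=0,
    )
    return response.choices[0].message.content


print(score_dialogue("User: Plan a trip.\nAssistant: Sure, here is a plan..."))
```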
These are just some draft ideas; any comments or suggestions are welcome.
Alternatives
No response
Additional context
No response