[Feature Request] Benchmark communicative agents' generated dialogue
Required prerequisites
- [X] I have searched the Issue Tracker and Discussions that this hasn't already been reported. (+1 or comment there if it has.)
- [ ] Consider asking first in a Discussion.
Motivation
Changes such as modifying or adding agents in RolePlaying, or changing the agents' system prompts, can have a great influence on the generated dialogue. Unfortunately, we currently have no way to tell whether such a change is good or bad, so it would be helpful to have some scripts that benchmark the generated results.
Solution
In general, we can compute and compare metrics on the generated results before and after a change.
Traditional neural dialogue systems use several metrics to measure the quality of generated dialogue, including:
- Distinct-N (measures the diversity of a sentence by penalizing sentences with many repeated words)
- ROUGE (measures summary quality by the n-gram overlap between a generated summary and a reference summary; note that this requires reference text)
- BLEU (measures translation quality against reference translations)
- Topic Coherence, etc.

However, these metrics have restrictions: some require reference text and some are tied to a specific task. A reference-free metric like Distinct-N is sketched below.
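As a minimal sketch of the reference-free case, assuming the benchmark script already has the generated dialogue as a list of utterance strings (the `distinct_n` helper name is only illustrative, not an existing camel API):

```python
# Sketch: Distinct-N over generated utterances (no reference text needed).
from collections import Counter


def distinct_n(utterances, n=2):
    """Ratio of unique n-grams to total n-grams across all utterances."""
    ngrams = Counter()
    for utt in utterances:
        tokens = utt.lower().split()
        ngrams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(ngrams.values())
    return len(ngrams) / total if total > 0 else 0.0


# Compare the same metric before and after a change to RolePlaying / prompts.
dialogue_before = ["I think we should start with a plan.", "Yes, a plan sounds good."]
dialogue_after = ["Let us start with a plan.", "Agreed, and then assign the tasks."]
print(distinct_n(dialogue_before, n=2), distinct_n(dialogue_after, n=2))
```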
Another approach is simply to ask GPT-4 to score the generated results, e.g. with a prompt like the one sketched below.
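A rough sketch of that idea, assuming the `openai` Python client (v1 API) and an `OPENAI_API_KEY` in the environment; the scoring prompt and the `score_dialogue` helper are placeholders, not part of camel:

```python
# Sketch: ask GPT-4 to rate a generated dialogue (assumes openai>=1.0 and
# OPENAI_API_KEY set in the environment; the prompt wording is a placeholder).
from openai import OpenAI

client = OpenAI()


def score_dialogue(dialogue):
    """Ask GPT-4 for a 1-10 quality score with a short justification."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You are a strict judge of multi-agent dialogues."},
            {"role": "user",
             "content": "Rate the following dialogue from 1 (poor) to 10 "
                        "(excellent) for coherence, task completion, and "
                        "diversity, then explain briefly.\n\n" + dialogue},
        ],
        temperature=0,
    )
    return response.choices[0].message.content


print(score_dialogue("User: Plan a trip.\nAssistant: Sure, here is a plan..."))
```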
These are just some draft ideas; any comments or suggestions are welcome.
Alternatives
No response
Additional context
No response