Evidence of bias in the "fun" evaluation method using GPT-4 scores?
As described in the main post, the evaluation method is presented as a cool but informal idea rather than a rigorous approach. Even so, it is used to create some pretty compelling plots showing the performance of vicuna relative to other models.
I wanted to present some simple evidence that I've found of bias in the evaluation method. Please don't take this as a criticism -- more just some interesting observations!
The first is a "close-to-home" bias, and the second is an "ordering" bias. Both are based on very little data, so they may themselves be subject to criticism.
Method
First, I wrapped the evaluation code in the repo to aggregate all the scores into the following form (see aggregate_scores_from_table.py):
{
  "COMPARISON MODEL": {
    "total_score1": 576.0,    <-- for comparison model
    "total_score2": 696.5,    <-- for base model (vicuna)
    "failed_eval_count": 0,   <-- didn't get a score
    "better1": 4,             <-- comparison model had higher score
    "better2": 76,            <-- base model had higher score
    "tie": 0                  <-- tied scores
  },
  ...
}
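For the curious, here's roughly what the aggregation boils down to -- a minimal sketch, assuming one JSON record per question with hypothetical fields "model1", "score1", and "score2", which may not match the repo's review file format exactly:

import json
from collections import defaultdict

def aggregate(review_path):
    """Roll per-question pairwise scores up into per-model totals."""
    agg = defaultdict(lambda: {
        "total_score1": 0.0, "total_score2": 0.0,
        "failed_eval_count": 0, "better1": 0, "better2": 0, "tie": 0,
    })
    with open(review_path) as f:
        for line in f:
            rec = json.loads(line)
            entry = agg[rec["model1"]]  # keyed on the comparison model
            s1, s2 = rec.get("score1"), rec.get("score2")
            if s1 is None or s2 is None:
                entry["failed_eval_count"] += 1  # didn't get a score
                continue
            entry["total_score1"] += s1
            entry["total_score2"] += s2
            if s1 > s2:
                entry["better1"] += 1
            elif s2 > s1:
                entry["better2"] += 1
            else:
                entry["tie"] += 1
    return dict(agg)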
Second, I used gpt-3.5-turbo to re-run the evaluations already present in the repository. And I ran one more evaluation: vicuna-13b against itself (meaning the same answers passed as model1 and model2). Note that in my re-runs, I went through and manually assigned the scores whenever they were present somewhere in the evaluator's text but failed to parse automatically.
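The re-run is essentially the same pairwise review loop pointed at gpt-3.5-turbo. A minimal sketch of the core call, assuming the pre-1.0 openai client that was current at the time -- the prompt wording and the number-extraction fallback here are my own simplification, not the repo's exact prompt:

import re
import openai  # pre-1.0 client (openai.ChatCompletion)

def judge_pair(question, answer1, answer2, model="gpt-3.5-turbo"):
    """Ask the evaluator model to score two answers; returns (s1, s2) or None."""
    prompt = (
        f"Question: {question}\n\n"
        f"Assistant 1: {answer1}\n\n"
        f"Assistant 2: {answer2}\n\n"
        "Rate each assistant on a scale of 1 to 10. "
        "Output the two scores on the first line, separated by a space."
    )
    resp = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    text = resp["choices"][0]["message"]["content"]
    nums = re.findall(r"\d+(?:\.\d+)?", text.strip().split("\n")[0])
    if len(nums) < 2:
        # Scores don't always land on the first line; scan the whole
        # reply instead (an automated version of the manual cleanup
        # mentioned above -- still fragile).
        nums = re.findall(r"\d+(?:\.\d+)?", text)
    if len(nums) >= 2:
        return float(nums[0]), float(nums[1])
    return None  # counted toward failed_eval_count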
"Close-to-home" bias
The full aggregated results:
Evaluator: gpt-4 (original evaluator used by the vicuna team)
{
  "alpaca-13b": {
    "total_score1": 576.0,
    "total_score2": 696.5,
    "failed_eval_count": 0,
    "better1": 4,
    "better2": 76,
    "tie": 0
  },
  "gpt35": {
    "total_score1": 684.0,
    "total_score2": 633.5,
    "failed_eval_count": 0,
    "better1": 42,
    "better2": 22,
    "tie": 16
  },
  "llama": {
    "total_score1": 523.0,
    "total_score2": 695.0,
    "failed_eval_count": 0,
    "better1": 5,
    "better2": 75,
    "tie": 0
  },
  "bard": {
    "total_score1": 661.5,
    "total_score2": 662.0,
    "failed_eval_count": 0,
    "better1": 28,
    "better2": 39,
    "tie": 13
  }
}
Evaluator: gpt-3.5-turbo
{
  "alpaca-13b": {
    "total_score1": 583.0,
    "total_score2": 700.0,
    "failed_eval_count": 0,
    "better1": 3,
    "better2": 77,
    "tie": 0
  },
  "gpt35": {
    "total_score1": 627.0,
    "total_score2": 665.0,
    "failed_eval_count": 0,
    "better1": 14,
    "better2": 65,
    "tie": 1
  },
  "llama-13b": {
    "total_score1": 588.0,
    "total_score2": 708.0,
    "failed_eval_count": 0,
    "better1": 0,
    "better2": 79,
    "tie": 1
  },
  "bard": {
    "total_score1": 609.0,
    "total_score2": 669.0,
    "failed_eval_count": 1,
    "better1": 8,
    "better2": 71,
    "tie": 0
  }
}
A few observations:
- The score for vicuna changes by ~10% depending on its pairing!? (true for both the gpt-4 and gpt-3.5-turbo evaluations; quick check below)
- Generally the vicuna score seems higher when it's paired against a weaker model (695 vs llama), and lower when paired with stronger models (633.5 vs gpt3.5).
- Other than gpt-3.5's own score, evaluator gpt-3.5-turbo is actually pretty similar to evaluator gpt-4.
- GPT-3.5 seems to be hard on itself -- gpt-4 gives it a higher score (684) than it gives itself (627)!
It is this last point that I call the "close-to-home" bias. I don't know why, but while you might expect gpt-3.5-turbo to bias its own responses higher... it does the opposite.
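For concreteness, here's the quick check behind that ~10% figure, using vicuna's totals under the gpt-4 evaluator from the aggregate above:

# vicuna's total_score2 under the gpt-4 evaluator, by pairing
vicuna = {"alpaca-13b": 696.5, "gpt35": 633.5, "llama": 695.0, "bard": 662.0}
lo, hi = min(vicuna.values()), max(vicuna.values())
print(f"spread: {hi - lo:.1f} points ({(hi - lo) / lo:.1%} of the low score)")
# -> spread: 63.0 points (9.9% of the low score)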
"Ordering" bias
Running vicuna-13b against itself using evaluator gpt-3.5-turbo yielded a surprising result! The second model (with responses identical to the first) scored significantly higher.
"vicuna-13b-20230322-new-hp-fp16": {
"total_score1": 606.0,
"total_score2": 620.0,
"failed_eval_count": 1,
"better1": 1,
"better2": 11,
"tie": 67
},
While it's tricky to compare results across different pairwise comparisons, the 14-point gap is on the order of the difference between the Bard and gpt-3.5 scores. Since vicuna was always set as the second model in the original evaluation, the results (as presented in the blog post) may have a systematic upward bias.
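If anyone re-runs these evaluations, one simple mitigation would be to score each pair in both orders and average the two, which cancels a pure position bias. A sketch, building on the hypothetical judge_pair helper above:

def judge_pair_debiased(question, a1, a2, **kw):
    """Score both orderings and average, cancelling any position bias."""
    fwd = judge_pair(question, a1, a2, **kw)
    rev = judge_pair(question, a2, a1, **kw)
    if fwd is None or rev is None:
        return None
    # rev comes back as (score for a2, score for a1), so flip it
    return ((fwd[0] + rev[1]) / 2, (fwd[1] + rev[0]) / 2)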