
upload reference file for gpt-4-0125-preview as judge to mitigate wrong reference answers by gpt-4

Zhilin123 opened this pull request · 7 comments

Why are these changes needed?

We (two Applied Scientists with graduate degrees in Artificial Intelligence) recently manually vetted the reference answers for the default gpt-4 judge. To our surprise, we found that 13 of the 30 reference answers were wrong for at least one of the two turns.

These questions have ids of 104, 105, 109, 111, 113, 114, 120, 122, 124, 125, 126, 128 and 130.

These inaccurate reference answers cause the MT-Bench scores on the math, coding and reasoning categories to be misleading.

To solve this, we looked into using the more powerful, more recent gpt-4-0125-preview model as an alternate source of reference answers. Specifically, we generated answers with the model, verified them manually and repeated the process until we obtained correct answers. For some questions, such as 116, we tried as many as 50 times, but for most questions the model reached a correct answer within 5 tries.
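To make the process concrete, here is a minimal sketch of that regenerate-and-verify loop. It assumes the official `openai` Python client; the question text, sampling temperature and manual verification step are illustrative placeholders rather than the exact script we ran.

```python
# Minimal sketch of the regenerate-and-verify loop described above.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_reference_answer(question: str, max_tries: int = 50) -> str | None:
    """Resample gpt-4-0125-preview until a human marks an answer as correct."""
    for attempt in range(1, max_tries + 1):
        response = client.chat.completions.create(
            model="gpt-4-0125-preview",
            temperature=0.7,  # non-zero so retries can produce different answers
            messages=[{"role": "user", "content": question}],
        )
        answer = response.choices[0].message.content
        print(f"--- attempt {attempt} ---\n{answer}\n")
        if input("Is this answer correct? [y/N] ").strip().lower() == "y":
            return answer
    return None  # no verified answer within the retry budget
```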

Intuitively, having correct reference answers, as well as a stronger model as the judge, should allow the judge to determine the quality of a model's answers more accurately. To verify this empirically, we measured the gpt-4-0125-preview MT-Bench score for 10 popular models[^1] on the LMSYS leaderboard, as well as their regular MT-Bench score (measuring it ourselves if it was not already on the leaderboard). We then correlated the Chat Arena Elo score with the gpt-4-0125-preview MT-Bench score, and separately with the regular MT-Bench score. The detailed numbers are below.

When doing a linear regression between Chat Arena Elo and the gpt-4-0125-preview MT-Bench score, we find that R^2 is 0.819, while for Chat Arena Elo against the regular MT-Bench score it is 0.703.

Given this, we believe gpt-4-0125-preview would be a better default MT-Bench judge in terms of predicting a model's real-world performance (proxied by Chat Arena Elo). The auxiliary advantages of gpt-4-0125-preview relative to gpt-4 are:

  1. It typically costs around 40% of what regular MT-Bench costs (its input tokens cost 1/3 and its output tokens 1/2 of gpt-4's).
  2. Rate limits for gpt-4-0125-preview are typically 3-30 times those of gpt-4 (depending on usage tier), which allows --parallel to be set much higher without running into errors, saving the end user time.
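As a rough sanity check of the ~40% figure in point 1, the per-token price ratios combine as follows. The prices below are OpenAI's published per-1K-token prices at the time of writing, and the 60/40 input/output spend split is an illustrative assumption (judge prompts are input-heavy because they contain the question, the reference answer and the model's answer).

```python
# Back-of-the-envelope check of the "~40% of the cost" claim above.
GPT4_PRICE = {"input": 0.03, "output": 0.06}      # gpt-4, per 1K tokens
PREVIEW_PRICE = {"input": 0.01, "output": 0.03}   # gpt-4-0125-preview, per 1K tokens

def relative_cost(input_spend_fraction: float) -> float:
    """Judging cost with gpt-4-0125-preview as a fraction of the gpt-4 cost,
    given the fraction of the gpt-4 bill that goes to input tokens."""
    input_ratio = PREVIEW_PRICE["input"] / GPT4_PRICE["input"]     # 1/3
    output_ratio = PREVIEW_PRICE["output"] / GPT4_PRICE["output"]  # 1/2
    return input_spend_fraction * input_ratio + (1 - input_spend_fraction) * output_ratio

print(f"{relative_cost(0.6):.2f}")  # 0.40 when 60% of the gpt-4 spend is on input tokens
```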
| Model name | MT-Bench | gpt-4-0125-preview MT-Bench | Chat Arena Elo |
|---|---|---|---|
| GPT-4-1106 | 9.32 | 8.79 | 1251 |
| Claude 3 Opus (20240229) | 9.09 | 8.57 | 1247 |
| Claude 3 Sonnet (20240229) | 8.42 | 7.82 | 1190 |
| GPT-4-0314 | 8.96 | 7.96 | 1185 |
| Mixtral | 8.3 | 7.38 | 1114 |
| gpt-3.5-turbo-0613 | 8.39 | 7.37 | 1113 |
| Yi-34B | 7.49 | 6.46 | 1099 |
| gpt-3.5-turbo-0125 | 8.4 | 7.52 | 1096 |
| Llama 2 70B | 6.86 | 6.01 | 1082 |
| NV-Llama2-70B-SteerLM-Chat | 7.54 | 6.57 | 1076 |
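For transparency, the two R^2 values quoted above can be reproduced directly from this table. Below is a minimal sketch assuming numpy is available; for a simple linear regression with an intercept, R^2 equals the squared Pearson correlation.

```python
# Reproduces the reported R^2 values from the table above (y = Chat Arena Elo).
import numpy as np

elo = np.array([1251, 1247, 1190, 1185, 1114, 1113, 1099, 1096, 1082, 1076])
mt_bench = np.array([9.32, 9.09, 8.42, 8.96, 8.30, 8.39, 7.49, 8.40, 6.86, 7.54])
mt_bench_0125 = np.array([8.79, 8.57, 7.82, 7.96, 7.38, 7.37, 6.46, 7.52, 6.01, 6.57])

def r_squared(x: np.ndarray, y: np.ndarray) -> float:
    """R^2 of a simple linear regression of y on x (equal to corr(x, y)**2)."""
    return float(np.corrcoef(x, y)[0, 1] ** 2)

print(f"gpt-4-0125-preview MT-Bench vs. Elo: R^2 = {r_squared(mt_bench_0125, elo):.3f}")  # ~0.819
print(f"regular MT-Bench vs. Elo:            R^2 = {r_squared(mt_bench, elo):.3f}")       # ~0.703
```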

[^1]: We excluded the gpt-4-0613 and gpt-4-0125-preview models due to their role as judges, in order to minimize self-preference biases.

Related issue number (if applicable)

Closes #2988, #3053

Checks

  • [x] I've run format.sh to lint the changes in this PR.
  • [x] I've included any doc changes needed.
  • [x] I've made sure the relevant tests are passing (if applicable).

Zhilin123 avatar Mar 16 '24 00:03 Zhilin123

@Zhilin123 thanks for the PR and for generating these results, very helpful. We're working on fixing this too. Since changing the judge model affects results quite a lot, we are still reviewing it. We plan to open a new repository that contains this new version, MT-Bench v1.1, but will definitely take your change into account.

For more context on why there are errors in the reference answers: we explain it here.

infwinston avatar Mar 16 '24 01:03 infwinston

You're welcome! I do understand that each benchmark comes with its limitations, and your team's work on Vicuna Bench and then MT-Bench helped many people understand the general capabilities of models using automatic metrics (for some perspective for others following the thread: MT-Bench came out in June 2023, when GPT-4 was the strongest model). Looking forward to MT-Bench v1.1 - if there's an approximate timeline for its release, please let me know!

Zhilin123 avatar Mar 16 '24 01:03 Zhilin123

> We plan to open a new repository that contains this new version, MT-Bench v1.1, but will definitely take your change into account.

@infwinston any news / ETA about this? Is it related to https://github.com/lm-sys/arena-hard and if yes, is the plan to include (updated) MT-Bench questions as a quicker & cheaper "sub-benchmark" -- or is MTBench v1.1 going to be something completely different?

odelalleau avatar Apr 09 '24 22:04 odelalleau

@odelalleau thanks! Recently we've been working hard on a pipeline to generate our next-generation benchmark (the Arena-Hard you mentioned), which we believe offers significantly better separability than MT-Bench, so we'd recommend you try it out.

Here is the blog post draft. We plan to release it later this week. After that we'll add MT-Bench v1.1 to the new repo.

infwinston avatar Apr 09 '24 23:04 infwinston

Thanks @infwinston -- Arena-Hard definitely looks like a very interesting benchmark and I expect we will be using it for some of our models, but we (at NVIDIA) also believe there is still a need for a cheap & reliable (and static) benchmark like MT-Bench. There are many situations where one just wants a quick informative signal on model quality without eating too much of their OpenAI credits, and the improved MT-Bench from this PR is a great fit for this (a significant improvement over the older version, while being something like ~20x cheaper than Arena-Hard). MT-Bench also offers scores across a small range of categories, which is helpful to validate "at a glance" whether some technique / data is moving the needle in the expected direction (while from what I can tell of Arena-Hard, it seems like there are 250 categories with only 2 prompts in each, so we can't really analyze performance per category).

As a result, it'd be great to also release this improved MT-Bench to the community. And it'd be best that it comes from your team so as to avoid a situation where multiple third parties would push forward their own "improved" version (which has already started with https://github.com/InflectionAI/Inflection-Benchmarks -- not sure if there are more already). What do you think?

odelalleau avatar Apr 10 '24 02:04 odelalleau

Got it, yes, that makes sense, and sorry for the delay as there's a lot going on right now. We'll for sure merge this fix into v1.1; the judge model is a big change that we want to study more deeply. Appreciate your wait and support!

infwinston avatar Apr 10 '24 02:04 infwinston

Sounds great, thanks @infwinston!

odelalleau avatar Apr 10 '24 13:04 odelalleau