
How to generate reference answers in MT-Bench?

Open bofenghuang opened this issue 1 year ago • 6 comments

Hi!

Thank you for your excellent work on LLM evaluation! It has inspired me to create a French version of MT-Bench.

Currently, I'm in the process of generating reference answers for tasks in the math, reasoning, and coding categories. I would appreciate more details on your approach, as I've found myself a bit confused about which version of GPT-4 to use.

At the moment, I'm using gpt-4-0613 as the judge and am looking to evaluate gpt-4-0314 later. My understanding is that I can't use gpt-4-0314 to generate the reference answers, since the candidate answers from gpt-4-0314 would then be identical to them, leading to a consistent 10/10 score in all three categories.

I've considered using gpt-4-1106-preview, but this also means I can't evaluate gpt-4-1106-preview later. I would like to learn how you've dealt with similar cases and which version you would recommend. Thanks in advance!

bofenghuang avatar Dec 12 '23 13:12 bofenghuang

Just found the answer in the paper. Should have read it more carefully :)

Hence, we propose a reference-guided method, in which we first generate LLM judge’s answer independently, and then display it as a reference answer in the judge prompt.

However, I've also noticed a column called "reference" in the question file, which is also displayed as the "reference solution" in the demo. Wouldn't it be better to use these as reference answers for the LLM judge? They seem to be written by humans and, although shorter, appear to be of higher quality. Thanks in advance!
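For my own reproduction, I imagine the reference-guided judge prompt is assembled roughly like this (a minimal sketch only; the exact wording of FastChat's math judge template may differ):

```python
# Rough sketch of a reference-guided judge prompt (illustrative only; not the
# exact FastChat template).
JUDGE_TEMPLATE = (
    "[Question]\n{question}\n\n"
    "[The Start of Reference Answer]\n{reference}\n[The End of Reference Answer]\n\n"
    "[The Start of Assistant's Answer]\n{answer}\n[The End of Assistant's Answer]\n\n"
    "Compare the assistant's answer with the reference answer, then rate it "
    'on a scale of 1 to 10 in the format "Rating: [[5]]".'
)

def build_judge_prompt(question: str, reference: str, answer: str) -> str:
    # The reference answer is generated by the judge model ahead of time and
    # simply pasted into the prompt; the judge never re-derives it at scoring time.
    return JUDGE_TEMPLATE.format(question=question, reference=reference, answer=answer)
```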

bofenghuang avatar Dec 14 '23 10:12 bofenghuang

I am also confused about this. I might be missing something obvious but it seems to me that the reference answers are from here: https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/data/mt_bench/reference_answer/gpt-4.jsonl

But for some questions the answers are simply wrong. For instance, for question_id 111 the reference answer gives: {"question_id": 111, "answer_id": "NJPTPe2PZ7547UMekB9kBM", "model_id": "gpt-4", "choices": [{"index": 0, "turns": ["To find the area of the triangle, we can use the formula for the area of a triangle with coordinates A(x1, y1), B(x2, y2), and C(x3, y3):\n\nArea = (1/2) * |(x1(y2 - y3) + x2(y3 - y1) + x3(y1 - y2))|\n\nIn this case, the coordinates are A(0, 0), B(-1, 1), and C(3, 3). Plugging these values into the formula, we get:\n\nArea = (1/2) * |(0(1 - 3) + (-1)(3 - 0) + 3(0 - 1))|\n\nArea = (1/2) * |(-0 + 3 - 3)|\n\nArea = (1/2) * |0|\n\nArea = 0\n\nThe area of the triangle is 0. This means that the three points are collinear, and there is no triangle formed by these points.", "Since the three points are collinear and do not form a triangle, there is no circumscribed circle for this set of points. A circumscribed circle can only be formed around a triangle with non-collinear points."]}], "tstamp": 1686287200.2472498}

This reference solution is simply wrong, and it causes an actually correct model answer to be scored low. I must be missing something obvious here; any help would be appreciated :)
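As a quick sanity check (my own calculation, not from the repo), the shoelace formula gives an area of 3 for these points, so they are not collinear:

```python
# Shoelace formula for the triangle A(0, 0), B(-1, 1), C(3, 3)
x1, y1 = 0, 0
x2, y2 = -1, 1
x3, y3 = 3, 3

# |x1(y2 - y3) + x2(y3 - y1) + x3(y1 - y2)| / 2
area = abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2)) / 2
print(area)  # 3.0 -> the points are not collinear, so the reference answer above is incorrect
```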

huseyinatahaninan avatar Dec 14 '23 23:12 huseyinatahaninan

Hi @huseyinatahaninan

Sorry for the confusion. Let me try to clarify. In our paper we study a reference-guided judge, in which the LLM judge first generates a reference answer independently and then evaluates the model response against it.

However, one limitation of this approach is that the LLM judge itself may generate an incorrect response (as you can see in that particular example). We did not modify its answer because 1) it keeps the results reproducible and 2) we think it reflects a limitation of the GPT-4 judge. We hope to improve the judge by updating to the newer gpt-4-turbo, but we're still investigating the best way to do so.

infwinston avatar Dec 15 '23 06:12 infwinston

Hi @infwinston, many thanks for the clarification, I understand. Just an interesting note: when I downloaded the pre-generated data via python3 download_mt_bench_pregenerated.py, I noticed that gpt-4 actually produces more correct answers to the math questions there. Perhaps that gpt-4 was a different version from the judge's gpt-4, but just FYI.

huseyinatahaninan avatar Dec 15 '23 16:12 huseyinatahaninan

Hi, I want to ask a follow-up question about generating the reference answer. If I use gen_api_answer.py to generate the answer for a judge A and use the reference answer to evaluate the response from the same model A, does this mean the response to be evaluated will be identical to the reference answer?

d223302 avatar May 06 '24 02:05 d223302

Hi @d223302,

I'm not from LMSYS, but I'd like to share my findings after creating this issue.

If I use gen_api_answer.py to generate the answer for a judge A and use the reference answer to evaluate the response from the same model A, does this mean the response to be evaluated will be identical to the reference answer?

I believe it's better not to include the Judge LLM as a candidate for evaluation to avoid bias in these three categories that rely on reference answers (i.e., math, reasoning, and coding).

Errors in LLM-generated reference answers are a known issue, as discussed here and later in Inflection AI's repo.

In our French version, we've used human-generated reference answers to address this.
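For anyone curious, here's a minimal sketch of how such a human-written reference file can be assembled in the same JSONL schema as FastChat's reference_answer/gpt-4.jsonl (field names taken from the example above; the answers below are placeholders, and the output path is hypothetical):

```python
import json
import time
import uuid

# Hypothetical human-written reference answers, keyed by MT-Bench question_id.
# Both turns are placeholders; turn 2 assumes the follow-up asks about the
# circumscribed circle, as suggested by the example quoted earlier.
human_references = {
    111: [
        "The area of the triangle is 3 (the points are not collinear).",
        "The circle circumscribing the triangle has area 5*pi.",
    ],
}

with open("reference_answer/human.jsonl", "w", encoding="utf-8") as f:
    for question_id, turns in human_references.items():
        record = {
            "question_id": question_id,
            "answer_id": uuid.uuid4().hex,  # any unique string works here
            "model_id": "human",
            "choices": [{"index": 0, "turns": turns}],
            "tstamp": time.time(),
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```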

PS: LMSYS's new benchmark, Arena-Hard, is very interesting. It offers a larger dataset, higher agreement with Chatbot Arena, improved separability, and will be regularly updated.

bofenghuang avatar May 07 '24 23:05 bofenghuang

Hi, I want to ask a follow-up question about generating the reference answer. If I use gen_api_answer.py to generate the answer for a judge A and use the reference answer to evaluate the response from the same model A, does this mean the response to be evaluated will be identical to the reference answer?

Hi @d223302,

I think so. I compared the pre-generated responses from gpt-4 (https://huggingface.co/spaces/lmsys/mt-bench/resolve/main/data/mt_bench/model_answer/gpt-4.jsonl) with the official FastChat reference answers (https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/data/mt_bench/reference_answer/gpt-4.jsonl). Although the two are not identical, they are very similar, and I think the differences exist only because the responses were obtained from different API calls to OpenAI.
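For anyone who wants to repeat the comparison, here's a rough sketch assuming both JSONL files linked above have been downloaded locally (the file names are placeholders):

```python
import json

def load_turns(path):
    """Map question_id -> list of generated turns, following the JSONL schema shown earlier."""
    answers = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            answers[record["question_id"]] = record["choices"][0]["turns"]
    return answers

model_answers = load_turns("gpt-4_model_answer.jsonl")          # pre-generated gpt-4 responses
reference_answers = load_turns("gpt-4_reference_answer.jsonl")  # judge reference answers

for qid, ref_turns in sorted(reference_answers.items()):
    model_turns = model_answers.get(qid)
    if model_turns is None:
        continue
    same = all(m.strip() == r.strip() for m, r in zip(model_turns, ref_turns))
    print(f"question {qid}: {'identical' if same else 'differs'}")
```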

zxzhan avatar Oct 16 '24 06:10 zxzhan