LongBench

Evaluation Setup for Reasoning Models

Open YizhaoGao opened this issue 10 months ago • 4 comments

Thanks for the great work on this benchmark. However, I found some of the settings for running native CoT/reasoning models to be incorrect:

  1. Models without reasoning ability are run twice (1. think, 2. answer). However, for R1 or R1-distilled models this setup is problematic: in the second round they very likely start thinking again, because that is how the model was tuned. In that case, the model will not generate "The correct answer is xxx" within 128 tokens (see the sketch after this list).

  2. The thinking token limit (1024 tokens) is too short for some reasoning models.
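
To make the issue concrete, here is a rough sketch of the two-round flow I mean; the `generate` helper, prompts, and budgets are illustrative stand-ins, not the repository's actual code:

```python
# Minimal sketch of the two-round "w/ CoT" flow described above. This is NOT
# the repository's actual code: `generate` is a hypothetical helper wrapping
# an inference backend, and the prompts/budgets are illustrative only.
def two_round_eval(generate, question: str) -> str:
    # Round 1: elicit the chain of thought (1024-token thinking budget).
    cot_prompt = question + "\nLet's think step by step:"
    cot = generate(cot_prompt, max_new_tokens=1024)
    # Round 2: ask for the final answer within 128 tokens. An R1-style model
    # often opens a fresh <think> block here and never reaches
    # "The correct answer is ..." before the budget runs out.
    answer_prompt = cot_prompt + cot + "\nBased on the above, the correct answer is"
    return generate(answer_prompt, max_new_tokens=128)
```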

YizhaoGao avatar Feb 25 '25 02:02 YizhaoGao

Hi, for reasoning models such as OpenAI o1 and DeepSeek R1, the w/ CoT setting is not necessary, as these models automatically output their thinking process whether prompted or not. Nevertheless, we retain this evaluation setting to ensure consistency in results and facilitate comparison. The 1024 token limit is set for the answer, not the thinking process.
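
To make the budgeting concrete, here is a rough sketch of the intended split; the `generate` helper and its stop-sequence support are assumptions for illustration, not our actual implementation:

```python
# Minimal sketch of "the 1024-token limit applies to the answer, not the
# thinking". `generate` is a hypothetical helper; whether your backend
# supports a stop sequence like this is an assumption.
THINK_BUDGET = 32768   # generous budget so the reasoning trace can complete
ANSWER_BUDGET = 1024   # the limit discussed here applies to this phase only

def answer_with_separate_budgets(generate, prompt: str) -> str:
    # Phase 1: let the model think until it closes its reasoning block.
    trace = generate(prompt, max_new_tokens=THINK_BUDGET, stop=["</think>"])
    # Phase 2: resume after the closing tag, so only the final answer
    # counts against the 1024-token answer limit.
    return generate(prompt + trace + "</think>\n", max_new_tokens=ANSWER_BUDGET)
```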

bys0318 avatar Mar 03 '25 08:03 bys0318

@bys0318 I was trying to reproduce the results for DeepSeek-R1. May I know what value of max_new_tokens you used? The default of 128 cuts the response off while the model is still in its thinking phase, before it has produced the answer.
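
For illustration, here is a small sketch of the failure mode, assuming DeepSeek-R1's `<think>...</think>` output format (the helper is hypothetical, not LongBench code):

```python
# Illustrative sketch of the cutoff: with max_new_tokens=128 the decoded
# `response` usually ends inside the <think> block, so there is no closing
# tag and no answer to score. Tag names assume DeepSeek-R1's output format.
def extract_answer(response: str) -> str | None:
    if "</think>" not in response:
        return None  # truncated mid-thinking: nothing to score
    return response.split("</think>", 1)[1].strip()

assert extract_answer("<think>step 1... step 2...") is None  # cut off
assert extract_answer("<think>...</think>The answer is B") == "The answer is B"
```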

khalidsaifullaah avatar Jul 23 '25 16:07 khalidsaifullaah

> @bys0318 I was trying to reproduce the results for DeepSeek-R1. May I know what value of max_new_tokens you used? The default of 128 cuts the response off while the model is still in its thinking phase, before it has produced the answer.

Same problem.

Andy0422 avatar Sep 08 '25 13:09 Andy0422

> Hi, for reasoning models such as OpenAI o1 and DeepSeek R1, the w/ CoT setting is not necessary, as these models automatically output their thinking process whether prompted or not. Nevertheless, we retain this evaluation setting to ensure consistency in results and facilitate comparison. The 1024 token limit is set for the answer, not the thinking process.

@bys0318 Could you share the settings you used for the DeepSeek-R1 test? Thanks!

Andy0422 avatar Sep 08 '25 13:09 Andy0422