Evaluation Setup for Reasoning Models
Thanks for the great work on this benchmark. However, I found some of the settings used to run native CoT/reasoning models to be not quite correct:
- Models without reasoning ability are run twice (1. think, 2. answer). For R1 or R1-distilled models, however, this setup is confusing: it is very likely that in the second round they start thinking again, because that is how the model was tuned. In that case, the model will not generate "The correct answer is xxx" within 128 tokens (see the sketch below).
- The thinking token limit (1024 tokens) is too short for some reasoning models.
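To make the first point concrete, here is a rough sketch of the two-round w/ CoT flow as I understand it. The prompts, helper function, and model name are my own simplification for illustration, not the benchmark's actual code:

```python
# Illustrative sketch of the two-round "w/ CoT" evaluation flow, using
# Hugging Face transformers. Prompts, token budgets, and helper names are
# assumptions made for this example, not the benchmark's real implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # example reasoning model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

def generate(messages, max_new_tokens):
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True)

question = "..."  # placeholder: long-context passage + multiple-choice question

# Round 1: ask the model to think (1024-token CoT budget).
cot = generate(
    [{"role": "user", "content": question + "\n\nLet's think step by step."}],
    max_new_tokens=1024,
)

# Round 2: feed the CoT back and ask for the final choice with the default
# 128-token budget. R1-style models often re-enter a <think> block here and
# never reach "The correct answer is (X)" before the budget runs out.
answer = generate(
    [
        {"role": "user", "content": question},
        {"role": "assistant", "content": cot},
        {"role": "user", "content": "Based on the reasoning above, the correct answer is"},
    ],
    max_new_tokens=128,
)
print(answer)
```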
Hi, for reasoning models such as OpenAI o1 and DeepSeek R1, the w/ CoT setting is not necessary, as these models automatically output their thinking process whether prompted or not. Nevertheless, we retain this evaluation setting to ensure consistency in results and facilitate comparison. The 1024 token limit is set for the answer, not the thinking process.
@bys0318 I was trying to reproduce the results for DeepSeek-R1. May I know what value of `max_new_tokens` you used? The default of `128` cuts off the model's response while it is still in the thinking phase, before it has generated the answer.
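In case it helps others hitting the same cutoff, this is roughly the workaround I am experimenting with: raise the generation budget and only parse the text after the closing `</think>` tag. The 32768-token budget and the answer-extraction regex are my own guesses, not an official setting:

```python
# Rough workaround sketch: generate with a much larger budget, drop the
# <think>...</think> reasoning block, then extract the letter choice.
# Budget and regex are assumptions, not the benchmark's official settings.
import re

def extract_choice(response: str) -> str | None:
    # Keep only the text after the reasoning block R1 emits.
    visible = response.split("</think>")[-1]
    # Look for statements like "The correct answer is (A)".
    match = re.search(r"correct answer is\s*\(?([A-D])\)?", visible, re.IGNORECASE)
    return match.group(1).upper() if match else None

# e.g. output = model.generate(input_ids, max_new_tokens=32768, do_sample=False)
# choice = extract_choice(tokenizer.decode(output[0, input_ids.shape[1]:],
#                                          skip_special_tokens=True))
```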
same problem
@bys0318 Could you give us the settings you used for the DeepSeek-R1 test? Thanks!