w/ CoT mode for "thinking" models
Thank you for the timely updates to the leaderboard. I have a couple of questions about the w/ CoT column and was hoping for some clarification:
- I noticed that the w/ CoT mode is designed so that a CoT is generated first, followed by a second inference call that asks the model to answer based on that CoT. Could you explain the rationale behind this design?
- How does the w/ CoT mode work for "thinking" models, and how does it differ from the w/o CoT mode?
Thanks!
Hi, we follow the design of GPQA for both the w/o CoT and w/ CoT modes. In w/ CoT mode, we first ask the model to generate a chain-of-thought that derives the answer. A second stage then asks the model to output the answer directly, based on that chain-of-thought, which makes the answer easier to extract. For reasoning models such as o1 and R1, the w/ CoT setting is not strictly necessary, since these models already reason before answering whether or not they are prompted to. Nevertheless, we keep this setting for them so that results stay consistent and comparable across models.
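
For illustration, here is a minimal sketch of the two-stage flow described above. It is not the benchmark's actual code: the `Model` callable, the prompt wording, and the helper names (`format_question`, `extract_letter`, `answer_with_cot`, `answer_without_cot`) are all assumptions made for this example.

```python
import re
from typing import Callable, Optional, Sequence

# Hypothetical interface: any function that maps a prompt string to a completion string.
Model = Callable[[str], str]


def format_question(question: str, choices: Sequence[str]) -> str:
    """Render a multiple-choice question with lettered options (A), (B), ..."""
    lines = [question]
    for i, choice in enumerate(choices):
        lines.append(f"({chr(ord('A') + i)}) {choice}")
    return "\n".join(lines)


def extract_letter(text: str, num_choices: int) -> Optional[str]:
    """Return the first standalone option letter found in the text, or None."""
    last = chr(ord("A") + num_choices - 1)
    match = re.search(rf"\b([A-{last}])\b", text)
    return match.group(1) if match else None


def answer_without_cot(model: Model, question: str, choices: Sequence[str]) -> Optional[str]:
    """w/o CoT: a single call that asks for the answer letter directly."""
    prompt = format_question(question, choices) + "\n\nAnswer with a single letter."
    return extract_letter(model(prompt), len(choices))


def answer_with_cot(model: Model, question: str, choices: Sequence[str]) -> Optional[str]:
    """w/ CoT: one call for the reasoning, a second call for the final letter."""
    base = format_question(question, choices)
    # Stage 1: elicit free-form reasoning (assumed prompt wording).
    reasoning = model(base + "\n\nLet's think step by step.")
    # Stage 2: feed the reasoning back and ask only for the letter,
    # so extraction never has to parse the long reasoning text.
    stage2 = (
        f"{base}\n\nReasoning:\n{reasoning}\n\n"
        "Based on the reasoning above, answer with a single letter."
    )
    return extract_letter(model(stage2), len(choices))
```

The point of the second call is that the extractor only ever sees a short, constrained reply. For a reasoning model, both functions would still trigger its built-in reasoning, which is why the two modes are expected to behave similarly for such models.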