[BFCL] Incorrect score evaluation for qwq-32b on the leaderboard
The evaluation for Qwen/QwQ-32B (Prompt) (Novita) is incorrect: the model's responses still include the reasoning (`<think>...</think>`) content, which is not stripped before scoring.
Proposed Fix: Modify the `_parse_query_response_prompting` function to process the original `model_responses` as follows:

`clean_response = model_responses.rsplit("</think>", maxsplit=1)[-1].strip()`
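For illustration, here is a minimal, runnable sketch of the proposed cleanup. The helper name `strip_reasoning` is hypothetical; in BFCL the logic would live inside `_parse_query_response_prompting`, and the sketch assumes `model_responses` is the raw decoded string returned by the model:

```python
# Hypothetical helper illustrating the proposed fix; the actual change
# would go inside _parse_query_response_prompting in the model handler.
def strip_reasoning(model_responses: str) -> str:
    """Drop everything up to and including the last </think> tag so that
    only the model's final answer is scored.

    If no </think> tag is present, rsplit returns the whole string,
    so non-reasoning responses pass through unchanged.
    """
    return model_responses.rsplit("</think>", maxsplit=1)[-1].strip()


if __name__ == "__main__":
    raw = "<think>step-by-step reasoning tokens...</think>\n[get_weather(city='SF')]"
    print(strip_reasoning(raw))  # -> [get_weather(city='SF')]
```

Splitting on the *last* `</think>` (via `rsplit` with `maxsplit=1`) is deliberate: it stays robust even if the reasoning itself quotes a `</think>` tag, and it degrades gracefully to a no-op when the tag is absent.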
Hey @zhangyingerjelly,
Thanks for the issue.
As indicated by the name, Qwen/QwQ-32B is served through the Novita AI inference endpoint. I will follow up with @novita-viktor regarding their implementation.
Hey @HuanzhiMao, @zhangyingerjelly, and @novita-viktor, wondering if there's any resolution on this? We should ideally resolve it ASAP so that the most representative score is displayed on the leaderboard.