[BFCL] Incorrect score evaluation for qwq-32b on the leaderboard
The evaluation for Qwen/QwQ-32B (Prompt) (Novita) is incorrect: the model's responses still include the reasoning (`<think>...</think>`) content, which is not stripped before scoring.
Proposed Fix: Modify the `_parse_query_response_prompting` function to process the original `model_responses` as follows:

`clean_response = model_responses.rsplit("</think>", maxsplit=1)[-1].strip()`
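For illustration, here is a minimal, runnable sketch of the proposed cleanup. The helper name `strip_reasoning` is hypothetical; in BFCL the logic would live inside `_parse_query_response_prompting`, and the sketch assumes `model_responses` is the raw decoded string returned by the model:

```python
# Hypothetical helper illustrating the proposed fix; the actual change
# would go inside _parse_query_response_prompting in the model handler.
def strip_reasoning(model_responses: str) -> str:
    """Drop everything up to and including the last </think> tag so that
    only the model's final answer is scored.

    If no </think> tag is present, rsplit returns the whole string,
    so non-reasoning responses pass through unchanged.
    """
    return model_responses.rsplit("</think>", maxsplit=1)[-1].strip()


if __name__ == "__main__":
    raw = "<think>step-by-step reasoning tokens...</think>\n[get_weather(city='SF')]"
    print(strip_reasoning(raw))  # -> [get_weather(city='SF')]
```

Splitting on the *last* `</think>` (via `rsplit` with `maxsplit=1`) is deliberate: it stays robust even if the reasoning itself quotes a `</think>` tag, and it degrades gracefully to a no-op when the tag is absent.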
Hey @zhangyingerjelly,
Thanks for the issue.
As indicated by the name, Qwen/QwQ-32B is served through the Novita AI inference endpoint. I will follow up with @novita-viktor regarding their implementation.
Hey @HuanzhiMao, @zhangyingerjelly, and @novita-viktor, wondering if there's any resolution on this? We should ideally resolve it ASAP so that the most representative score is displayed on the leaderboard.