
[BFCL] Significant Accuracy Gap in multi_turn_base Category

Open GaoHuaZhang opened this issue 5 months ago • 4 comments

Describe the issue

The multi_turn_base category shows a significant accuracy discrepancy in my runs, and it is difficult to reproduce the leaderboard results. I would like to confirm the statistical method used to produce the reported scores.


What is the issue

I started a local vLLM service using the officially provided Qwen3-32B weights, and added the model configuration in model_config.py as follows:

```python
"qwen-32B-FC": ModelConfig(
    model_name="qwen3-32B-FC",
    display_name="qwen3-32B (FC)",
    url="https://openai.com/index/introducing-gpt-4-5/",
    org="OpenAI",
    license="Proprietary",
    model_handler=OpenAIHandler,
    input_price=None,
    output_price=None,
    is_fc_model=True,
    underscore_to_dot=True,
),
```

Then I modified the client initialization for testing:
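Roughly, this amounts to pointing the handler's OpenAI-compatible client at the local vLLM server. A minimal sketch of that kind of change (the base URL and API key below are placeholders, not the exact values from my setup):

```python
# Sketch only: an OpenAI-compatible client pointed at a locally served vLLM endpoint.
# The base URL and API key are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # hypothetical local vLLM address
    api_key="EMPTY",                      # vLLM does not validate the key by default
)
```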

I used the following BFCL command for testing:

```bash
bfcl generate --model qwen-32B-FC --test-category multi_turn_base --num-threads 100
```

Out of 20 test runs, only one achieved an accuracy of 0.56. Most results were below 0.5. Here are the accuracy outputs:

```
score_1:  {"accuracy": 0.49,  "correct_count": 98,  "total_count": 200}
score_2:  {"accuracy": 0.445, "correct_count": 89,  "total_count": 200}
score_3:  {"accuracy": 0.485, "correct_count": 97,  "total_count": 200}
score_4:  {"accuracy": 0.56,  "correct_count": 112, "total_count": 200}
score_5:  {"accuracy": 0.495, "correct_count": 99,  "total_count": 200}
score_6:  {"accuracy": 0.485, "correct_count": 97,  "total_count": 200}
score_7:  {"accuracy": 0.51,  "correct_count": 102, "total_count": 200}
score_8:  {"accuracy": 0.5,   "correct_count": 100, "total_count": 200}
score_9:  {"accuracy": 0.515, "correct_count": 103, "total_count": 200}
score_10: {"accuracy": 0.495, "correct_count": 99,  "total_count": 200}
score_11: {"accuracy": 0.455, "correct_count": 91,  "total_count": 200}
score_12: {"accuracy": 0.45,  "correct_count": 90,  "total_count": 200}
score_13: {"accuracy": 0.49,  "correct_count": 98,  "total_count": 200}
score_14: {"accuracy": 0.49,  "correct_count": 98,  "total_count": 200}
score_15: {"accuracy": 0.495, "correct_count": 99,  "total_count": 200}
score_16: {"accuracy": 0.49,  "correct_count": 98,  "total_count": 200}
score_17: {"accuracy": 0.465, "correct_count": 93,  "total_count": 200}
score_18: {"accuracy": 0.48,  "correct_count": 96,  "total_count": 200}
score_19: {"accuracy": 0.475, "correct_count": 95,  "total_count": 200}
score_20: {"accuracy": 0.52,  "correct_count": 104, "total_count": 200}
```

I would like to ask: how are the leaderboard results for this task computed? Is the reported accuracy based on a single run (pass@1), or is it the maximum accuracy selected from multiple runs?
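For reference, a quick sketch that summarizes the spread across these 20 runs, using the accuracy values copied from the score files above:

```python
# Summary of the 20 multi_turn_base runs listed above (score_1 .. score_20).
from statistics import mean, pstdev

accuracies = [
    0.490, 0.445, 0.485, 0.560, 0.495, 0.485, 0.510, 0.500, 0.515, 0.495,
    0.455, 0.450, 0.490, 0.490, 0.495, 0.490, 0.465, 0.480, 0.475, 0.520,
]

print(f"mean={mean(accuracies):.3f}  std={pstdev(accuracies):.3f}  max={max(accuracies):.3f}")
# -> roughly: mean ≈ 0.49, std ≈ 0.025, max = 0.56
```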

GaoHuaZhang avatar Aug 06 '25 01:08 GaoHuaZhang

I've been experiencing this same issue recently. I set up a local vLLM service running Qwen3-32B and conducted multiple evaluations on the multi_turn category, but my scores show a significant discrepancy from the official leaderboard results.

To investigate further, I evaluated the official responses from this link, but consistently obtained 44.5 instead of the reported 54.5. Suspecting a version difference, I updated to the latest BFCL version by cloning the repo and creating a fresh conda environment; however, the official responses still score 45.

I'm not sure whether this is due to the evaluation code or to the response generation itself.

Xufeng-Zhan avatar Aug 06 '25 08:08 Xufeng-Zhan

> I've been experiencing this same issue recently. I set up a local vLLM service running Qwen3-32B and conducted multiple evaluations on the multi_turn category, but my scores show a significant discrepancy from the official leaderboard results.
>
> To investigate further, I evaluated the official responses from this link, but consistently obtained 44.5 instead of the reported 54.5. Suspecting a version difference, I updated to the latest BFCL version by cloning the repo and creating a fresh conda environment; however, the official responses still score 45.
>
> I'm not sure whether this is due to the evaluation code or to the response generation itself.

How many times did you run the experiment in total? Looking at the data I collected, small differences across the dialogue turns accumulate, leading to significant variation in the final results. Every run produced a different outcome, even with the seed enabled.

GaoHuaZhang avatar Aug 07 '25 01:08 GaoHuaZhang

I believe our issues are different; I am using api_inference, and the model itself supports returning tool_calls.

GaoHuaZhang avatar Aug 07 '25 08:08 GaoHuaZhang

@GaoHuaZhang ,

Have you confirmed how the scores on the leaderboard were computed?

Currently in model_config.py I can find two Qwen3-32B entries, one using QwenAPIHandler and the other using QwenFCHandler.

Also, you are using OpenAIHandler to process this Qwen3 model's output. To match the leaderboard result, I feel you may have to derive from QwenAPIHandler instead:

    "qwen3-32b-FC": ModelConfig(
        model_name="qwen3-32b-FC",
        display_name="Qwen3-32B (FC)",
        url="https://huggingface.co/Qwen/Qwen3-32B",
        org="Qwen",
        license="apache-2.0",
        model_handler=QwenAPIHandler,
        input_price=None,
        output_price=None,
        is_fc_model=True,
        underscore_to_dot=True,
    ),

    "Qwen/Qwen3-32B-FC": ModelConfig(
        model_name="Qwen/Qwen3-32B-FC",
        display_name="Qwen3-32B (FC)",
        url="https://huggingface.co/Qwen/Qwen3-32B",
        org="Qwen",
        license="apache-2.0",
        model_handler=QwenFCHandler,
        input_price=None,
        output_price=None,
        is_fc_model=True,
        underscore_to_dot=False,
    ),


leocnj avatar Sep 11 '25 19:09 leocnj