
[BFCL] Significant Accuracy Gap in multi_turn_base Category

Open GaoHuaZhang opened this issue 5 months ago • 4 comments

Describe the issue

The multi_turn_base category shows a significant accuracy discrepancy in my runs, and it is difficult to reproduce the leaderboard results. I would like to confirm the statistical method used to produce the reported scores.


What is the issue

I started a local vLLM service using the officially provided Qwen3-32B weights, and added the model configuration in model_config.py as follows:

```python
"qwen-32B-FC": ModelConfig(
    model_name="qwen3-32B-FC",
    display_name="qwen3-32B (FC)",
    url="https://openai.com/index/introducing-gpt-4-5/",
    org="OpenAI",
    license="Proprietary",
    model_handler=OpenAIHandler,
    input_price=None,
    output_price=None,
    is_fc_model=True,
    underscore_to_dot=True,
),
```

Then I modified the client initialization for testing:
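Roughly, this amounts to pointing the handler's OpenAI-compatible client at the local vLLM server. A minimal sketch of that kind of change (the base URL and API key below are placeholders, not the exact values from my setup):

```python
# Sketch only: an OpenAI-compatible client pointed at a locally served vLLM endpoint.
# The base URL and API key are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # hypothetical local vLLM address
    api_key="EMPTY",                      # vLLM does not validate the key by default
)
```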

I used the following BFCL command for testing:

```bash
bfcl generate --model qwen-32B-FC --test-category multi_turn_base --num-threads 100
```

Out of 20 test runs, only one achieved an accuracy of 0.56. Most results were below 0.5. Here are the accuracy outputs:

```
score_1:  {"accuracy": 0.49,  "correct_count": 98,  "total_count": 200}
score_2:  {"accuracy": 0.445, "correct_count": 89,  "total_count": 200}
score_3:  {"accuracy": 0.485, "correct_count": 97,  "total_count": 200}
score_4:  {"accuracy": 0.56,  "correct_count": 112, "total_count": 200}
score_5:  {"accuracy": 0.495, "correct_count": 99,  "total_count": 200}
score_6:  {"accuracy": 0.485, "correct_count": 97,  "total_count": 200}
score_7:  {"accuracy": 0.51,  "correct_count": 102, "total_count": 200}
score_8:  {"accuracy": 0.5,   "correct_count": 100, "total_count": 200}
score_9:  {"accuracy": 0.515, "correct_count": 103, "total_count": 200}
score_10: {"accuracy": 0.495, "correct_count": 99,  "total_count": 200}
score_11: {"accuracy": 0.455, "correct_count": 91,  "total_count": 200}
score_12: {"accuracy": 0.45,  "correct_count": 90,  "total_count": 200}
score_13: {"accuracy": 0.49,  "correct_count": 98,  "total_count": 200}
score_14: {"accuracy": 0.49,  "correct_count": 98,  "total_count": 200}
score_15: {"accuracy": 0.495, "correct_count": 99,  "total_count": 200}
score_16: {"accuracy": 0.49,  "correct_count": 98,  "total_count": 200}
score_17: {"accuracy": 0.465, "correct_count": 93,  "total_count": 200}
score_18: {"accuracy": 0.48,  "correct_count": 96,  "total_count": 200}
score_19: {"accuracy": 0.475, "correct_count": 95,  "total_count": 200}
score_20: {"accuracy": 0.52,  "correct_count": 104, "total_count": 200}
```

I would like to ask: how are the leaderboard results for this task computed? Is the reported accuracy based on a single run (pass@1), or is it the maximum accuracy selected from multiple runs?
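For reference, a quick sketch that summarizes the spread across these 20 runs, using the accuracy values copied from the score files above:

```python
# Summary of the 20 multi_turn_base runs listed above (score_1 .. score_20).
from statistics import mean, pstdev

accuracies = [
    0.490, 0.445, 0.485, 0.560, 0.495, 0.485, 0.510, 0.500, 0.515, 0.495,
    0.455, 0.450, 0.490, 0.490, 0.495, 0.490, 0.465, 0.480, 0.475, 0.520,
]

print(f"mean={mean(accuracies):.3f}  std={pstdev(accuracies):.3f}  max={max(accuracies):.3f}")
# -> roughly: mean ≈ 0.49, std ≈ 0.025, max = 0.56
```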

GaoHuaZhang avatar Aug 06 '25 01:08 GaoHuaZhang

I've been experiencing this same issue recently. I set up a local vLLM service running Qwen3-32B and conducted multiple evaluations on the multi_turn category, but my scores show a significant discrepancy from the official leaderboard results.

To investigate further, I evaluated the official responses from this link, but consistently obtained 44.5 instead of the reported 54.5. Suspecting a version difference, I updated to the latest BFCL version by cloning the repo and creating a fresh conda environment; however, the official responses still score 45.

I'm not sure whether this is due to the evaluation code or to the response generation itself.

Xufeng-Zhan avatar Aug 06 '25 08:08 Xufeng-Zhan

> I've been experiencing this same issue recently. I set up a local vLLM service running Qwen3-32B and conducted multiple evaluations on the multi_turn category, but my scores show a significant discrepancy from the official leaderboard results.
>
> To investigate further, I evaluated the official responses from this link, but consistently obtained 44.5 instead of the reported 54.5. Suspecting a version difference, I updated to the latest BFCL version by cloning the repo and creating a fresh conda environment; however, the official responses still score 45.
>
> I'm not sure whether this is due to the evaluation code or to the response generation itself.

How many times did you run the experiment in total? Looking at the data I collected, small differences across the dialogue turns accumulate, leading to significant variation in the final results. Every run produced a different outcome, even with the seed enabled.

GaoHuaZhang avatar Aug 07 '25 01:08 GaoHuaZhang

I believe our issues are different; I am using api_inference, and the model itself supports returning tool_calls.

GaoHuaZhang avatar Aug 07 '25 08:08 GaoHuaZhang

@GaoHuaZhang ,

Have you confirmed how the scores on the leaderboard were computed?

Currently in model_config.py I can find two Qwen3-32B entries, one using QwenAPIHandler and the other using QwenFCHandler.

Also, you are using OpenAIHandler to process this Qwen3 model's output. To match the leaderboard result, I feel you may have to derive from QwenAPIHandler instead:

    "qwen3-32b-FC": ModelConfig(
        model_name="qwen3-32b-FC",
        display_name="Qwen3-32B (FC)",
        url="https://huggingface.co/Qwen/Qwen3-32B",
        org="Qwen",
        license="apache-2.0",
        model_handler=QwenAPIHandler,
        input_price=None,
        output_price=None,
        is_fc_model=True,
        underscore_to_dot=True,
    ),

    "Qwen/Qwen3-32B-FC": ModelConfig(
        model_name="Qwen/Qwen3-32B-FC",
        display_name="Qwen3-32B (FC)",
        url="https://huggingface.co/Qwen/Qwen3-32B",
        org="Qwen",
        license="apache-2.0",
        model_handler=QwenFCHandler,
        input_price=None,
        output_price=None,
        is_fc_model=True,
        underscore_to_dot=False,
    ),


leocnj avatar Sep 11 '25 19:09 leocnj