
[BFCL] Is there support for running BFCL evaluation with GPT-OSS?

Open Ofir408 opened this issue 4 months ago • 8 comments

Hi, Is there support for running BFCL evaluation with GPT-OSS? For example, https://huggingface.co/openai/gpt-oss-120b

Ofir408 avatar Aug 06 '25 08:08 Ofir408

Seconding this for both 120B and 20B, since their instruction-following is good but coding is usually subpar

BradKML avatar Aug 07 '25 07:08 BradKML

> Seconding this for both 120B and 20B, since their instruction-following is good but coding is usually subpar

Hi, GPT-OSS shows interesting performance on an alternative function-calling benchmark (Tau-Bench), so I think it would be valuable to add BFCL support for this model. Would that be possible? Thanks so much!

Ofir408 avatar Aug 12 '25 06:08 Ofir408

This should be pretty straightforward, assuming it is supported by the inference providers. @HuanzhiMao we should include it in the v4 update. Thank you for raising it @Ofir408 and @BradKML !

ShishirPatil avatar Aug 21 '25 07:08 ShishirPatil

@ShishirPatil if you could add a BigCodeBench-style visualization for price or parameter count, that would be good as well. People should be able to pick models with the most accurate behavior and/or the least hallucination.

BradKML avatar Aug 21 '25 07:08 BradKML

Hi @Ofir408 @BradKML, thanks for the issue. We're currently blocked by an upstream issue in vLLM: their GPT-OSS integration is returning 500s even on simple queries. The vLLM team has shared an ETA of ~2 weeks for a fix.

HuanzhiMao avatar Sep 06 '25 22:09 HuanzhiMao

Thanks for the notice. Also, do you know why neither BigCodeBench nor LiveBench seems to have been updated in the last quarter? Something feels completely off in the benchmarking world.

BradKML avatar Sep 07 '25 05:09 BradKML

The vLLM team are still working on the fix: https://github.com/vllm-project/vllm/issues/23292

HuanzhiMao avatar Sep 19 '25 00:09 HuanzhiMao

Note that I'm running BFCL against vLLM locally as I exercise and improve the gpt-oss implementation across different parts of vLLM. While I'm not in a position to share results publicly, it is possible to run this today. I'm focused on cleaning up the Chat Completions side of vLLM first, and the steps below show how to run that against a local vLLM instance. I'm using vLLM builds from main, since they include some fixes for parsed tool-call content. If you run against v0.10.2 or older, this may still work, but scores will likely be much lower than what the model is actually capable of until a new vLLM release lands.


# Start a vLLM running gpt-oss-120b setup for tool calling
# adjust tensor-parallel-size based on your GPU setup as needed
vllm serve openai/gpt-oss-120b \
  --async-scheduling \
  --tensor-parallel-size 4 \
  --enable-auto-tool-choice \
  --tool-call-parser openai
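
# (Optional) Sanity-check that the server is up and the model is registered
# before running BFCL. /v1/models is the OpenAI-compatible model listing
# endpoint that vLLM serves; adjust host/port if you changed the defaults.
curl -s http://localhost:8000/v1/models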

# Adjust the model_config to add gpt-oss-120b setup for Function Calling
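# Note: model_config.py already defines MODEL_CONFIG_MAPPING, so appending a
# second assignment like this re-binds the name and shadows the built-in
# mapping (only this entry will show up in `bfcl models`). If you want to keep
# the other models, add the entry to the existing dict in the file instead.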
cat <<EOF >> bfcl_eval/constants/model_config.py
MODEL_CONFIG_MAPPING = {
    "openai/gpt-oss-120b": ModelConfig(
        model_name="openai/gpt-oss-120b",
        display_name="openai/gpt-oss-120b (FC) (vLLM)",
        url="https://huggingface.co/openai/gpt-oss-120b",
        org="OpenAI",
        license="apache-2.0",
        model_handler=OpenAICompletionsHandler,
        input_price=None,
        output_price=None,
        is_fc_model=True,
        underscore_to_dot=True,
    ),
}
EOF

# Verify gpt-oss-120b is the model listed
bfcl models

# Point the OpenAICompletionsHandler at your vLLM
export OPENAI_BASE_URL="http://localhost:8000/v1"
export OPENAI_API_KEY="fake"
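# Any placeholder key works here: vLLM only enforces an API key when the
# server is started with --api-key, so "fake" satisfies the OpenAI client.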

# Run one or more BFCL categories against your local vLLM
bfcl generate \
  --model openai/gpt-oss-120b \
  --test-category live_simple \
  --num-threads 8 \
  --allow-overwrite

# See your results
bfcl results

# Evaluate your results
bfcl evaluate

bbrowning avatar Sep 24 '25 18:09 bbrowning

@HuanzhiMao I'm currently running GPT-OSS-120b in production with vLLM, and tool calls work well in the latest vLLM release.

I was going to benchmark this model with bfcl-eval, but I get an error that the model is not supported.

Is there a way to bypass this check and run the benchmark anyway?

Mikeriess avatar Nov 17 '25 14:11 Mikeriess

@Mikeriess did you add this to model_config.py as suggested above?

MODEL_CONFIG_MAPPING = {
    "openai/gpt-oss-120b": ModelConfig(
        model_name="openai/gpt-oss-120b",
        display_name="openai/gpt-oss-120b (FC) (vLLM)",
        url="https://huggingface.co/openai/gpt-oss-120b",
        org="OpenAI",
        license="apache-2.0",
        model_handler=OpenAICompletionsHandler,
        input_price=None,
        output_price=None,
        is_fc_model=True,
        underscore_to_dot=True,
    ),
}

hanseungwook avatar Nov 25 '25 00:11 hanseungwook