ag2 ReasoningAgent benchmarking with SimpleBench

Why are these changes needed?

This is a draft PR for running SimpleBench with ReasoningAgent; it is not meant to be merged. Source: https://simple-bench.com/

The benchmark result on the sample data (10 prompts) with gpt-4o-mini is 20%.
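For reference, a minimal sketch of how a score like this can be computed, assuming each prompt has a single-letter answer key and the agent's final answer has already been reduced to one letter. The `accuracy` helper and the sample letters below are hypothetical and are not taken from the PR or from the SimpleBench data:

```python
def accuracy(predictions, answers):
    """Fraction of predictions that match the answer key (case-insensitive)."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one prompt is required")
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


# Hypothetical answer key and agent outputs for 10 prompts (illustrative only).
answers = ["A", "B", "C", "D", "E", "F", "A", "B", "C", "D"]
predictions = ["A", "B", "A", "A", "A", "A", "B", "C", "D", "E"]

score = accuracy(predictions, answers)
print(f"{score:.0%}")  # 2 of 10 correct
```

With only 10 prompts, each prompt moves the score by 10 percentage points, so 20% corresponds to 2 correct answers.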

Related issue number

Checks

[ ] I've included any doc changes needed for https://docs.ag2.ai/. See https://docs.ag2.ai/docs/contributor-guide/documentation to build and test documentation locally.
[ ] I've added tests (if relevant) corresponding to the changes introduced in this PR.
[ ] I've made sure all auto checks have passed.

Dec 26 '24 14:12 Hk669

Thanks. How about adding the test into the contrib-openai CI?

Dec 26 '24 17:12 sonichi

> Thanks. How about adding the test into the contrib-openai CI?

Can you please clarify whether this is for the ReasoningAgent or for the benchmark? FYI: the CI tests for the ReasoningAgent are in progress in the PR https://github.com/ag2ai/ag2/pull/294

Jan 01 '25 10:01 Hk669

> > Thanks. How about adding the test into the contrib-openai CI?
>
> Can you please clarify whether this is for the ReasoningAgent or for the benchmark? FYI: the CI tests for the ReasoningAgent are in progress in the PR #294

I mean, we can add a SimpleBench performance check as an optional CI job for the ReasoningAgent. It's only triggered when necessary and requires approval.
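A minimal sketch of the opt-in part on the test side, assuming a hypothetical `RUN_SIMPLEBENCH` environment variable that the CI workflow would set only for approved runs. The approval step itself would live in the CI configuration and is not shown; the names below are illustrative, not part of the ag2 codebase:

```python
import os
import unittest


def simplebench_enabled(env=None):
    """Return True only when the (hypothetical) RUN_SIMPLEBENCH flag is set."""
    env = os.environ if env is None else env
    return env.get("RUN_SIMPLEBENCH", "").strip().lower() in {"1", "true", "yes"}


class SimpleBenchCheck(unittest.TestCase):
    # Skipped by default, so regular CI runs do not incur any API cost.
    @unittest.skipUnless(simplebench_enabled(), "set RUN_SIMPLEBENCH=1 to run")
    def test_simplebench_score(self):
        # Placeholder: the actual benchmark run and score threshold are not shown.
        self.skipTest("benchmark runner not wired into this sketch")
```

Keeping the check behind an explicit flag means the benchmark never runs unless someone asks for it.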

Jan 01 '25 20:01 sonichi

I've tested with Anthropic, Gemini, and DeepSeek, and committed a summary file.

See here.

The strongest results are from Anthropic (and Anthropic's chat UI scored the highest).

Jan 01 '25 23:01 marklysze

> I've tested with Anthropic, Gemini, and DeepSeek, and committed a summary file.
>
> See here.
>
> The strongest results are from Anthropic (and Anthropic's chat UI scored the highest).

The numbers are really interesting. I was thinking: let's not include the chat UI results in the ReasoningAgent benchmarking, and just restrict the results to the ReasoningAgent?

cc @marklysze @sonichi @BabyCNM

Update: sorry, I misunderstood the results. I think the comparison looks amazing.

Jan 02 '25 04:01 Hk669

> I mean, we can add a SimpleBench performance check as an optional CI job for the ReasoningAgent. It's only triggered when necessary and requires approval.

Sounds great, let me add the optional CI.

Jan 02 '25 05:01 Hk669

> Thanks. How about adding the test into the contrib-openai CI?

Added. Please let me know if I missed anything. Thanks! cc @sonichi

Jan 02 '25 17:01 Hk669

> > Thanks. How about adding the test into the contrib-openai CI?
>
> Added. Please let me know if I missed anything. Thanks! cc @sonichi

It's better than before. An even better approach is to make a separate workflow so that it's not bundled with other contrib-openai tests. @marklysze @BabyCNM @qingyun-wu what do you think is a good balance between convenience and cost control?

Jan 02 '25 17:01 sonichi

@Hk669 What is the status with this PR?

Feb 12 '25 20:02 davorrunje

This is just an experimental PR for anyone who wants to run SimpleBench on any agent.

It should be a good starting point for the benchmark.

Feb 13 '25 00:02 Hk669

All committers have signed the CLA.

Feb 26 '25 20:02 CLAassistant

@Hk669 / @BabyCNM can we close this?

Sep 16 '25 19:09 marklysze