ReasoningAgent benchmarking with SimpleBench
Why are these changes needed?
This is a draft PR for running SimpleBench with ReasoningAgent; it is not meant to be merged. Source: https://simple-bench.com/
On the sample data (10 prompts), the benchmark result with gpt-4o-mini is 20%.
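For readers who want to see how a figure like 20% on 10 prompts comes about, here is a minimal scoring sketch. It assumes each prompt has a single multiple-choice letter as its reference answer and that the agent's final letter has already been extracted; the `accuracy` helper and the letters below are hypothetical and are not taken from this PR's code or results.

```python
def accuracy(predictions, answers):
    """Return the fraction of predictions that match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("no questions to score")
    # Compare case-insensitively and ignore surrounding whitespace.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


# Made-up letters for illustration only: 2 of the 10 predictions match.
answers = ["A", "B", "C", "D", "E", "F", "A", "B", "C", "D"]
predictions = ["A", "C", "C", "A", "B", "A", "B", "C", "D", "E"]

print(f"{accuracy(predictions, answers):.0%}")  # prints "20%"
```

With only 10 prompts, each question moves the score by 10 percentage points, so a result at this sample size is a rough signal at best.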
Related issue number
Checks
- [ ] I've included any doc changes needed for https://docs.ag2.ai/. See https://docs.ag2.ai/docs/contributor-guide/documentation to build and test documentation locally.
- [ ] I've added tests (if relevant) corresponding to the changes introduced in this PR.
- [ ] I've made sure all auto checks have passed.
Thanks. How about adding the test into the contrib-openai CI?
Can you please mention whether it is for the ReasoningAgent or for the benchmark? FYI: the CI tests for the ReasoningAgent are in progress in PR https://github.com/ag2ai/ag2/pull/294
I mean, we can add a SimpleBench performance check as an optional CI job for ReasoningAgent. It would only be triggered when necessary and would require approval.
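One way such an opt-in check could look on the Python side is sketched below. The `RUN_SIMPLEBENCH` environment variable, the helper names, and the threshold are all hypothetical and are not part of this PR or of the existing CI. The approval requirement would be configured in the CI system itself and is not shown here.

```python
import os


def should_run(env=None):
    """Return True only when the (hypothetical) opt-in flag is set."""
    env = os.environ if env is None else env
    return env.get("RUN_SIMPLEBENCH", "").lower() in {"1", "true", "yes"}


def check_accuracy(accuracy, minimum):
    """Raise AssertionError if the measured accuracy is below the minimum."""
    if accuracy < minimum:
        raise AssertionError(
            f"accuracy {accuracy:.0%} is below the required {minimum:.0%}"
        )


if should_run():
    # Placeholder value; a real job would run the benchmark and pass
    # the measured accuracy here.
    check_accuracy(accuracy=0.2, minimum=0.1)
else:
    print("SimpleBench check skipped (RUN_SIMPLEBENCH not set)")
```

Keeping the check behind an explicit flag means ordinary CI runs do not incur model API costs.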
I've tested with Anthropic, Gemini, and DeepSeek, and committed a summary file.
The strongest results are from Anthropic (and Anthropic's chat UI scored the highest).
The numbers are really interesting. I was thinking, let's not include the chat UI results in the ReasoningAgent benchmarking; let's restrict the results to only the ReasoningAgent?
cc @marklysze @sonichi @BabyCNM
Update: sorry, I misunderstood the results. I think the comparison looks amazing.
sounds great, let me add the optional CI.
Added. Please let me know if I missed anything. Thanks! cc @sonichi
It's better than before. An even better approach is to make a separate workflow so that it's not bundled with other contrib-openai tests. @marklysze @BabyCNM @qingyun-wu what do you think is a good balance between convenience and cost control?
@Hk669 What is the status with this PR?
This is just an experimental PR for anyone who wants to run SimpleBench on any agent.
It should be a good starting point for the benchmark.
@Hk669 / @BabyCNM can we close this?