ReasoningAgent benchmarking with SimpleBench

Open Hk669 opened this issue 1 year ago • 12 comments

Why are these changes needed?

A draft PR for running SimpleBench with ReasoningAgent. This PR is not meant to be merged. Source: https://simple-bench.com/

The benchmark result on the sample data (10 prompts) with gpt-4o-mini is 20%.
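For readers who want to reproduce a figure like this, here is a minimal sketch of how a multiple-choice score could be computed from model replies. It is not the script from this PR: the helper names `extract_choice` and `accuracy`, the `Final Answer: X` reply convention, and the A to F option range are all assumptions made for illustration.

```python
import re
from typing import List, Optional


def extract_choice(reply: str) -> Optional[str]:
    """Return the letter from the last 'Final Answer: X' in a reply, or None.

    The 'Final Answer: X' format and the A-F range are assumptions,
    not something specified in this PR.
    """
    matches = re.findall(r"Final Answer:\s*([A-F])", reply, flags=re.IGNORECASE)
    return matches[-1].upper() if matches else None


def accuracy(replies: List[str], answers: List[str]) -> float:
    """Fraction of replies whose extracted choice equals the expected letter."""
    if len(replies) != len(answers):
        raise ValueError("replies and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(
        extract_choice(reply) == answer.upper()
        for reply, answer in zip(replies, answers)
    )
    return correct / len(answers)


if __name__ == "__main__":
    # Hypothetical data: 10 prompts, 2 answered correctly.
    replies = ["Final Answer: A"] * 2 + ["Final Answer: B"] * 8
    answers = ["A"] * 10
    print(f"accuracy = {accuracy(replies, answers):.0%}")
```

A reply with no recognizable answer counts as wrong here, which keeps the denominator fixed at the number of prompts.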

Related issue number

Checks

  • [ ] I've included any doc changes needed for https://docs.ag2.ai/. See https://docs.ag2.ai/docs/contributor-guide/documentation to build and test documentation locally.
  • [ ] I've added tests (if relevant) corresponding to the changes introduced in this PR.
  • [ ] I've made sure all auto checks have passed.

Hk669 avatar Dec 26 '24 14:12 Hk669

Thanks. How about adding the test into the contrib-openai CI?

sonichi avatar Dec 26 '24 17:12 sonichi

> Thanks. How about adding the test into the contrib-openai CI?

Can you please clarify whether this is for the ReasoningAgent or for the benchmark? FYI: the CI tests for the ReasoningAgent are in progress in PR https://github.com/ag2ai/ag2/pull/294

Hk669 avatar Jan 01 '25 10:01 Hk669

> > Thanks. How about adding the test into the contrib-openai CI?
>
> Can you please clarify whether this is for the ReasoningAgent or for the benchmark? FYI: the CI tests for the ReasoningAgent are in progress in PR #294

I mean, we can add simplebench performance check as an optional CI for reasoning agent. It's only triggered when necessary and requires approval.

sonichi avatar Jan 01 '25 20:01 sonichi

I've tested with Anthropic, Gemini, DeepSeek, committed a summary file.

See here.

The strongest results are from Anthropic (and Anthropic's chat UI scored the highest).

marklysze avatar Jan 01 '25 23:01 marklysze

> I've tested with Anthropic, Gemini, DeepSeek, committed a summary file.
>
> See here.
>
> The strongest results are from Anthropic (and Anthropic's chat UI scored the highest).

The numbers are really interesting. I was thinking we should leave the chat UI results out of the ReasoningAgent benchmark and restrict the results to the reasoning agent only?

cc @marklysze @sonichi @BabyCNM

Update: sorry, I misunderstood the results. I think the comparison looks amazing.

Hk669 avatar Jan 02 '25 04:01 Hk669

> I mean, we can add simplebench performance check as an optional CI for reasoning agent. It's only triggered when necessary and requires approval.

sounds great, let me add the optional CI.

Hk669 avatar Jan 02 '25 05:01 Hk669

> Thanks. How about adding the test into the contrib-openai CI?

added. please let me know if i missed anything. thanks! cc @sonichi

Hk669 avatar Jan 02 '25 17:01 Hk669

> > Thanks. How about adding the test into the contrib-openai CI?
>
> added. please let me know if i missed anything. thanks! cc @sonichi

It's better than before. An even better approach is to make a separate workflow so that it's not bundled with other contrib-openai tests. @marklysze @BabyCNM @qingyun-wu what do you think is a good balance between convenience and cost control?

sonichi avatar Jan 02 '25 17:01 sonichi

@Hk669 What is the status with this PR?

davorrunje avatar Feb 12 '25 20:02 davorrunje

This is just an experimental PR for anyone who wants to run SimpleBench on any agent.

It should be a good starting point for the benchmark.

Hk669 avatar Feb 13 '25 00:02 Hk669

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Feb 26 '25 20:02 CLAassistant

@Hk669 / @BabyCNM can we close this?

marklysze avatar Sep 16 '25 19:09 marklysze