Introduces AutoGenBench
Why are these changes needed?
This PR introduces AutoGenBench -- a tool for running common benchmarks, and other templated tests, with the AutoGen framework. It replaces the former "testbed" tool, and lives in the same location in the repository: samples/tools/testbed
For full details, see the AutoGenBench README.md
TL;DR: It is a (pip) installable module that handles benchmarking and evaluation tasks. An example session might resemble the following:
autogenbench clone HumanEval
cd HumanEval
autogenbench run Tasks/r_human_eval_two_agents.jsonl
autogenbench tabulate results/r_human_eval_two_agents
Where:
- `autogenbench clone HumanEval` downloads and expands the HumanEval benchmark scenario.
- `autogenbench run Tasks/r_human_eval_two_agents.jsonl` runs the tasks defined in Tasks/r_human_eval_two_agents.jsonl.
- `autogenbench tabulate results/r_human_eval_two_agents` tabulates the results of the run.
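For completeness, a hedged sketch of getting the tool onto a machine before running the session above; the editable-install path below is an assumption based on the repository location mentioned in this PR, so see the AutoGenBench README.md for the supported steps:

```sh
# From a checkout of the autogen repository (the path is an assumption for this PR):
pip install -e samples/tools/testbed

# Sanity check that the CLI is available (assumes a standard --help flag):
autogenbench --help
```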
Related issue number
Closes #995, #987, #996
Supersedes #997
Checks
- [ ] I've included any doc changes needed for https://microsoft.github.io/autogen/. See https://microsoft.github.io/autogen/docs/Contribute#documentation to build and test documentation locally.
- [ ] I've added tests (if relevant) corresponding to the changes introduced in this PR.
- [ ] I've made sure all auto checks have passed.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Comparison is base (bcfd770) 32.48% compared to head (6089571) 32.48%. Report is 1 commit behind head on main.
Additional details and impacted files
Coverage Diff

| | main | #1048 | +/- |
|---|---|---|---|
| Coverage | 32.48% | 32.48% | |
| Files | 41 | 41 | |
| Lines | 4907 | 4907 | |
| Branches | 1120 | 1120 | |
| Hits | 1594 | 1594 | |
| Misses | 3187 | 3187 | |
| Partials | 126 | 126 | |
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 32.44% <ø> (ø) | |

Flags with carried forward coverage won't be shown.
:umbrella: View full report in Codecov by Sentry.
@afourney code formatting issues were fixed by running `pre-commit run --all-files` on my side. Also fixed some minor wording issues.
Thanks. I'll re-install pre-commit hooks on my end, and hope the issues don't come back.
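For anyone else tripped up by the same formatting checks, the standard pre-commit workflow looks like this (assuming the repository's existing pre-commit configuration; these are stock pre-commit commands, not anything AutoGenBench-specific):

```sh
pip install pre-commit        # the hook runner itself
pre-commit install            # (re)register the git hooks for this clone
pre-commit run --all-files    # format/lint the whole tree, as was done above
```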
I'll try this out; seems really interesting and useful.
Suggest renaming tag testbed to autogenbench post this merge for improved discovery of tags.
Nice PR! Left very minor feedback. Otherwise looks fantastic!
I added a draft of a blog post. @qingyun-wu to review.
> Suggest renaming tag testbed to autogenbench post this merge for improved discovery of tags.
I'm going to give this a try, but it's going to be a little tough -- the autogenbench clone command works by linking to the repo, and the path is part of the current deployment so existing PyPI installs will at least temporarily break.
Edit: Oh, you mean the tag! Yes. I can do that.
Edit 2: I went ahead and renamed the folder too.
@gagb let me know if there is anything else
NOTE: Once this is merged, and before deleting the "autogenbench" branch, we will need to follow up with another super-quick PR to point "autogenbench clone" to the main branch. I can't do that now because the files don't exist in that branch yet, so manual testing will break. I suspect this will be a one-time issue.
Fixed. Points to main now. Use --branch autogenbench to override for testing.
> NOTE: Once this is merged, and before deleting the "autogenbench" branch, we will need to follow up with another super-quick PR to point "autogenbench clone" to the main branch. I can't do that now because the files don't exist in that branch yet, so manual testing will break. I suspect this will be a one-time issue.

Can you fix this in the code? First try via main, and if the file doesn't exist, fall back to the other branch?
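To sketch the suggested behavior (the URL layout and file path below are placeholders for illustration, not AutoGenBench's actual clone code): try main first, and only fall back to the feature branch when the file is missing.

```sh
BASE="https://raw.githubusercontent.com/microsoft/autogen"
FILE="samples/tools/autogenbench/scenarios/HumanEval/MANIFEST.json"   # hypothetical file

# curl -f exits non-zero on HTTP 404, so the second fetch runs only as a fallback.
curl -fsSL "$BASE/main/$FILE" -o MANIFEST.json \
  || curl -fsSL "$BASE/autogenbench/$FILE" -o MANIFEST.json
```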
> NOTE: Once this is merged, and before deleting the "autogenbench" branch, we will need to follow up with another super-quick PR to point "autogenbench clone" to the main branch. I can't do that now because the files don't exist in that branch yet, so manual testing will break. I suspect this will be a one-time issue.
>
> Can you fix this in the code? First try via main, and if the file doesn't exist, fall back to the other branch?
I added a --branch switch to the clone command. Now you can do `autogenbench clone --branch autogenbench HumanEval` to clone from this branch. Otherwise it defaults to main.
@ekzhu @qingyun-wu @victordibia @sonichi @rickyloynd-microsoft @julianakiseleva
Folks, I'm trying to get more eyeballs on this so that we can get this merged this week. Thanks @gagb for the earlier review.
If testing, be sure to use the "--branch autogenbench" parameter with the clone command since the files don't yet exist in main.
Thanks, @afourney, for the tremendous effort and fantastic work! This might be minor (or not): folder names should all use lowercase and underscores, following Python convention. For special terms, e.g., benchmark names such as AutoGPT and GAIA, it may make sense to keep them as is, but for other general folder names such as Scripts, Templates, and Tasks, I think it is better to use all lowercase.
Yeah, that's unfortunately both minor (in terms of complexity) and also a lot of work to fix (in terms of files to change and testing to do). Can this be handled in a follow-up PR?
Minimally, I will have to change all the manifest files. All the init_tasks scripts. All the tabulation scripts. The documentation. And the templates.
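If that follow-up PR happens, the renames themselves are mostly mechanical; a rough sketch (folder names taken from the comment above, and the grep is only there to locate references that would need updating):

```sh
# Do the case-only renames through git so history follows the files:
git mv Scripts scripts
git mv Templates templates
git mv Tasks tasks

# Locate remaining references (manifests, init_tasks scripts, tabulation scripts, docs, templates):
grep -rnE "Scripts|Templates|Tasks" .
```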