Introduces AutoGenBench
Why are these changes needed?
This PR introduces AutoGenBench -- a tool for running common benchmarks, and other templated tests, with the AutoGen framework. It replaces the former "testbed" tool, and lives in the same location in the repository: samples/tools/testbed
For full details, see the AutoGenBench README.md
TL;DR: It is a (pip) installable module that handles benchmarking and evaluation tasks. An example session might resemble the following:
autogenbench clone HumanEval
cd HumanEval
autogenbench run Tasks/r_human_eval_two_agents.jsonl
autogenbench tabulate results/r_human_eval_two_agents
Where:
- `autogenbench clone HumanEval` downloads and expands the HumanEval benchmark scenario.
- `autogenbench run Tasks/r_human_eval_two_agents.jsonl` runs the tasks defined in Tasks/r_human_eval_two_agents.jsonl.
- `autogenbench tabulate results/r_human_eval_two_agents` tabulates the results of the run.
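For completeness, a hedged sketch of getting the tool onto a machine before running the session above; the editable-install path below is an assumption based on the repository location mentioned in this PR, so see the AutoGenBench README.md for the supported steps:

```sh
# From a checkout of the autogen repository (the path is an assumption for this PR):
pip install -e samples/tools/testbed

# Sanity check that the CLI is available (assumes a standard --help flag):
autogenbench --help
```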
Related issue number
Closes #995, #987, #996
Supersedes #997
Checks
- [ ] I've included any doc changes needed for https://microsoft.github.io/autogen/. See https://microsoft.github.io/autogen/docs/Contribute#documentation to build and test documentation locally.
- [ ] I've added tests (if relevant) corresponding to the changes introduced in this PR.
- [ ] I've made sure all auto checks have passed.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Comparison is base (bcfd770) 32.48% compared to head (6089571) 32.48%. Report is 1 commit behind head on main.
Additional details and impacted files
Coverage Diff

| | main | #1048 | +/- |
|---|---|---|---|
| Coverage | 32.48% | 32.48% | |
| Files | 41 | 41 | |
| Lines | 4907 | 4907 | |
| Branches | 1120 | 1120 | |
| Hits | 1594 | 1594 | |
| Misses | 3187 | 3187 | |
| Partials | 126 | 126 | |
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 32.44% <ø> (ø) | |

Flags with carried forward coverage won't be shown.
:umbrella: View full report in Codecov by Sentry.
@afourney code formatting issues were fixed by running `pre-commit run --all-files` on my side. Also fixed some minor wording issues.
Thanks. I'll re-install pre-commit hooks on my end, and hope the issues don't come back.
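For anyone else tripped up by the same formatting checks, the standard pre-commit workflow looks like this (assuming the repository's existing pre-commit configuration; these are stock pre-commit commands, not anything AutoGenBench-specific):

```sh
pip install pre-commit        # the hook runner itself
pre-commit install            # (re)register the git hooks for this clone
pre-commit run --all-files    # format/lint the whole tree, as was done above
```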
I'll try this out; seems really interesting and useful.
Suggest renaming tag testbed to autogenbench post this merge for improved discovery of tags.
Nice PR! Left very minor feedback. Otherwise looks fantastic!
I added a draft of a blog post. @qingyun-wu to review.
> Suggest renaming tag testbed to autogenbench post this merge for improved discovery of tags.
I'm going to give this a try, but it's going to be a little tough -- the autogenbench clone command works by linking to the repo, and the path is part of the current deployment so existing PyPI installs will at least temporarily break.
Edit: Oh, you mean the tag! Yes. I can do that.
Edit 2: I went ahead and renamed the folder too.
@gagb let me know if there is anything else
NOTE: Once this is merged, and before deleting the "autogenbench" branch, we will need to follow up with another super-quick PR to point "autogenbench clone" to the main branch. I can't do that now because the files don't exist in that branch yet, so manual testing will break. I suspect this will be a one-time issue.
Fixed. Points to main now. Use --branch autogenbench to override for testing.
> NOTE: Once this is merged, and before deleting the "autogenbench" branch, we will need to follow up with another super-quick PR to point "autogenbench clone" to the main branch. I can't do that now because the files don't exist in that branch yet, so manual testing will break. I suspect this will be a one-time issue.

Can you fix this in the code? First try via main, and if the file doesn't exist, fall back to the other branch?
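To sketch the suggested behavior (the URL layout and file path below are placeholders for illustration, not AutoGenBench's actual clone code): try main first, and only fall back to the feature branch when the file is missing.

```sh
BASE="https://raw.githubusercontent.com/microsoft/autogen"
FILE="samples/tools/autogenbench/scenarios/HumanEval/MANIFEST.json"   # hypothetical file

# curl -f exits non-zero on HTTP 404, so the second fetch runs only as a fallback.
curl -fsSL "$BASE/main/$FILE" -o MANIFEST.json \
  || curl -fsSL "$BASE/autogenbench/$FILE" -o MANIFEST.json
```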
> NOTE: Once this is merged, and before deleting the "autogenbench" branch, we will need to follow up with another super-quick PR to point "autogenbench clone" to the main branch. I can't do that now because the files don't exist in that branch yet, so manual testing will break. I suspect this will be a one-time issue.
>
> Can you fix this in the code? First try via main, and if the file doesn't exist, fall back to the other branch?
I added a --branch switch to the clone command. Now you can do `autogenbench clone --branch autogenbench HumanEval` to clone from this branch. Otherwise it defaults to main.
@ekzhu @qingyun-wu @victordibia @sonichi @rickyloynd-microsoft @julianakiseleva
Folks, I'm trying to get more eyeballs on this so that we can get this merged this week. Thanks @gagb for the earlier review.
If testing, be sure to use the "--branch autogenbench" parameter with the clone command since the files don't yet exist in main.
Thanks, @afourney, for the tremendous effort and fantastic work! This might be minor (or not): folder names should all use lowercase and underscores, following Python convention. For special terms, e.g., benchmark names such as AutoGPT and GAIA, it may make sense to keep them as is, but for other general folder names such as Scripts, Templates, and Tasks, I think it is better to use all lowercase.
Yeah, that's unfortunately both minor (in terms of complexity) and also a lot of work to fix (in terms of files to change and testing to do). Can this be handled in a follow-up PR?
Minimally, I will have to change all the manifest files. All the init_tasks scripts. All the tabulation scripts. The documentation. And the templates.
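If that follow-up PR happens, the renames themselves are mostly mechanical; a rough sketch (folder names taken from the comment above, and the grep is only there to locate references that would need updating):

```sh
# Do the case-only renames through git so history follows the files:
git mv Scripts scripts
git mv Templates templates
git mv Tasks tasks

# Locate remaining references (manifests, init_tasks scripts, tabulation scripts, docs, templates):
grep -rnE "Scripts|Templates|Tasks" .
```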