AutoGPT
Full prompt test cases
Background
We should be tracking prompts that work successfully, both to understand what AutoGPT is capable of and to ensure code changes don't break what's already working. Besides the unit tests, we need end-to-end test cases for real-world questions.
Changes
- Add a set of test cases that seem to work consistently with the current state of AutoGPT
- Add a script to execute those test cases
- Add instructions to create one's own test cases
Documentation
- tests/prompt_tests/README.md - Explains the usage and how to create one's own test cases
- In-code comments
Test Plan
- Tested each test case multiple times
- Tested the script with [all] as well as with individual test cases
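For reference, here is a rough, hypothetical sketch of what one of these prompt test cases and its pass/fail check could look like. The field names, file names, and the fib.txt example are illustrative assumptions, not necessarily the exact format this PR adds; the real format is described in tests/prompt_tests/README.md.

```python
# Hypothetical sketch of a prompt test case and its success check.
# The field names and file layout are illustrative assumptions.
import json
from pathlib import Path

test_case = {
    "name": "write_fibonacci_file",
    "goals": ["Write the first 10 Fibonacci numbers to fib.txt, then terminate."],
    "expected_file": "fib.txt",
    "max_cycles": 10,  # give up if the agent has not finished by then
}

def check_result(workspace: Path, case: dict) -> bool:
    """Return True if the expected artifact exists and looks plausible."""
    output = workspace / case["expected_file"]
    if not output.exists():
        return False
    # Checking content is stricter than checking mere existence.
    return "34" in output.read_text()

if __name__ == "__main__":
    print(json.dumps(test_case, indent=2))
```

A runner script would then loop over every case (the [all] mode) or only the named ones, reporting pass/fail per case.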
PR Quality Checklist
- [X] My pull request is atomic and focuses on a single change.
- [X] I have thoroughly tested my changes with multiple different prompts.
- [X] I have considered potential risks and mitigations for my changes.
- [X] I have documented my changes clearly and comprehensively.
- [X] I have not snuck in any "extra" small tweaks or changes
Everything looks good right now. I'm just trying to keep up with all the other merges :)
- 20 cycles to run a test is going to be expensive, and there are multiple tests.
- Why so many similar tests?
- Let's build the CI pipeline with it as well, because the test doesn't have much value if it doesn't break when people push stuff.
- I would love it if, instead of just checking for a file, you had an even more specific check.
Can we start with just one test that you know will always work in less than, let's say, 10 cycles?
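As a concrete illustration of that "more specific check" suggestion, a test could validate the content of the produced file rather than just its existence. The file name and expected string below are made up for the example; this is a sketch, not the check used in the PR.

```python
from pathlib import Path

def check_output_file(path: Path) -> bool:
    """Stricter check: validate the content, not just that the file exists."""
    if not path.is_file():
        return False
    text = path.read_text().strip()
    # Example: the test case asked the agent for the capital of France.
    return "Paris" in text

if __name__ == "__main__":
    print("PASS" if check_output_file(Path("capital.txt")) else "FAIL")
```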
@josephcmiller2 do you want to join us on Discord so we can talk about it? We love the initiative; we're just wondering whether there is a way to integrate this either:
- as a smoke test, or
- as a benchmark.
Right now it's sitting somewhere in between: it's simple and not very hard for the AI to achieve (a good smoke test for consistency), but it's also very long (a benchmark).
I'm on Discord. You can find me in pull-requests or dev-autogpt.
what was the status of this? would love to get test coverage up asap
I haven't received any feedback on why it isn't merged or what to do next. Push the devs on Discord to help get this through.
@josephcmiller2 you received feedback from Merwane: https://github.com/Significant-Gravitas/Auto-GPT/pull/1354#issuecomment-1517021555 https://github.com/Significant-Gravitas/Auto-GPT/pull/1354#issuecomment-1517030562
To summarize our thoughts: tests that involve prompting the AI should either be usable as a benchmark to test the AI's efficacy, or as a smoke test that needs max 3 cycles to determine if something is broken. Everything outside of those brackets is a gray area that will cost us a lot of money in terms of API tokens.
See also #1359
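A minimal sketch, under stated assumptions, of what such a ≤3-cycle smoke test could look like in pytest. The run_agent helper below is a stand-in stub, not an existing Auto-GPT API; a real test would drive the agent loop itself with a hard cycle cap.

```python
# Sketch of a fast smoke test that caps the agent at 3 cycles.
# run_agent is a placeholder stub; a real test would invoke Auto-GPT here.
from pathlib import Path
import pytest

def run_agent(goal: str, workspace: Path, max_cycles: int) -> int:
    """Stand-in for the agent loop; returns the number of cycles used."""
    (workspace / "hello.txt").write_text("hello")
    return 1

@pytest.mark.smoke
def test_writes_file_within_three_cycles(tmp_path):
    cycles_used = run_agent(
        goal="Write 'hello' to hello.txt and terminate.",
        workspace=tmp_path,
        max_cycles=3,  # fail fast: smoke tests get at most 3 cycles
    )
    assert cycles_used <= 3
    assert (tmp_path / "hello.txt").read_text().strip() == "hello"
```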
This is a mass message from the AutoGPT core team. Our apologies for the ongoing delay in processing PRs. This is because we are re-architecting the AutoGPT core!
For more details (and for info on joining our Discord), please refer to: https://github.com/Significant-Gravitas/Auto-GPT/wiki/Architecting
> To summarize our thoughts: tests that involve prompting the AI should either be usable as a benchmark to test the AI's efficacy, or as a smoke test that needs max 3 cycles to determine if something is broken.
If autogpt.py were executable in a single-shot fashion via corresponding CLI args, we could have our cake and eat it too, i.e. use Auto-GPT itself to run such tests, optionally. That would not even have to be a feature available in interactive mode, just a startup flag / CLI argument. That way, you'd probably get a ton of data.
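To sketch that idea: a test harness could shell out to the entry point with a non-interactive startup flag. The --single-shot flag and the other argument names below are hypothetical and do not exist in Auto-GPT today; they only illustrate the proposal.

```python
# Hypothetical sketch: drive Auto-GPT non-interactively from a test harness.
# None of these CLI flags exist today; they illustrate the proposed startup flag.
import subprocess
import sys

def run_single_shot(goal: str, workspace: str, max_cycles: int = 10) -> int:
    cmd = [
        sys.executable, "autogpt.py",
        "--single-shot",            # hypothetical: exit once the goal is done
        "--goal", goal,             # hypothetical: pass the goal on the CLI
        "--workspace", workspace,
        "--max-cycles", str(max_cycles),
    ]
    return subprocess.run(cmd, check=False).returncode

if __name__ == "__main__":
    code = run_single_shot("Summarize README.md into summary.txt", "./workspace")
    print("agent exited with code", code)
```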
> tests that involve prompting the AI should either be usable as a benchmark to test the AI's efficacy, or as a smoke test that needs max 3 cycles to determine if something is broken. Everything outside of those brackets is a gray area that will cost us a lot of money in terms of API tokens.
Maybe we can have our cake and eat it by allowing folks to experiment with different "prompt profiles" for different purposes? This sort of feature would be strictly opt-in (at the env level), but it would help us deal with all the PRs where people share their own prompt configs, and neatly organize things, while also providing an excellent way to benchmark/test things (regressions!) over time.
For the sake of regression testing, we should not underestimate the power of allowing people to create/share their own prompt profiles. This sort of feature would help close 10+ PRs here immediately!
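A minimal sketch of how an opt-in, env-level prompt profile might be wired up. The PROMPT_PROFILE variable, the prompt_profiles/ directory, and the YAML keys are all assumptions made for illustration, not part of Auto-GPT.

```python
# Hypothetical sketch: load an optional prompt profile selected via an env var.
# Variable name, directory, and keys are all illustrative assumptions.
import os
from pathlib import Path

import yaml  # PyYAML

DEFAULT_PROFILE = {
    "system_prompt": "You are Auto-GPT, an autonomous agent.",
    "constraints": [],
}

def load_prompt_profile() -> dict:
    """Return the default prompts unless PROMPT_PROFILE names a YAML profile."""
    name = os.getenv("PROMPT_PROFILE")
    if not name:
        return DEFAULT_PROFILE  # strictly opt-in: no env var, no behavior change
    path = Path("prompt_profiles") / f"{name}.yaml"
    with path.open() as fh:
        profile = yaml.safe_load(fh) or {}
    # Keys not overridden by the profile fall back to the defaults.
    return {**DEFAULT_PROFILE, **profile}

if __name__ == "__main__":
    print(load_prompt_profile())
```

Benchmark or regression runs could then pin one profile per run and compare results over time.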
See: https://github.com/Significant-Gravitas/Auto-GPT/pull/1874#issuecomment-1537191222
This PR exceeds the recommended size of 200 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size
Closing as stale