AutoGPT
Full prompt test cases
Background
We should be tracking prompts that work successfully, both to understand what AutoGPT is capable of and to ensure code changes don't break what's already working. Besides the unit tests, we need end-to-end test cases for real-world questions.
Changes
- Add a set of test cases that seem to work consistently with the current state of AutoGPT
- Add a script to execute those test cases
- Add instructions to create one's own test cases
Documentation
- tests/prompt_tests/README.md - Explains the usage and how to create one's own test cases
- In-code comments
Test Plan
- Tested each test case multiple times
- Tested the script with [all] as well as with individual test cases
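For reference, here is a rough, hypothetical sketch of what one of these prompt test cases and its pass/fail check could look like. The field names, file names, and the fib.txt example are illustrative assumptions, not necessarily the exact format this PR adds; the real format is described in tests/prompt_tests/README.md.

```python
# Hypothetical sketch of a prompt test case and its success check.
# The field names and file layout are illustrative assumptions.
import json
from pathlib import Path

test_case = {
    "name": "write_fibonacci_file",
    "goals": ["Write the first 10 Fibonacci numbers to fib.txt, then terminate."],
    "expected_file": "fib.txt",
    "max_cycles": 10,  # give up if the agent has not finished by then
}

def check_result(workspace: Path, case: dict) -> bool:
    """Return True if the expected artifact exists and looks plausible."""
    output = workspace / case["expected_file"]
    if not output.exists():
        return False
    # Checking content is stricter than checking mere existence.
    return "34" in output.read_text()

if __name__ == "__main__":
    print(json.dumps(test_case, indent=2))
```

A runner script would then loop over every case (the [all] mode) or only the named ones, reporting pass/fail per case.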
PR Quality Checklist
- [X] My pull request is atomic and focuses on a single change.
- [X] I have thoroughly tested my changes with multiple different prompts.
- [X] I have considered potential risks and mitigations for my changes.
- [X] I have documented my changes clearly and comprehensively.
- [X] I have not snuck in any "extra" small tweaks or changes
Everything looks good right now. I'm just trying to keep up with all the other merges :)
- 20 cycles to run a test is going to be expensive, and there are multiple tests.
- Why so many similar tests?
- Let's build the CI pipeline with it as well, because the test doesn't have much value if it doesn't break when people push stuff.
- I would love it if, instead of just checking for a file, you had an even more specific check.
Can we start with just one test that you know will always work in less than, let's say, 10 cycles?
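As a concrete illustration of that "more specific check" suggestion, a test could validate the content of the produced file rather than just its existence. The file name and expected string below are made up for the example; this is a sketch, not the check used in the PR.

```python
from pathlib import Path

def check_output_file(path: Path) -> bool:
    """Stricter check: validate the content, not just that the file exists."""
    if not path.is_file():
        return False
    text = path.read_text().strip()
    # Example: the test case asked the agent for the capital of France.
    return "Paris" in text

if __name__ == "__main__":
    print("PASS" if check_output_file(Path("capital.txt")) else "FAIL")
```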
@josephcmiller2 do you want to join us on Discord so we can talk about it? We love the initiative; we're just wondering whether there is a way to integrate this either:
- as a smoke test, or
- as a benchmark.
Right now it's sitting somewhere in between: it's simple and not very hard for the AI to achieve (a good smoke test for consistency), but it's also very long (a benchmark).
I'm on Discord. You can find me in pull-requests or dev-autogpt.
what was the status of this? would love to get test coverage up asap
I haven't received any feedback on why it isn't merged or what to do next. Push the devs on Discord to help get this through.
@josephcmiller2 you received feedback from Merwane: https://github.com/Significant-Gravitas/Auto-GPT/pull/1354#issuecomment-1517021555 https://github.com/Significant-Gravitas/Auto-GPT/pull/1354#issuecomment-1517030562
To summarize our thoughts: tests that involve prompting the AI should either be usable as a benchmark to test the AI's efficacy, or as a smoke test that needs max 3 cycles to determine if something is broken. Everything outside of those brackets is a gray area that will cost us a lot of money in terms of API tokens.
See also #1359
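A minimal sketch, under stated assumptions, of what such a ≤3-cycle smoke test could look like in pytest. The run_agent helper below is a stand-in stub, not an existing Auto-GPT API; a real test would drive the agent loop itself with a hard cycle cap.

```python
# Sketch of a fast smoke test that caps the agent at 3 cycles.
# run_agent is a placeholder stub; a real test would invoke Auto-GPT here.
from pathlib import Path
import pytest

def run_agent(goal: str, workspace: Path, max_cycles: int) -> int:
    """Stand-in for the agent loop; returns the number of cycles used."""
    (workspace / "hello.txt").write_text("hello")
    return 1

@pytest.mark.smoke
def test_writes_file_within_three_cycles(tmp_path):
    cycles_used = run_agent(
        goal="Write 'hello' to hello.txt and terminate.",
        workspace=tmp_path,
        max_cycles=3,  # fail fast: smoke tests get at most 3 cycles
    )
    assert cycles_used <= 3
    assert (tmp_path / "hello.txt").read_text().strip() == "hello"
```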
This is a mass message from the AutoGPT core team. Our apologies for the ongoing delay in processing PRs. This is because we are re-architecting the AutoGPT core!
For more details (and for info on joining our Discord), please refer to: https://github.com/Significant-Gravitas/Auto-GPT/wiki/Architecting
> To summarize our thoughts: tests that involve prompting the AI should either be usable as a benchmark to test the AI's efficacy, or as a smoke test that needs max 3 cycles to determine if something is broken.
If autogpt.py were executable in a single-shot fashion via corresponding CLI args, we could have our cake and eat it too, i.e. use Auto-GPT itself to run such tests, optionally. That would not even have to be a feature available in interactive mode, just a startup flag / CLI argument. That way, you'd probably get a ton of data.
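To sketch that idea: a test harness could shell out to the entry point with a non-interactive startup flag. The --single-shot flag and the other argument names below are hypothetical and do not exist in Auto-GPT today; they only illustrate the proposal.

```python
# Hypothetical sketch: drive Auto-GPT non-interactively from a test harness.
# None of these CLI flags exist today; they illustrate the proposed startup flag.
import subprocess
import sys

def run_single_shot(goal: str, workspace: str, max_cycles: int = 10) -> int:
    cmd = [
        sys.executable, "autogpt.py",
        "--single-shot",            # hypothetical: exit once the goal is done
        "--goal", goal,             # hypothetical: pass the goal on the CLI
        "--workspace", workspace,
        "--max-cycles", str(max_cycles),
    ]
    return subprocess.run(cmd, check=False).returncode

if __name__ == "__main__":
    code = run_single_shot("Summarize README.md into summary.txt", "./workspace")
    print("agent exited with code", code)
```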
> tests that involve prompting the AI should either be usable as a benchmark to test the AI's efficacy, or as a smoke test that needs max 3 cycles to determine if something is broken. Everything outside of those brackets is a gray area that will cost us a lot of money in terms of API tokens.
Maybe we can have our cake and eat it by allowing folks to experiment with different "prompt profiles" for different purposes? This sort of feature would be strictly opt-in (at the env level), but it would help us deal with all the PRs where people share their own prompt configs, and neatly organize things, while also providing an excellent way to benchmark/test things (regressions!) over time.
For the sake of regression testing, we should not underestimate the power of allowing people to create/share their own prompt profiles. This sort of feature would help close 10+ PRs here immediately!
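A minimal sketch of how an opt-in, env-level prompt profile might be wired up. The PROMPT_PROFILE variable, the prompt_profiles/ directory, and the YAML keys are all assumptions made for illustration, not part of Auto-GPT.

```python
# Hypothetical sketch: load an optional prompt profile selected via an env var.
# Variable name, directory, and keys are all illustrative assumptions.
import os
from pathlib import Path

import yaml  # PyYAML

DEFAULT_PROFILE = {
    "system_prompt": "You are Auto-GPT, an autonomous agent.",
    "constraints": [],
}

def load_prompt_profile() -> dict:
    """Return the default prompts unless PROMPT_PROFILE names a YAML profile."""
    name = os.getenv("PROMPT_PROFILE")
    if not name:
        return DEFAULT_PROFILE  # strictly opt-in: no env var, no behavior change
    path = Path("prompt_profiles") / f"{name}.yaml"
    with path.open() as fh:
        profile = yaml.safe_load(fh) or {}
    # Keys not overridden by the profile fall back to the defaults.
    return {**DEFAULT_PROFILE, **profile}

if __name__ == "__main__":
    print(load_prompt_profile())
```

Benchmark or regression runs could then pin one profile per run and compare results over time.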
See: https://github.com/Significant-Gravitas/Auto-GPT/pull/1874#issuecomment-1537191222
This PR exceeds the recommended size of 200 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size
Closing as stale