Help us build challenges!

waynehamadi opened this issue 2 years ago • 22 comments

Summary 💡

Challenges are tasks Auto-GPT is not yet able to achieve. Challenges will help us improve Auto-GPT.

We need your help to build these challenges and the ecosystem around them.

Here is a breakdown of the help we need.

A-Challenge Creation

1-Submit challenges within existing categories.

Memory

  • ~~https://github.com/Significant-Gravitas/Auto-GPT/issues/3838 => @dschonholtz~~
  • Memory challenge D => @Androbin

Information Retrieval

  • https://github.com/Significant-Gravitas/Auto-GPT/issues/3837 => @PortlandKyGuy
  • Information Retrieval Challenge C => @Androbin

Research

  • Research Challenge A => @Androbin

Psychological

  • https://github.com/Significant-Gravitas/Auto-GPT/issues/3871 => unassigned

Debug Code

  • https://github.com/Significant-Gravitas/Auto-GPT/issues/3836 => @gravelBridge
  • https://github.com/Significant-Gravitas/Auto-GPT/issues/3935 => @gravelBridge

Adaptability

  • https://github.com/Significant-Gravitas/Auto-GPT/issues/3937 => unassigned

Website Navigation Challenge

  • https://github.com/Significant-Gravitas/Auto-GPT/issues/3936 => unassigned

Self Improvement (Solve challenges automatically)

  • https://github.com/Significant-Gravitas/Auto-GPT/issues/3912 => unassigned

Automated Challenge Creation

  • https://github.com/Significant-Gravitas/Auto-GPT/issues/3917 => unassigned

2-Design brand new challenge categories

  • Alignment
  • Obtain Knowledge
  • Focus (stay on task)
  • Planning (suggested by @Boostrix)
  • Others, if you have ideas. DM me if interested (Discord below).

Challenges Auto-GPT can already perform

We currently have what we call regression tests. You can imagine regression tests as challenges Auto-GPT is already able to perform.

  • https://github.com/Significant-Gravitas/Auto-GPT/issues/3901 => unassigned

UX around challenges

  • https://github.com/Significant-Gravitas/Auto-GPT/issues/3839 => unassigned

Improve the logs/DEBUG folder to help people beat challenges more easily

The logs/DEBUG folder allows everyone to understand what Auto-GPT is doing at every cycle. We need to:

  • ~~https://github.com/Significant-Gravitas/Auto-GPT/issues/3842 => @AndresCdo, done, thank you!~~
  • ~~https://github.com/Significant-Gravitas/Auto-GPT/issues/3843 => @AndresCdo, done, thank you!~~
  • https://github.com/Significant-Gravitas/Auto-GPT/issues/3960 => log ai_settings.yaml
  • https://github.com/Significant-Gravitas/Auto-GPT/issues/3844 => BLOCKED => waiting for @Pwuts' memory rework before assigning
  • https://github.com/Significant-Gravitas/Auto-GPT/issues/3957 => BLOCKED => waiting for the re-arch

"Fix Auto-GPT" challenges !

The vision of the "Fix Auto-GPT" challenges is to give the community the tools to report wrong behaviors and create challenges around them. We need to:

  • https://github.com/Significant-Gravitas/Auto-GPT/issues/3841 => unassigned
  • https://github.com/Significant-Gravitas/Auto-GPT/issues/3847 => unassigned

Make it easy for people to change the prompt

  • https://github.com/Significant-Gravitas/Auto-GPT/issues/3858 => @Eggrolling-hu
  • https://github.com/Significant-Gravitas/Auto-GPT/issues/3954 => idea brought up by @Boostrix

Build the CI pipeline!

DM me on Discord if you have DevOps experience and want to help us build the pipeline that will allow people to submit their challenges and see how they perform!

  • https://github.com/Significant-Gravitas/Auto-GPT/issues/3966 => unassigned

Pytest Improvements

  • ~~https://github.com/Significant-Gravitas/Auto-GPT/issues/3863 => @AndresCdo~~
  • https://github.com/Significant-Gravitas/Auto-GPT/issues/3900 => unassigned
  • https://github.com/Significant-Gravitas/Auto-GPT/issues/3906 => unassigned
  • https://github.com/Significant-Gravitas/Auto-GPT/issues/4199 => unassigned

Challenges refactoring

  • https://github.com/Significant-Gravitas/Auto-GPT/issues/3883 => unassigned
  • https://github.com/Significant-Gravitas/Auto-GPT/issues/3907 => unassigned
  • https://github.com/Significant-Gravitas/Auto-GPT/issues/4161 => unassigned
  • https://github.com/Significant-Gravitas/Auto-GPT/issues/4189 => unassigned
  • https://github.com/Significant-Gravitas/Auto-GPT/issues/4190 => unassigned

CI pipeline improvements

  • https://github.com/Significant-Gravitas/Auto-GPT/issues/4163 => unassigned
  • https://github.com/Significant-Gravitas/Auto-GPT/issues/4196 => @merwanehamadi
  • ~~allow CI pipeline to make calls to API providers => @merwanehamadi~~
  • https://github.com/Significant-Gravitas/Auto-GPT/issues/4258 => @rihp
  • issue to create: a test that makes sure the challenges have a corresponding doc
  • issue to create: a test maker, i.e. a command that creates a test automatically

Discord (merwanehamadi): Join Auto-GPT's Discord channel: https://discord.gg/autogpt

waynehamadi · May 05 '23 12:05

The section titled "others" should at least mention "planning", which is severely lacking at the moment. I bet all of us can come up with dozens of ai_settings files (objectives) where it fails to plan properly, where it fails to recognize dependencies between tasks, or where it fails to recognize that it previously succeeded/completed a task and should proceed: https://github.com/Significant-Gravitas/Auto-GPT/issues/3593#issuecomment-1532870949

For starters, we probably need challenges that involve:

  • unconditional sequential tasks
  • conditional sequential tasks

(not getting into async/concurrency for now)
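
To make this concrete, here is a hypothetical ai_settings file for a conditional sequential task. The goal wording is made up for illustration; only the ai_name/ai_role/ai_goals structure is Auto-GPT's actual format:

```yaml
# Hypothetical planning challenge: goals 2 and 3 depend on earlier goals,
# and goal 3 is conditional on the result of goal 2.
ai_name: PlannerTest
ai_role: an agent that must complete dependent tasks in the right order
ai_goals:
  - Create a file called numbers.txt containing the numbers 1 to 10, one per line.
  - Only after numbers.txt exists, create sum.txt containing the sum of those numbers.
  - If the sum is even, write "even" to parity.txt, otherwise write "odd".
```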

Boostrix · May 05 '23 19:05

Yeah, this is a great suggestion @Boostrix. Can I talk to you on Discord? My Discord is merwanehamadi. https://discord.gg/autogpt

waynehamadi · May 05 '23 21:05

Boostrix prefers to stay off Discord, based on my prior interactions with them.

anonhostpi · May 05 '23 21:05

@Boostrix can you think of a challenge we could build in order to test planning skills?

waynehamadi · May 06 '23 03:05

@Androbin has suggested a very nice memory challenge involving files that are read in the wrong order. More details coming soon hopefully.

waynehamadi · May 06 '23 04:05

can you think of a challenge we could build in order to test planning skills?

For starters, I would consider a plan to be a multi-objective task with multiple dimensions of leeway and constraints that the agent needs to explore.

So I suppose anything involving untangling dependencies (or the lack thereof) should work to get this started. That would involve organizing steps but also coordinating between steps. In #3593, @dschonholtz is juggling some nice ideas. And given the current state of things, it might actually be good to have working/non-working examples of plans for the agent, so that we can see which approach(es) look promising.

And in fact, GPT itself seems pretty good at coming up with potential candidates:

  • Grocery shopping: An AI agent could plan a grocery shopping trip by identifying the necessary items, determining the optimal route to take through the store to minimize time and energy, and considering any budget or dietary constraints.
  • Booking a flight: An AI agent could plan a flight booking by identifying the preferred travel dates, considering any budget or scheduling constraints, and selecting the best flight options based on price, duration, and other criteria.
  • Event planning: An AI agent could plan an event by identifying the necessary resources (such as venue, catering, and decorations), determining the optimal timing and scheduling, and considering any budget or other constraints.
  • Software development: An AI agent could plan a software development project by identifying the necessary features and functionality, breaking them down into smaller tasks and subtasks, and assigning them to team members based on their skills and availability.
  • Construction project management: An AI agent could plan a construction project by identifying the necessary resources (such as labor, materials, and equipment), coordinating with contractors and subcontractors, scheduling tasks and milestones, and monitoring progress to ensure that the project stays on track and within budget.

PS: I would suggest extending the logging feature so that we can optionally log resource utilization to CSV files for these challenges. That way, we can trivially plot the performance of different challenges over time (different versions of Auto-GPT), and in conjunction with support for #3466 (constraint awareness), we could also log how many steps [thinking] were needed for each challenge/benchmark (plan/task).
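
A minimal sketch of what that per-cycle CSV logging could look like. The file location, field names, and the log_cycle_usage hook are all hypothetical, not existing Auto-GPT code:

```python
import csv
import time
from pathlib import Path

# Hypothetical location and schema; field names are illustrative only.
LOG_FILE = Path("logs/DEBUG/resource_usage.csv")
FIELDS = ["timestamp", "cycle", "prompt_tokens", "completion_tokens", "api_cost_usd"]

def log_cycle_usage(cycle: int, prompt_tokens: int,
                    completion_tokens: int, api_cost_usd: float) -> None:
    """Append one row per think/act cycle so runs can be plotted across Auto-GPT versions."""
    LOG_FILE.parent.mkdir(parents=True, exist_ok=True)
    write_header = not LOG_FILE.exists()
    with LOG_FILE.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow({
            "timestamp": time.time(),
            "cycle": cycle,
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "api_cost_usd": api_cost_usd,
        })
```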

EDIT: To promote this effort a little more, I would suggest adding a few of these issues to the MOTD displayed when starting Auto-GPT, so that more users are encouraged to get involved. We could randomly alternate between a handful of relevant issues and invite folks to participate.

We currently have what we call regression tests. You can imagine regression tests as challenges Auto-GPT is already able to perform. We need your help to create more of these regression tests.

One of the most impressive examples posted around here is this one by @adam-paterson: https://github.com/Significant-Gravitas/Auto-GPT/issues/2775#issuecomment-1517703255

Boostrix · May 06 '23 05:05

@Boostrix great suggestions! How do you measure success in a very deterministic way for these items? I like the logging resource utilization suggestion. I will create an issue.

The example of regression you gave (entire website) is great! I guess we could use Selenium to test the website. It looks like a great project.

waynehamadi · May 06 '23 12:05

How do you measure success in a very deterministic way for these items?

Right, good question.

I was thinking of using a simple test case for starters, one where we ask the "planner" to coordinate a meeting between N different people (say 3-4) who are constrained by availability (date/time). Later on, N can be increased, with more constraints added, such as certain people never being available on the same date. We can then use such a spec to generate a Python unit test using GPT that answers, for each participant, whether that participant is available at the requested day/time.

So, the input for the unit test will be date/time for now (for each participant). The next step would be to ask Agent-GPT to plan the corresponding meeting by telling it about those constraints and specifying a potential time window (or list thereof).

To verify whether the solution is valid, we merely need to execute the unit test for each participant, which will tell us if that participant is available. That way, we're reducing the problem to running tests.
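
A minimal pytest sketch of that reduction. The participants, availability windows, and proposed slot are made up; in practice GPT would generate them from the spec:

```python
from datetime import datetime

# Made-up availability windows per participant: (start, end) pairs.
AVAILABILITY = {
    "alice": [(datetime(2023, 5, 8, 9, 0), datetime(2023, 5, 8, 12, 0))],
    "bob": [(datetime(2023, 5, 8, 10, 0), datetime(2023, 5, 8, 14, 0))],
    "carol": [(datetime(2023, 5, 8, 11, 0), datetime(2023, 5, 8, 13, 0))],
}

def is_available(participant: str, slot: datetime) -> bool:
    """True if the participant is free at the proposed meeting time."""
    return any(start <= slot < end for start, end in AVAILABILITY[participant])

def test_proposed_meeting_slot():
    # Stand-in for whatever slot the agent proposes after being told the constraints.
    proposed_slot = datetime(2023, 5, 8, 11, 0)
    for participant in AVAILABILITY:
        assert is_available(participant, proposed_slot), f"{participant} is not available"
```

At whole-hour granularity, 11:00 is the only slot all three share, so success is fully deterministic.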

Once we have that fleshed out/working for 3-4 participants, it would make sense to add more complexity to it by adding more options and constraints, including dependencies between these options and constraints.

We could then adapt this framework for other more complex examples (see above).

I like the logging resource utilization suggestion. I will create an issue.

For API costs/token (budget) we have several open PRs, number of steps taken is part of at least 2 PRs that I am aware of. The rest is probably in the realm of #3466

The example of regression you gave (entire website) is great! I guess we could use selenium to test the website. It looks like a great project.

I didn't even think about using Selenium. I was thinking of treating the final HTML like any XHTML/XML document and querying it from Python to see if it's got the relevant tags and attributes as requested by the specs. Personally, I find the whole Selenium stuff mainly useful for highly dynamic websites; static HTML can probably be queried "as is" from Python. No need for Selenium?
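
A rough sketch of that browser-free check using only the standard library. The file name and the required tags are placeholders for whatever the spec demands:

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collects every (tag, attrs) pair so specs can be checked without a browser."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))

def test_generated_site_matches_spec():
    # "index.html" stands in for whatever file the challenge asks the agent to produce.
    collector = TagCollector()
    with open("index.html") as f:
        collector.feed(f.read())
    tag_names = [tag for tag, _ in collector.tags]
    assert "title" in tag_names
    assert any(tag == "a" and "href" in attrs for tag, attrs in collector.tags)
```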

That being said, we could also use @adam-paterson's example and add an outer agent to mutate his ai_settings / project.txt file to tinker with 3-5 variations of each step (file names, technology stack, functionality, etc.). That way, you end up with a ton of regression tests at the mere cost of a few nested loops. The "deliverables" portion of his specs is so succinct that we could probably use it "as is" to create a corresponding Python unit test via Auto-GPT (I have used this for several toy projects, and it's working nicely).

Boostrix · May 06 '23 13:05

Information Retrieval Challenge

FWIW, I can get it to bail out rather quickly by letting an outer agent mutate the following directive:

list of X open source software packages in the area of {audio, video, office, games, flight simulation}, [released under the {GPL, BSD, ...}], [working on {Windows, Mac OSX, Linux}]

Which is pretty interesting once you think about it, since this is the sort of stuff that LLMs like GPT are generally good at. In fact, GPT can answer any of these easily, just not in combination with Auto-GPT; it seems like some sort of "intra-llm-regression" due to the interaction with the AI agent mechanism.
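
The mutation itself is trivial. Something like the following sketch, where the option lists are copied from the template above (the elided licenses stay elided) and the rendered phrasing is my own:

```python
import itertools

# Option lists taken from the directive template above.
AREAS = ["audio", "video", "office", "games", "flight simulation"]
LICENSES = ["GPL", "BSD"]
PLATFORMS = ["Windows", "Mac OSX", "Linux"]

def mutated_directives(count: int = 5):
    """Yield every combination of the template's options as a concrete directive
    that an outer agent could feed to Auto-GPT one at a time."""
    for area, license_name, platform in itertools.product(AREAS, LICENSES, PLATFORMS):
        yield (
            f"list of {count} open source software packages in the area of {area}, "
            f"released under the {license_name}, working on {platform}"
        )

for directive in mutated_directives():
    print(directive)
```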

Boostrix · May 08 '23 13:05

@Boostrix I would agree that our current prompting artificially limits GPT-4's abilities. The issue I see is that we actively discourage long-form chain-of-thought reasoning.

Androbin · May 09 '23 01:05

The idea of using dynamic prompting sounds rather promising:

  • #3937

Boostrix · May 09 '23 07:05

Here's another good description by @hashratez of an intra-llm-regression that should be suitable for benchmarking the system against GPT itself:

#1647

I then tried the simplest task: create an index.html file with the word "hello". In direct ChatGPT it would output this in 1 second. This took over 10 steps as it writes the code, checks it then does something else, etc.

Boostrix · May 09 '23 21:05

We should update the list of potential challenges to add a new category for "experiential tests", i.e. tests where the agent should be able to apply its experience of previously completing a task, but fails to do so.

The most straightforward example I can think of is the agent editing/updating a file successfully and, 5 minutes later, wanting to use interactive editors like nano, vim, etc. That's a recurring topic here.

So we should add a benchmark to see how good an agent is at applying experience to future tasks.

A simple test case would be telling it to update a file with some hand-holding, then leaving out the hand-holding and counting how many times it succeeds or not (e.g. 3/10 attempts).
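
Measuring that could be as simple as re-running the unassisted task and counting. Here, run_task is a hypothetical hook that launches the file-update task once and reports success:

```python
def experience_success_rate(run_task, attempts: int = 10) -> float:
    """Re-run the same task without hand-holding; the success rate is the benchmark score."""
    successes = sum(1 for _ in range(attempts) if run_task())
    return successes / attempts
```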

Being able to apply past experience is an essential part of learning.

Here's another suggestion for a new category: "Tool use". The LLM has access to ~20 built-in functions (BIFs); it should be able to use this access to make up for missing functionality. For instance, with browse_website disabled, it should still be able to figure out internet access using execute_shell or execute_python. The number of steps it needs to do so is a measure of its capability to use tools.

Likewise, after disabling download_file it should be able to figure out how to use Python or the shell to download stuff.

There are often several ways to "skin a cat": when disabling git operations or the shell, the agent must be capable of figuring out alternatives, and the number of steps it needs to do so tells us just how effective the agent is.

From a pytest standpoint, we would ideally be able to disable some BIFs and then run a test to see how many steps the agent needs to complete it. If, over time, that number increases, the agent is performing worse.

We can also ask the agent to perform tasks for which it does not have any BIFs at all, such as doing mathematical calculations (#3412), and then count the number of steps it needs to come up with a solution using different options (Python or the shell).
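
From the pytest side, the harness could look roughly like this. run_agent and the step budget are hypothetical stand-ins; Auto-GPT does not expose this API today:

```python
import pytest

MAX_STEPS = 50  # Arbitrary budget: needing more steps than this counts as a regression.

def run_agent(task: str, disabled_commands: set) -> int:
    """Hypothetical harness: run Auto-GPT on `task` with the given built-in
    functions disabled and return the number of steps taken to complete it."""
    raise NotImplementedError("stand-in for the real challenge harness")

@pytest.mark.parametrize("disabled", [
    {"browse_website"},  # should fall back to execute_shell / execute_python
    {"download_file"},   # should download via Python or the shell instead
])
def test_tool_use_with_disabled_bifs(disabled):
    steps = run_agent("fetch the contents of https://example.com", disabled)
    assert steps <= MAX_STEPS
```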

Boostrix · May 11 '23 08:05

The CI pipeline is 4 times faster thanks to @AndresCdo and parallelized tests!

waynehamadi · May 12 '23 01:05

Suggestion for a new challenge type: in light of the latest Google/DDG dilemma (#4120), we need to harden our commands/agent accordingly (#4157):

  • work out a way to randomly make commands malfunction or become disabled via pytest (using a timer or a rand function at the command manager/registry level)
  • let the agent figure out that a command is malfunctioning, so that it comes up with an alternative

This may involve tinkering with different variable/argument substitutions: https://github.com/Significant-Gravitas/Auto-GPT/issues/3904#issuecomment-1545835975

Basically, we need to keep track of commands that previously worked/failed and the types of arguments/params that are known to work, including an optional timeout to retry a command once in a while.

Also, commands really should get access to the error/exception string - because at that point, the LLM can actually help us!
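
As a sketch, a pytest-level wrapper around the command registry might look like this. The registry.call interface is hypothetical, not Auto-GPT's actual one:

```python
import random

class FlakyRegistry:
    """Wraps a command registry so tests can make commands randomly malfunction,
    while recording which commands worked, which failed, and the exception text
    so it can be fed back to the LLM."""

    def __init__(self, registry, failure_rate: float = 0.2, seed=None):
        self.registry = registry
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)
        self.history = []  # (command name, succeeded, error text)

    def call(self, name: str, **kwargs):
        try:
            if self.rng.random() < self.failure_rate:
                raise RuntimeError(f"simulated malfunction of '{name}'")
            result = self.registry.call(name, **kwargs)
        except Exception as exc:
            # Surface the error string so the agent can reason about an alternative.
            self.history.append((name, False, str(exc)))
            raise
        self.history.append((name, True, ""))
        return result
```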

Boostrix · May 13 '23 08:05

A new issue was added:

  • https://github.com/Significant-Gravitas/Auto-GPT/issues/4161

waynehamadi · May 13 '23 14:05

@Boostrix yeah, that sounds useful. We just have to be careful about inserting plausible mistakes.

waynehamadi · May 13 '23 14:05

@Boostrix could you create an issue for the 2 challenges you mentioned above and label it "challenge"?

waynehamadi · May 13 '23 14:05

Which ones exactly?

Coming up with a challenge where we mutate a URL should be a pretty straightforward way for the agent to fail, which it can still fix using URL validation/patching. The most obvious example would be adding whitespace into the URL without escaping it.

Boostrix · May 14 '23 07:05

All the ones you suggested; we don't have next steps for them. Can I just have the links to the issues where the challenge ideas are written, so I can put them in this epic?

waynehamadi · May 14 '23 12:05

New CI pipeline ready: now you can test challenges by creating a pull request. Currently working on this issue so you can SEE the results of the challenges: https://github.com/Significant-Gravitas/Auto-GPT/issues/4190

waynehamadi · May 14 '23 14:05

thanks @AndresCdo for https://github.com/Significant-Gravitas/Auto-GPT/pull/3868

waynehamadi · May 17 '23 11:05

thank you @PortlandKyGuy for your work! https://github.com/Significant-Gravitas/Auto-GPT/pull/4261

waynehamadi · May 31 '23 17:05

thank you @erik-megarad https://github.com/Significant-Gravitas/Auto-GPT/pull/4469

waynehamadi · May 31 '23 18:05

thank you @dschonholtz @gravelBridge https://github.com/Significant-Gravitas/Auto-GPT/pull/4286

waynehamadi · May 31 '23 18:05

thank you @javableu !! https://github.com/Significant-Gravitas/Auto-GPT/pull/4167

waynehamadi · Jun 09 '23 22:06

This issue has automatically been marked as stale because it has not had any activity in the last 50 days. You can unstale it by commenting or removing the label. Otherwise, this issue will be closed in 10 days.

github-actions[bot] · Sep 06 '23 20:09

This issue was closed automatically because it has been stale for 10 days with no activity.

github-actions[bot] · Sep 17 '23 01:09

Unless I am mistaken, this should not be closed or "staled" at all. I believe this remains relevant. Or has something changed over the course of the last couple of months that I'm missing entirely?

Boostrix · Oct 04 '23 17:10