AutoGPT
Add reasoning challenge
Background
Proposed a new challenge: reasoning. Added temporal reasoning as challenge_a.
Changes
Added a new community challenge.
Documentation
Added docs/challenges/reasoning/challenge_a.md and docs/challenges/reasoning/introduction.md to document the new proposed challenge.
Test Plan
Comment out `@pytest.mark.skip("This challenge hasn't been beaten yet.")` and run:
`pytest -s tests/integration/challenges/reasoning/test_reasoning_challenge_a.py`
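For reference, a minimal sketch of how the skip marker typically sits on the challenge test. Only the marker and its reason string come from this test plan; the test body shown here is a placeholder, not the PR's actual code:

```python
# Sketch of tests/integration/challenges/reasoning/test_reasoning_challenge_a.py.
# Only the skip marker and its reason string are taken from the test plan above.
import pytest


# Comment this marker out (or delete it) to run the challenge locally:
@pytest.mark.skip("This challenge hasn't been beaten yet.")
def test_reasoning_challenge_a() -> None:
    # The real test drives the agent through the temporal reasoning task
    # and asserts on its final answer.
    ...
```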
PR Quality Checklist
- [X] My pull request is atomic and focuses on a single change.
- [ ] I have thoroughly tested my changes with multiple different prompts.
- [ ] I have considered potential risks and mitigations for my changes.
- [ ] I have documented my changes clearly and comprehensively.
- [ ] I have not snuck in any "extra" small tweaks or unrelated changes.
The latest updates on your projects. Learn more about Vercel for Git ↗︎
1 Ignored Deployment
| Name | Status | Preview | Comments | Updated (UTC) |
|---|---|---|---|---|
| docs | ⬜️ Ignored (Inspect) | Visit Preview | | May 14, 2023 0:24am |
Codecov Report
Patch and project coverage have no change.
Comparison is base (de6b8ee) 60.72% compared to head (15dd1e6) 60.72%.
Additional details and impacted files
Coverage Diff:

| | master | #3974 | +/- |
|---|---|---|---|
| Coverage | 60.72% | 60.72% | |
| Files | 73 | 73 | |
| Lines | 3320 | 3320 | |
| Branches | 545 | 545 | |
| Hits | 2016 | 2016 | |
| Misses | 1164 | 1164 | |
| Partials | 140 | 140 | |
| Impacted Files | Coverage Δ |
|---|---|
| autogpt/llm/llm_utils.py | 66.66% <0.00%> (ø) |
:umbrella: View full report in Codecov by Sentry.
This is good from a general LLM perspective and could be used as a benchmark. My concern is that it tests the LLM and not really AutoGPT. That might be fine, because it could measure the negative impact of our token overhead for running the agent vs GPT-4, but it doesn't really benchmark the agent. Ideally, we would have the agent traverse multiple files with different reasoning steps in them and be able to remember them.
I think this could actually be a good memory challenge 4?
Not only test the agent's ability to remember several phrases across files, but also its ability to reason across their contents?
Check out this code: https://github.com/Significant-Gravitas/Auto-GPT/blob/master/tests/integration/challenges/memory/test_memory_challenge_c.py
@merwanehamadi Let me know if you think I'm off base here.
It does start to blur the lines between info retrieval, memory, and reasoning, but I think that becomes inevitable.
Agreed with all your points. Can anyone viewing this agree that eventually we'll be working toward these types of challenges for AutoGPT? (i.e. agent traversal + previous-phrase recall + reasoning)
It is a good layered challenge. We're just not there yet; we need to build out the basic ones that specifically test the agent.
@minfenglu can you make it use multiple files? Then it's probably better, because right now you're literally testing only the LLM and Auto-GPT provides no value.
@merwanehamadi Sounds good, will make the changes.
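For illustration, a rough sketch of what a multi-file variant could look like, in the spirit of the existing memory challenges. The file contents, the task, and the `run_agent_on_workspace` helper are assumptions made for this sketch, not part of this PR:

```python
# Hypothetical multi-file temporal-reasoning challenge: each file holds one
# step of the chain, so the agent must read and remember all of them to answer.
import pytest


@pytest.mark.skip("This challenge hasn't been beaten yet.")
def test_reasoning_challenge_multi_file(tmp_path) -> None:
    steps = [
        "The meeting started at 2pm.",
        "Alice arrived 30 minutes after the meeting started.",
        "Bob arrived 15 minutes after Alice.",
    ]
    # Write each reasoning step into its own file in the workspace.
    for i, step in enumerate(steps, start=1):
        (tmp_path / f"step_{i}.txt").write_text(step)

    # run_agent_on_workspace is a placeholder for however the challenge
    # harness actually invokes the agent against a workspace directory.
    # answer = run_agent_on_workspace(tmp_path, task="What time did Bob arrive?")
    # assert "2:45" in answer
```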
This pull request has conflicts with the base branch, please resolve those so we can evaluate the pull request.
Conflicts have been resolved! 🎉 A maintainer will review the pull request shortly.
Deployment failed with the following error:
Resource is limited - try again in 4 minutes (more than 100, code: "api-deployments-free-per-day").
This pull request has conflicts with the base branch, please resolve those so we can evaluate the pull request.
@Significant-Gravitas/benchmarkers this PR may be interesting for the challenge it proposes.