AutoGPT
Add reasoning challenge
Background
Proposed a new challenge: reasoning. Added temporal reasoning as challenge_a.
Changes
Added a new community challenge.
Documentation
Added docs/challenges/reasoning/challenge_a.md and docs/challenges/reasoning/introduction.md to document the new proposed challenge.
Test Plan
Comment out `@pytest.mark.skip("This challenge hasn't been beaten yet.")` and run:
`pytest -s tests/integration/challenges/reasoning/test_reasoning_challenge_a.py`
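For reference, a minimal sketch of how the skip marker typically sits on the challenge test. Only the marker and its reason string come from this test plan; the test body shown here is a placeholder, not the PR's actual code:

```python
# Sketch of tests/integration/challenges/reasoning/test_reasoning_challenge_a.py.
# Only the skip marker and its reason string are taken from the test plan above.
import pytest


# Comment this marker out (or delete it) to run the challenge locally:
@pytest.mark.skip("This challenge hasn't been beaten yet.")
def test_reasoning_challenge_a() -> None:
    # The real test drives the agent through the temporal reasoning task
    # and asserts on its final answer.
    ...
```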
PR Quality Checklist
- [X] My pull request is atomic and focuses on a single change.
- [ ] I have thoroughly tested my changes with multiple different prompts.
- [ ] I have considered potential risks and mitigations for my changes.
- [ ] I have documented my changes clearly and comprehensively.
- [ ] I have not snuck in any "extra" small tweaks or unrelated changes.
The latest updates on your projects. Learn more about Vercel for Git ↗︎
1 Ignored Deployment
| Name | Status | Preview | Comments | Updated (UTC) |
|---|---|---|---|---|
| docs | ⬜️ Ignored (Inspect) | Visit Preview | | May 14, 2023 0:24am |
Codecov Report
Patch and project coverage have no change.
Comparison is base (de6b8ee) 60.72% compared to head (15dd1e6) 60.72%.
Additional details and impacted files
Coverage Diff:

| | master | #3974 | +/- |
|---|---|---|---|
| Coverage | 60.72% | 60.72% | |
| Files | 73 | 73 | |
| Lines | 3320 | 3320 | |
| Branches | 545 | 545 | |
| Hits | 2016 | 2016 | |
| Misses | 1164 | 1164 | |
| Partials | 140 | 140 | |
| Impacted Files | Coverage Δ |
|---|---|
| autogpt/llm/llm_utils.py | 66.66% <0.00%> (ø) |
:umbrella: View full report in Codecov by Sentry.
This is good from a general LLM perspective and could be used as a benchmark. My concern is that it tests the LLM and not really AutoGPT. That might be fine, because it could measure the negative impact of our token overhead for running the agent vs GPT-4, but it doesn't really benchmark the agent. Ideally, we would have the agent traverse multiple files with different reasoning steps in them and be able to remember them.
I think this could actually be a good memory challenge 4?
Not only test the agent's ability to remember several phrases across files, but also its ability to reason across their contents?
Check out this code: https://github.com/Significant-Gravitas/Auto-GPT/blob/master/tests/integration/challenges/memory/test_memory_challenge_c.py
@merwanehamadi Let me know if you think I'm off base here.
It does start to blur the lines between info retrieval, memory, and reasoning, but I think that becomes inevitable.
Agreed with all your points. Can anyone viewing this agree that eventually we'll be working toward these types of challenges for AutoGPT? (i.e. agent traversal + previous-phrase recall + reasoning)
It is a good layered challenge. We're just not there yet; we need to build out the basic ones that specifically test the agent.
@minfenglu can you make it use multiple files? Then it's probably better, because right now you're literally testing only the LLM and Auto-GPT provides no value.
@merwanehamadi Sounds good, will make the changes.
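For illustration, a rough sketch of what a multi-file variant could look like, in the spirit of the existing memory challenges. The file contents, the task, and the `run_agent_on_workspace` helper are assumptions made for this sketch, not part of this PR:

```python
# Hypothetical multi-file temporal-reasoning challenge: each file holds one
# step of the chain, so the agent must read and remember all of them to answer.
import pytest


@pytest.mark.skip("This challenge hasn't been beaten yet.")
def test_reasoning_challenge_multi_file(tmp_path) -> None:
    steps = [
        "The meeting started at 2pm.",
        "Alice arrived 30 minutes after the meeting started.",
        "Bob arrived 15 minutes after Alice.",
    ]
    # Write each reasoning step into its own file in the workspace.
    for i, step in enumerate(steps, start=1):
        (tmp_path / f"step_{i}.txt").write_text(step)

    # run_agent_on_workspace is a placeholder for however the challenge
    # harness actually invokes the agent against a workspace directory.
    # answer = run_agent_on_workspace(tmp_path, task="What time did Bob arrive?")
    # assert "2:45" in answer
```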
This pull request has conflicts with the base branch, please resolve those so we can evaluate the pull request.
Conflicts have been resolved! 🎉 A maintainer will review the pull request shortly.
Deployment failed with the following error:
Resource is limited - try again in 4 minutes (more than 100, code: "api-deployments-free-per-day").
This pull request has conflicts with the base branch, please resolve those so we can evaluate the pull request.
@Significant-Gravitas/benchmarkers this PR may be interesting for the challenge it proposes.