
New Challenges: Information Retrieval for SpaceX, Anthropic, AutoGPT, Milvus

Open rihp opened this issue 2 years ago • 11 comments

Background

This is an information retrieval challenge to get AutoGPT to find new information that the LLM does not already know.

Changes

Only the files for the challenge.

Documentation

Test Plan

pytest tests/integration/challenges/information_retrieval/test_information_retrieval_challenge_b.py -v

PR Quality Checklist

  • [x] My pull request is atomic and focuses on a single change.
  • [ ] I have thoroughly tested my changes with multiple different prompts.
  • [x] I have considered potential risks and mitigations for my changes.
  • [x] I have documented my changes clearly and comprehensively.
  • [x] I have not snuck in any "extra" small tweaks or changes

rihp avatar May 16 '23 11:05 rihp

Deployment failed with the following error:

Resource is limited - try again in 43 minutes (more than 100, code: "api-deployments-free-per-day").

vercel[bot] avatar May 16 '23 11:05 vercel[bot]

Related to #4244

rihp avatar May 16 '23 11:05 rihp

Codecov Report

Patch coverage has no change and project coverage changes by +0.74% :tada:

Comparison is base (0839a16) 62.18% compared to head (af17010) 62.92%.

:exclamation: Current head af17010 differs from pull request most recent head c96a234. Consider uploading reports for the commit c96a234 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4245      +/-   ##
==========================================
+ Coverage   62.18%   62.92%   +0.74%     
==========================================
  Files          73       73              
  Lines        3345     3345              
  Branches      484      484              
==========================================
+ Hits         2080     2105      +25     
+ Misses       1118     1093      -25     
  Partials      147      147              

see 5 files with indirect coverage changes

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

codecov[bot] avatar May 16 '23 11:05 codecov[bot]

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
| Name | Status | Preview | Comments | Updated (UTC) |
| ---- | ------ | ------- | -------- | ------------- |
| docs | ⬜️ Ignored (Inspect) | Visit Preview | | May 17, 2023 11:18pm |

vercel[bot] avatar May 16 '23 16:05 vercel[bot]

This challenge passes locally

rihp avatar May 16 '23 16:05 rihp

This PR exceeds the recommended size of 200 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size

github-actions[bot] avatar May 16 '23 18:05 github-actions[bot]

This PR exceeds the recommended size of 200 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size

github-actions[bot] avatar May 16 '23 19:05 github-actions[bot]

> This challenge passes locally

> an autonomous agent that specializes in researching and saving files as output.txt

This is all in agent_factory, and I am not fully sure, but my intuition is that it has something to do with recent changes in the memory handlers: the read_file and write_file implementations differ slightly, and it looks like the agent wants to save to output.txt but reads from arxiv_paper.txt instead.

2023-05-16T20:28:22.0290790Z
Save the name of the main author to output.txt
"name": "read_file",
"args": {
    "filename": "arxiv_paper.txt"
}
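The mismatch in that log can be reproduced in miniature. The sketch below is hypothetical (the function names and dispatcher are illustrative, not AutoGPT's actual command registry): it executes one JSON command of the shape the agent emitted and shows that a `read_file` call leaves `output.txt` unwritten, which is exactly the failure mode described above.

```python
import json
import tempfile
from pathlib import Path


# Hypothetical read/write command pair mirroring the ones discussed
# above; the real AutoGPT command signatures differ.
def read_file(filename: str, workspace: Path) -> str:
    return (workspace / filename).read_text()


def write_file(filename: str, text: str, workspace: Path) -> str:
    (workspace / filename).write_text(text)
    return f"wrote {filename}"


def dispatch(raw: str, workspace: Path) -> str:
    """Execute one JSON command of the shape shown in the log above."""
    cmd = json.loads(raw)
    if cmd["name"] == "read_file":
        return read_file(cmd["args"]["filename"], workspace)
    if cmd["name"] == "write_file":
        return write_file(cmd["args"]["filename"], cmd["args"]["text"], workspace)
    raise ValueError(f"unknown command: {cmd['name']}")


with tempfile.TemporaryDirectory() as tmp:
    ws = Path(tmp)
    (ws / "arxiv_paper.txt").write_text("Main author: Jane Doe")
    # The command the agent actually emitted, even though the task was
    # "Save the name of the main author to output.txt":
    result = dispatch(
        '{"name": "read_file", "args": {"filename": "arxiv_paper.txt"}}', ws)
    assert "Jane Doe" in result              # the agent read the paper...
    assert not (ws / "output.txt").exists()  # ...but never wrote output.txt
```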

piotrmasior avatar May 16 '23 22:05 piotrmasior

This PR exceeds the recommended size of 200 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size

github-actions[bot] avatar May 17 '23 10:05 github-actions[bot]

Updated the challenges and marked the most difficult ones, which aren't passing consistently, to be skipped

rihp avatar May 17 '23 10:05 rihp

Updated to include only the two tests that pass more consistently

rihp avatar May 17 '23 17:05 rihp

> Updated to include only the two tests that pass more consistently

at this point my thoughts are only, wtf is with the test infrastructure

piotrmasior avatar May 17 '23 20:05 piotrmasior

> Updated to include only the two tests that pass more consistently

> at this point my thoughts are only, wtf is with the test infrastructure

sure! what questions do you have more specifically?

rihp avatar May 17 '23 20:05 rihp

> Updated to include only the two tests that pass more consistently

> at this point my thoughts are only, wtf is with the test infrastructure

> sure! what questions do you have more specifically?

Hey, I'm glad you asked. First, I see that you're struggling with a problem related to files, which was the main focus of my previous question. I believe it might be tied to the functioning of the test infrastructure, but I lack complete insights here.

However, since we're already in a discussion, I'd like to present my second point as both an opinion and a question. Why do integration tests validate that AutoGPT works when it's written in such a static way, still relying on bypasses and cycles? From what I can see in the code, you're manipulating the number of these cycles in hopes of merely passing the test for its own sake. But doesn't this approach fall short when tested on a larger scale? It seems you'd need to infinitely increase these cycles when creating more tests. Isn't this counterproductive?

piotrmasior avatar May 17 '23 21:05 piotrmasior

This pull request has conflicts with the base branch, please resolve those so we can evaluate the pull request.

github-actions[bot] avatar May 25 '23 18:05 github-actions[bot]

> Updated to include only the two tests that pass more consistently

> at this point my thoughts are only, wtf is with the test infrastructure

> sure! what questions do you have more specifically?

> Hey, I'm glad you asked. First, I see that you're struggling with a problem related to files, which was the main focus of my previous question. I believe it might be tied to the functioning of the test infrastructure, but I lack complete insights here.
>
> However, since we're already in a discussion, I'd like to present my second point as both an opinion and a question. Why do integration tests validate that AutoGPT works when it's written in such a static way, still relying on bypasses and cycles? From what I can see in the code, you're manipulating the number of these cycles in hopes of merely passing the test for its own sake. But doesn't this approach fall short when tested on a larger scale? It seems you'd need to infinitely increase these cycles when creating more tests. Isn't this counterproductive?

@piotrmasior join us on Discord so we can talk about it: https://discord.gg/autogpt

> still relying on bypasses and cycles

Yeah, I don't like this bypass. We will change it to a budget OR define the number of cycles when initializing the agent, to really make it a PURE end-to-end test without mocks in the middle that rely on the underlying implementation.

> infinitely increase these cycles

Right now each test can be done in fewer than 5 cycles on average, so for now we don't have this problem, but soon we will.
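The budget idea could look something like this toy sketch (hypothetical names, not AutoGPT's actual agent API): the interaction loop is bounded by a cycle budget fixed at construction time, so a test becomes a plain end-to-end run with no mocks patched into the middle.

```python
class BudgetedAgent:
    """Toy agent whose loop is bounded by a cycle budget fixed at
    initialization (hypothetical; not AutoGPT's actual API)."""

    def __init__(self, cycle_budget: int):
        self.cycle_budget = cycle_budget
        self.cycles_used = 0
        self.done = False

    def step(self) -> None:
        # Stand-in for one think/act cycle of the real agent.
        self.cycles_used += 1
        if self.cycles_used >= 3:  # pretend the task finishes on cycle 3
            self.done = True

    def run(self) -> bool:
        """Run until the task completes or the budget is exhausted;
        returns True if the task finished within budget."""
        while not self.done and self.cycles_used < self.cycle_budget:
            self.step()
        return self.done


# A test is then just a run with a budget, not a mocked cycle count:
assert BudgetedAgent(cycle_budget=5).run() is True   # finishes in 3 of 5
assert BudgetedAgent(cycle_budget=2).run() is False  # budget exhausted
```

The budget caps the cost of a runaway agent without the test having to know anything about the loop's internals, which is the property the mocks currently break.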

waynehamadi avatar Jun 12 '23 16:06 waynehamadi