AutoGPT

Significant-Gravitas

False-belief challenge based on the Sally-Anne test.

Open javableu opened this issue 2 years ago • 19 comments

Background

Fixes issue #3871

Changes

Added a new community challenge.

Documentation

Added the documentation in docs/challenges/memory/challenge_d.md.

Test Plan

PR Quality Checklist

[ ] My pull request is atomic and focuses on a single change.
[ ] I have thoroughly tested my changes with multiple different prompts.
[ ] I have considered potential risks and mitigations for my changes.
[ ] I have documented my changes clearly and comprehensively.
[ ] I have not snuck in any "extra" small tweaks or changes.

May 13 '23 22:05 javableu

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
docs	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Jun 9, 2023 9:59pm

May 13 '23 22:05 vercel[bot]

This PR exceeds the recommended size of 200 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size

May 13 '23 22:05 github-actions[bot]

Deployment failed with the following error:

Resource is limited - try again in 28 minutes (more than 100, code: "api-deployments-free-per-day").

May 14 '23 08:05 vercel[bot]

@javableu nice, thanks! Don't hesitate to ping me next time; I will test it when I have time.

Jun 03 '23 13:06 waynehamadi

@javableu on it

Jun 09 '23 19:06 waynehamadi

@javableu this is a challenge that requires incredible attention to minutiae and detail. It tests the LLM a lot more than AutoGPT, but I do believe that at some point we will need autonomous agents to develop the skills required to beat this challenge.

Thank you very much.

Jun 09 '23 20:06 waynehamadi

You changed AutoGPT's behaviour. The cassettes have been updated and will be merged to the submodule when this Pull Request gets merged.

Jun 09 '23 22:06 Auto-GPT-Bot

Thanks @merwanehamadi for accepting this challenge. Even if the challenge as it stands might be a bit too complicated, I think we could reuse it as a sort of "Swiss Army knife" challenge: one story, many challenges to solve. For example, we could track the placement of the marbles, the chronology of events, the conversations the characters have had, what each individual has seen, their beliefs about other characters' beliefs, and so on.

I asked the chat to come up with some examples of questions/challenges. I believe most of them are relevant to us. Can you tell me which 2 or 3 challenges from the list below I should focus on?

Fact-checking and tracking: Track the true location of each marble at the end of the story. The challenge involves disregarding the misinformation given by the characters.

Timeline Reconstruction: Create a timeline of events, including when and where each marble was moved, and by whom.

Discrepancy Identification: Identify each instance where a character gives false information about the location of a marble.

Behavior Pattern Analysis: Analyze the behavior patterns of the characters. For example, does a certain character consistently lie about the location of the marbles?

Prediction: Predict the next series of events or actions based on the patterns established in the story.

Causality Analysis: Determine the cause and effect relationships in the story. How does each action lead to the next?

Lie Detection: Determine which character is most truthful and which is least based on their statements and actions.

Narrative Generation: Generate a continuation of the story based on the established patterns of character behavior.
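To make the first three items (fact tracking, timeline reconstruction, discrepancy identification) more concrete, here is a minimal sketch of how a reference answer could be computed from a structured event log. It is not the challenge's actual implementation: the `Move` and `Claim` types, the `replay` function, and every event in `EVENTS` are hypothetical and invented for illustration, not taken from the challenge story.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Move:
    step: int
    actor: str
    marble: str
    dest: str
    witnesses: frozenset  # characters in the room, besides the actor


@dataclass(frozen=True)
class Claim:
    step: int
    speaker: str
    marble: str
    stated: str  # location the speaker says the marble is in


# Hypothetical event log (assumption: NOT the real challenge story).
EVENTS = [
    Move(1, "Sally", "A", "basket", frozenset({"Anne"})),
    Move(2, "Anne", "A", "box", frozenset()),
    Claim(3, "Anne", "A", "basket"),
    Move(4, "Charlie", "D", "basket", frozenset()),
    Claim(5, "Charlie", "D", "basket"),
]


def replay(events):
    """Replay events in order; return (truth, beliefs, false_claims)."""
    truth = {}         # marble -> true current location
    beliefs = {}       # character -> {marble -> last location they saw}
    false_claims = []  # claims that contradict the truth when they are made
    for ev in sorted(events, key=lambda e: e.step):
        if isinstance(ev, Move):
            truth[ev.marble] = ev.dest
            # Only the actor and the witnesses update what they believe.
            for person in ev.witnesses | {ev.actor}:
                beliefs.setdefault(person, {})[ev.marble] = ev.dest
        elif truth.get(ev.marble) != ev.stated:
            false_claims.append(ev)
    return truth, beliefs, false_claims


truth, beliefs, false_claims = replay(EVENTS)
```

In this toy log, `truth` ends with marble A in the box while `beliefs["Sally"]` still has it in the basket, which is the classic false-belief gap; `false_claims` holds Anne's statement at step 3. The sorted event list itself doubles as the timeline.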

Who would get most of the marbles correct? From the information provided, it's hard to determine who would locate the most marbles correctly, because it depends on how each character reacts to the false information they're given and how they decide to act upon it. However, Anne and Charlie seem to have the most interaction with the marbles, so they might have a better chance of tracking them correctly.

Who is the most greedy and tries to keep things for themselves? From the story, it seems that Charlie might be the most inclined to keep things for himself, as he is seen putting marble D into his own basket.

Where is everyone at the end of the story? The last locations mentioned for each character are: Sally is outside, Anne leaves the room (presumably going outside), Charlie is in the room after re-entering, and Bob's last location is not specified after he places marble A in the blue box.

How many things did people see? This is a bit ambiguous. If you're referring to the number of actions involving the marbles each character has witnessed, it would require a detailed breakdown of each paragraph. Each character has seen a different number of actions based on when they were in the room.

What do people assume? It's hard to know for sure what each character assumes without more information, but each one likely assumes the information they're given about the marble locations is truthful, unless they know for sure it's false.

What do people know for sure without a doubt? Each character knows for sure the actions they themselves took with the marbles. However, due to the misinformation, they may not know the true current locations of all the marbles.

How many times did each person touch each marble? This would require a detailed count of each instance a character interacted with a marble, which can be done by parsing the text.

Who do they know is lying? Without additional context or dialogue, it's not clear if any character is aware that others are lying.

Which question should they ask to find every marble in the fewest questions? The most efficient question might be: "Where did you last see or place each marble?" Asked of each person, this would ideally give the current location of each marble, assuming they all tell the truth.
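Two of the answers above ("how many times did each person touch each marble" and "how many things did people see") are countable once the story is in a regular form. A rough sketch follows, assuming sentences shaped like "<Name> puts/places/moves/takes marble <X>". The `STORY` text and `LOG` entries are hypothetical examples, not the real challenge story, and `count_touches` and `count_witnessed` are made-up helper names.

```python
import re
from collections import Counter

# Hypothetical story text (assumption: NOT the real challenge story).
STORY = (
    "Sally puts marble A in the red box. "
    "Anne moves marble A to the blue box. "
    "Bob places marble A in the blue box. "
    "Charlie puts marble D into his own basket. "
    "Anne takes marble D."
)

# Matches "<Name> <verb> marble <Letter>" and captures name and marble.
TOUCH = re.compile(r"\b([A-Z][a-z]+) (?:puts|places|moves|takes) marble ([A-Z])\b")


def count_touches(text):
    """Count (character, marble) interactions found in the text."""
    return Counter(TOUCH.findall(text))


# Hypothetical room log: ("enter" | "leave" | "move", character).
LOG = [
    ("enter", "Sally"), ("enter", "Anne"),
    ("move", "Sally"),
    ("leave", "Sally"),
    ("move", "Anne"),
    ("enter", "Charlie"),
    ("move", "Charlie"),
]


def count_witnessed(log):
    """Count how many marble moves each character saw while in the room."""
    present = set()
    seen = Counter()
    for kind, who in log:
        if kind == "enter":
            present.add(who)
        elif kind == "leave":
            present.discard(who)
        elif kind == "move":
            # The actor always sees their own move.
            for person in present | {who}:
                seen[person] += 1
    return seen


touches = count_touches(STORY)
witnessed = count_witnessed(LOG)
```

A regex this simple only works if the story keeps a fixed sentence pattern; free-form text would need a real parser or the LLM itself, which is arguably the point of the challenge.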

Jun 12 '23 12:06 javableu