Enhanced web browsing capabilities

Open Pwuts opened this issue 2 years ago • 5 comments

Background

Currently, the only browsing action available to Auto-GPT is view_webpage, which returns the body text of the webpage or a summarized version of it. This is quite limiting compared to a human's ability to actually browse a website.

Also, view_webpage doesn't distinguish very well between primary content and annoying popups, which clouds the LLM's perception of the page (partially addressed by #3519).

Proposal 🏗️

  • New command browse_website(url: str, task: str)
  • BrowserAgent that is attached to a browser window through a Selenium/Playwright instance
    • Executes the given task (e.g. "find the price of the book") and returns the result to the parent Agent
    • Specialized set of available actions/commands:
      • follow_link(url: str): asserts that the given link is on the page and navigates there
      • go_back(): goes to the previous page
      • make_screenshot()
      • report_task_result(result: str) for when the task has been completed
      • terminate(reason: str) for when the task can't be completed
      • input_text(input_field_id: str, text: str)
      • click_button(button_id: str)
      • ...
    • Tailored prompt
      1. AI config (personality)
      2. User-given task
      3. Browsing history
      4. Current page content (truncated / summary)
      5. Available actions
      6. Response format
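
As a rough illustration of the "Tailored prompt" item, here is a minimal sketch that assembles the six sections in the order listed above. Everything in it is an assumption made for illustration: BrowserPromptContext, build_prompt, the max_page_chars limit, the section headings and the JSON-like response format are hypothetical names, not existing Auto-GPT code.

```python
from dataclasses import dataclass, field


@dataclass
class BrowserPromptContext:
    """Inputs for the six prompt sections in the proposal (hypothetical names)."""
    ai_config: str
    task: str
    history: list[str] = field(default_factory=list)
    page_content: str = ""
    # Maps an action signature to a one-line description.
    actions: dict[str, str] = field(default_factory=dict)
    # Assumed response format; the proposal does not specify one.
    response_format: str = '{"action": "<name>", "args": {}}'


def build_prompt(ctx: BrowserPromptContext, max_page_chars: int = 2000) -> str:
    """Assemble the prompt sections in the order given in the proposal."""
    page = ctx.page_content
    if len(page) > max_page_chars:
        # Plain truncation; a summary could be substituted here instead.
        page = page[:max_page_chars] + " [truncated]"
    history = "\n".join(f"- {url}" for url in ctx.history) or "(none)"
    actions = "\n".join(f"- {sig}: {desc}" for sig, desc in ctx.actions.items())
    sections = [
        ("AI config", ctx.ai_config),
        ("Task", ctx.task),
        ("Browsing history", history),
        ("Current page content", page),
        ("Available actions", actions),
        ("Response format", ctx.response_format),
    ]
    return "\n\n".join(f"## {title}\n{body}" for title, body in sections)
```

Keeping the page content in a single section with a hard size cap makes it easy to swap truncation for summarization later without touching the rest of the prompt.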

Primary requirements

  • Must always report_task_result with a result that makes sense for the given task
  • Must recognize when it can't fulfill a task and terminate if so
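
One way to enforce both requirements is to make the agent loop itself guarantee a terminal outcome. The sketch below is only an assumption about how that could look: BrowserSession is an in-memory stand-in for a Selenium/Playwright-backed window, and Page, TaskOutcome, Policy, run_task and the max_steps budget are hypothetical names. The policy callable stands in for the LLM choosing the next action.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Page:
    text: str
    links: list[str] = field(default_factory=list)


@dataclass
class TaskOutcome:
    completed: bool
    detail: str  # the result when completed, otherwise the termination reason


class BrowserSession:
    """In-memory stand-in for a browser window (no real browser involved)."""

    def __init__(self, site: dict[str, Page], start_url: str):
        self.site = site
        self.history = [start_url]

    @property
    def page(self) -> Page:
        return self.site[self.history[-1]]

    def follow_link(self, url: str) -> None:
        # Only navigate to links that are actually present on the current page.
        if url not in self.page.links:
            raise ValueError(f"link not on page: {url}")
        self.history.append(url)

    def go_back(self) -> None:
        if len(self.history) > 1:
            self.history.pop()


# A policy maps (task, session) to an (action name, argument) pair.
Policy = Callable[[str, BrowserSession], tuple[str, str]]


def run_task(task: str, session: BrowserSession, policy: Policy,
             max_steps: int = 10) -> TaskOutcome:
    """Run until the policy reports a result or terminates; never return silently."""
    for _ in range(max_steps):
        action, arg = policy(task, session)
        if action == "report_task_result":
            return TaskOutcome(True, arg)
        if action == "terminate":
            return TaskOutcome(False, arg)
        try:
            if action == "follow_link":
                session.follow_link(arg)
            elif action == "go_back":
                session.go_back()
            else:
                return TaskOutcome(False, f"unknown action: {action}")
        except ValueError as exc:
            # Simplification: a fuller version might feed the error back to the model.
            return TaskOutcome(False, str(exc))
    # Out of steps: terminate explicitly instead of looping forever.
    return TaskOutcome(False, f"step budget of {max_steps} exhausted")
```

With this shape, the parent Agent always receives a TaskOutcome, whether the task succeeded, the BrowserAgent gave up, or the step budget ran out.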

Parts

  • [ ] #2454
  • [x] #3200
  • [ ] #3519

Related

  • #1981
  • #2443

Pwuts avatar Sep 08 '23 23:09 Pwuts

Not sure if this is still relevant, but the last time I ran the browse command extensively, I needed a way to use a custom proxy per request, so that I could spawn multiple requests in parallel without getting API restricted.

Boostrix avatar Oct 05 '23 21:10 Boostrix

Just to clarify, is Selenium functionality already part of the repo?

I searched for Selenium and saw references to it, but I wasn't entirely sure, since the ticket you posted regarding Web Nav mentioned Selenium/Playwright.

BaseInfinity avatar Oct 08 '23 00:10 BaseInfinity

Pwuts answered on Discord that Selenium is already built in with the view_webpage command.

@Boostrix do you mind going into more detail about why you need to spawn multiple requests in parallel, and why that would reduce the chance of getting API restricted? Also, what API restriction are you referring to?

Sorry for all the questions, just wanna understand =) Thanks in advance

BaseInfinity avatar Oct 08 '23 00:10 BaseInfinity

Depending on the use case, we may be limited by the server, e.g. when scraping.

Boostrix avatar Oct 12 '23 22:10 Boostrix

This issue has automatically been marked as stale because it has not had any activity in the last 50 days. You can unstale it by commenting or removing the label. Otherwise, this issue will be closed in 10 days.

github-actions[bot] avatar Feb 19 '24 01:02 github-actions[bot]