AutoGPT
Enhanced web browsing capabilities
Background
Currently, all Auto-GPT can do is `view_webpage`, which returns the body text of the webpage, or a summarized version thereof. This is quite limiting compared to a human's ability to actually browse a website.
Also, it doesn't distinguish between primary content and annoying popups very well, clouding the LLM's perception of the page (partially addressed by #3519).
Proposal 🏗️
- New command `browse_website(url: str, task: str)`
- `BrowserAgent` that is attached to a browser window through a Selenium/Playwright instance
  - Executes the given task (e.g. "find the price of the book") and returns the result to the parent Agent
  - Specialized set of available actions/commands:
    - `follow_link(url: str)`: asserts that the given link is on the page and navigates there
    - `go_back()`: goes to the previous page
    - `make_screenshot()`
    - `report_task_result(result: str)` for when the task has been completed
    - `terminate(reason: str)` for when the task can't be completed
    - `input_text(input_field_id: str, text: str)`
    - `click_button(button_id: str)`
    - ...
  - Tailored prompt, containing:
    - AI config (personality)
    - User-given task
    - Browsing history
    - Current page content (truncated / summary)
    - Available actions
    - Response format
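To make the action set above concrete, here is a hedged sketch of how the `BrowserAgent` might validate the LLM's reply before dispatching it. The JSON response format (`{"action": ..., "args": {...}}`) is an assumption for illustration; only the action names come from the proposal.

```python
import json

# The action names below mirror the proposal; the JSON envelope is an
# assumed response format, not something specified in this issue.
AVAILABLE_ACTIONS = {
    "follow_link", "go_back", "make_screenshot",
    "report_task_result", "terminate", "input_text", "click_button",
}

def parse_response(raw: str) -> tuple[str, dict]:
    """Parse and validate one LLM reply into (action, args)."""
    reply = json.loads(raw)
    action = reply["action"]
    if action not in AVAILABLE_ACTIONS:
        raise ValueError(f"unknown action: {action}")
    return action, reply.get("args", {})

# Example: a completed task would come back as
# parse_response('{"action": "report_task_result", "args": {"result": "$12.99"}}')
```

Validating against a fixed action set also gives the parent Agent a clean place to reject hallucinated commands.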
Primary requirements
- Must always `report_task_result` with a result that makes sense for the given task
- Must recognize when it can't fulfill a task and `terminate` if so
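Both requirements can be enforced mechanically in the sub-agent's loop. A minimal sketch, assuming a step budget and a `policy` callable that stands in for the LLM (both are illustrative assumptions, not part of the proposal):

```python
# Enforce that a browsing task always ends in report_task_result or
# terminate: cap the number of actions, and treat budget exhaustion as
# an implicit terminate. `policy` is a stand-in for the LLM.
def run_browser_task(policy, max_steps: int = 20):
    for _ in range(max_steps):
        action, arg = policy()
        if action == "report_task_result":
            return ("done", arg)       # requirement 1: report a usable result
        if action == "terminate":
            return ("failed", arg)     # requirement 2: give up explicitly
        # any other action is navigation/interaction; keep looping
    return ("failed", "step budget exhausted")
```

The budget matters: without it, an LLM that keeps clicking around would never satisfy either requirement.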
Parts
- [ ] #2454
- [x] #3200
- [ ] #3519
Related
- #1981
- #2443
Not sure if this is still relevant, but the last time I ran the browse command extensively, I needed a way to use a custom proxy per request so I could spawn multiple requests in parallel without getting rate-limited.
Just to clarify: is Selenium functionality already part of the repo?
I searched for Selenium and saw references to it, but I wasn't entirely sure, since the ticket you posted regarding Web Nav mentioned Selenium/Playwright.
Pwuts answered on Discord that Selenium is already built in with the `view_webpage` command.
@Boostrix do you mind going into more detail why you need to spawn multiple requests in parallel and why that would reduce getting API restricted? Also what API restriction are you referring to?
Sorry for all the questions, just wanna understand =) Thanks in advance
Depending on the use case, we may be rate-limited by the server, e.g. when scraping.
This issue has automatically been marked as stale because it has not had any activity in the last 50 days. You can unstale it by commenting or removing the label. Otherwise, this issue will be closed in 10 days.