Add Screenshot functionality to `browse_website`
### Background
Added screenshot options during a browser call, as I suggested in https://github.com/Significant-Gravitas/Auto-GPT/issues/2443.
### Changes
Added the functionality to take a screenshot on every browser call.
### Documentation
Added its own function, and added a call after the URL call in the browsing function.
### Test Plan
Tested it with Chrome inside my dev container many times on different websites, without any errors.
### PR Quality Checklist
- [x] My pull request is atomic and focuses on a single change.
- [x] I have thoroughly tested my changes with multiple different prompts.
- [x] I have considered potential risks and mitigations for my changes.
- [x] I have documented my changes clearly and comprehensively.
This pull request has conflicts with the base branch, please resolve those so we can evaluate the pull request.
Could you make this a general command that allows screenshots of the full desktop, or add an option to do so? The use case is to allow AutoGPT to use the PC (and any application installed) as a tool to accomplish its goals.
I have not opened this PR yet but here is what I'm working on.
1. We need the screenshot command to save a screenshot to a file, e.g. screenshot.png.
2. This file is given to Meta SAM for segmentation into icons, buttons, text, etc. The output from SAM has everything we need.
3. Ask ChatGPT: "Given the following segmentations of a PC desktop, where should I click to accomplish the {goal/task}? Reply in JSON format with the X,Y coordinates of where to double-click, or the text to enter via the keyboard."
4. Now we need mouse and keyboard commands. Use the mouse command to move the mouse to X,Y and double-click.
5. Use the keyboard command to enter the text.
6. REPEAT. But we need some prompt engineering on step 3 to make a general-purpose prompt that returns the mouse X,Y or the text to be entered.
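The prompt-and-parse step above could be sketched roughly as follows. The prompt wording follows the comment; the reply schema and function names are assumptions, not a settled design:

```python
# Hedged sketch of the prompt/parse step: build the prompt from SAM
# segmentations, then turn the model's JSON reply into a concrete
# mouse or keyboard action. Schema and names are illustrative only.
import json


def build_prompt(segmentations, goal):
    """segmentations: a JSON-serializable description of the desktop."""
    return (
        "Given the following segmentations of a PC desktop, where should "
        f"I click to accomplish: {goal}? Reply in JSON format with the "
        "X,Y coordinates of where to double-click, or the text to enter "
        "via the keyboard.\n" + json.dumps(segmentations)
    )


def parse_action(reply):
    """Map the model's JSON reply to a (kind, payload) action for the
    mouse/keyboard commands."""
    data = json.loads(reply)
    if "x" in data and "y" in data:
        return ("double_click", (int(data["x"]), int(data["y"])))
    if "text" in data:
        return ("type_text", data["text"])
    raise ValueError("reply contained neither coordinates nor text")
```

The loop would then dispatch `double_click` actions to the mouse command and `type_text` actions to the keyboard command, take a fresh screenshot, and repeat.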
Once these commands are available I expect autoGPT to use the screenshot, mouse, and keyboard commands to interact with the desktop and use every app available to accomplish the goals.
ChatGPT already knows how to use every application there is - we just need to give it access to the desktop, mouse, and keyboard. AutoGPT is capable of this now with the addition of a few commands and one prompt.
Segmenting the desktop screenshot will produce app icon locations with a bounding box. If an application is open on the desktop the SAM output would naturally include a description of that content as well.
Given the looping nature of autoGPT it should continue using the PC tools to accomplish goals.
I had the same idea, but what if it could be done with OpenCV for desktop object detection in real time, without the use of Meta SAM? Along with that, we could use PyAutoGUI, which lets Python scripts control the mouse and keyboard to automate interactions with other applications. GPT-4 is able to write reliable code for interacting with PyAutoGUI and OpenCV.
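A rough sketch of that OpenCV + PyAutoGUI approach: locate a known UI element in a screenshot by template matching, then double-click its centre. `cv2` and `pyautogui` are third-party packages, and the helper names here are hypothetical:

```python
# Hedged sketch of the OpenCV + PyAutoGUI idea. click_template and
# bbox_center are illustrative names, not an existing AutoGPT command.

def bbox_center(x, y, w, h):
    """Centre point of a bounding box - where the click should land."""
    return x + w // 2, y + h // 2


def click_template(screen_path, template_path, threshold=0.8):
    import cv2          # pip install opencv-python
    import pyautogui    # pip install pyautogui

    screen = cv2.imread(screen_path, cv2.IMREAD_GRAYSCALE)
    template = cv2.imread(template_path, cv2.IMREAD_GRAYSCALE)
    result = cv2.matchTemplate(screen, template, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(result)
    if max_val < threshold:
        return False  # element not found on screen
    h, w = template.shape
    cx, cy = bbox_center(max_loc[0], max_loc[1], w, h)
    pyautogui.doubleClick(cx, cy)
    return True
```

Template matching only finds elements you already have an image of, so it is narrower than SAM-style segmentation, but it runs in real time on CPU.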
AutoGPT with this functionality will be able to achieve literally anything 👌
> OpenCV for desktop object detection in real-time
I don't think it matters where the source of 'desktop info' comes from - anything that can produce a description of the screen in json format I would expect to work. As long as it contains a list of icons, apps, menus, text and their coordinates or bounding boxes. Any decent LLM should be able to take this json + prompt and produce a json containing our list of keyboard/mouse actions.
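To make that concrete, here is one possible shape for a source-agnostic screen description and a check that a backend's output carries the fields the prompt needs. The schema is an assumption for illustration, not a spec:

```python
# Hedged sketch: any backend (SAM, OpenCV, a screen reader) just has to
# emit JSON like EXAMPLE_SCREEN. Field names are illustrative only.
import json

EXAMPLE_SCREEN = json.dumps({
    "elements": [
        {"kind": "icon", "label": "Firefox", "bbox": [12, 40, 64, 64]},
        {"kind": "menu", "label": "File", "bbox": [0, 0, 40, 20]},
        {"kind": "text", "label": "Untitled - Notepad", "bbox": [50, 0, 200, 20]},
    ]
})


def validate_screen_description(raw):
    """Check the output has what the LLM prompt needs: a list of
    elements, each with a kind, a label, and a 4-value bounding box."""
    data = json.loads(raw)
    for el in data.get("elements", []):
        if not {"kind", "label", "bbox"} <= el.keys():
            return False
        if len(el["bbox"]) != 4:
            return False
    return True
```

Any backend that passes this check could be swapped in without touching the prompting or the mouse/keyboard side.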
I was looking into screen readers like NVDA
This is a mass message from the AutoGPT core team. Our apologies for the ongoing delay in processing PRs. This is because we are re-architecting the AutoGPT core!
For more details (and for info on joining our Discord), please refer to: https://github.com/Significant-Gravitas/Auto-GPT/wiki/Architecting
This could be very useful for debugging purposes
Conflicts have been resolved! 🎉 A maintainer will review the pull request shortly.
Deploy Preview for auto-gpt-docs ready!
| Name | Link |
|---|---|
| Latest commit | d74f5c87644d6a1d6c7a9abe92405dc11153d600 |
| Latest deploy log | https://app.netlify.com/sites/auto-gpt-docs/deploys/64a841be4686bd0008a2d8ce |
| Deploy Preview | https://deploy-preview-2454--auto-gpt-docs.netlify.app/ |
This pull request has conflicts with the base branch, please resolve those so we can evaluate the pull request.
Commenting for personal reference, may pick up this task and work to resolve this PR when I have time.