Add Screenshot functionality to `browse_website`

Open horazius opened this issue 3 years ago • 10 comments

Background

Added the screenshot option for browser calls that I suggested in https://github.com/Significant-Gravitas/Auto-GPT/issues/2443

Changes

Added functionality to take a screenshot on every browser call.

Documentation

Added a dedicated screenshot function and a call to it after the URL call in the browsing function.
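For readers who want a concrete picture, here is a minimal sketch of what such a helper could look like. It is not the code from this PR: the names `screenshot_path` and `take_screenshot`, the `screenshots` output directory, and the file-naming scheme are assumptions made for illustration. The only thing it relies on from Selenium is that the driver exposes `save_screenshot(filename)`.

```python
import re
import time
from pathlib import Path


def screenshot_path(url: str, out_dir: str = "screenshots") -> Path:
    """Build a filesystem-safe PNG path from the URL and the current time."""
    # Collapse every run of non-alphanumeric characters into one underscore.
    slug = re.sub(r"[^A-Za-z0-9]+", "_", url).strip("_")[:80]
    return Path(out_dir) / f"{slug}_{int(time.time())}.png"


def take_screenshot(driver, url: str, out_dir: str = "screenshots") -> Path:
    """Save a screenshot of the page currently loaded in `driver`.

    `driver` is assumed to be a Selenium WebDriver, or any object that
    exposes `save_screenshot(filename)` and returns a truthy value on success.
    """
    path = screenshot_path(url, out_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    if not driver.save_screenshot(str(path)):
        raise RuntimeError(f"could not save screenshot to {path}")
    return path
```

In a browsing function this would be called right after `driver.get(url)`, for example `take_screenshot(driver, url)`.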

Test Plan

Tested it with Chrome inside my dev container many times on different websites, without any errors.

PR Quality Checklist

  • [x] My pull request is atomic and focuses on a single change.
  • [x] I have thoroughly tested my changes with multiple different prompts.
  • [x] I have considered potential risks and mitigations for my changes.
  • [x] I have documented my changes clearly and comprehensively.

horazius avatar Apr 18 '23 22:04 horazius

This pull request has conflicts with the base branch, please resolve those so we can evaluate the pull request.

github-actions[bot] avatar Apr 18 '23 23:04 github-actions[bot]

Could you make this a general command that allows screenshot of the full desktop or an option to do so? The use-case is to allow autoGPT to use the PC (and any application installed) as a tool to accomplish its goals.

I have not opened this PR yet but here is what I'm working on.

  1. We need the screenshot command to save a screenshot to a file. ex: screenshot.png
  2. This file is given to Meta SAM for segmentation into icons, buttons, text, etc. The output from SAM has everything we need.
  3. Ask chatGPT: "Given the following segmentations of a PC desktop where should I click to accomplish the {goal/task}. Reply in json format with the X,Y coordinates of where to double-click or the text to enter via the keyboard"
  4. Now we need mouse and keyboard commands. Use the mouse command to move the mouse to X,Y and double-click.
  5. Use the keyboard command to enter the Text.
  6. REPEAT - But we need some prompt engineering on 3 to make a general purpose prompt that returns the mouse X,Y or Text to be entered.
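Step 3 could be sketched roughly as follows. This is only an illustration of the idea: the prompt wording, the reply schema (`"action"`, `"x"`, `"y"`, `"text"`), and the function names `build_prompt` and `parse_action` are all assumptions, not an existing AutoGPT API, and the segment list is a placeholder for whatever the segmentation step actually returns.

```python
import json

# Hypothetical prompt; the reply schema below is an assumption for this sketch.
PROMPT_TEMPLATE = (
    "Given the following segmentations of a PC desktop, where should I click "
    "to accomplish the task: {task}? Reply in JSON with either "
    '{{"action": "double_click", "x": <int>, "y": <int>}} or '
    '{{"action": "type", "text": "<string>"}}.\n\nSegments:\n{segments}'
)


def build_prompt(task: str, segments: list) -> str:
    """Fill the general-purpose prompt with the task and the segment list."""
    return PROMPT_TEMPLATE.format(task=task, segments=json.dumps(segments))


def parse_action(reply: str) -> dict:
    """Validate the model's JSON reply and normalise it to a small dict."""
    data = json.loads(reply)
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    kind = data.get("action")
    if kind == "double_click":
        return {"action": kind, "x": int(data["x"]), "y": int(data["y"])}
    if kind == "type":
        return {"action": kind, "text": str(data["text"])}
    raise ValueError(f"unsupported action: {kind!r}")
```

Validating the reply before acting on it matters here, since the result drives real mouse and keyboard input.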

Once these commands are available I expect autoGPT to use the screenshot, mouse, and keyboard commands to interact with the desktop and use every app available to accomplish the goals.

ChatGPT already knows how to use every application there is - we just need to give it access to the desktop, mouse, and keyboard. AutoGPT is capable of this now with the addition of a few commands and one prompt.

Segmenting the desktop screenshot will produce app icon locations with a bounding box. If an application is open on the desktop the SAM output would naturally include a description of that content as well.

Given the looping nature of autoGPT it should continue using the PC tools to accomplish goals.

Tonylib avatar Apr 19 '23 04:04 Tonylib

I had the same idea, but what if it were done with OpenCV for real-time desktop object detection, without Meta SAM? Along with that, we could use PyAutoGUI, which lets Python scripts control the mouse and keyboard to automate interactions with other applications. GPT-4 is able to write reliable code for interacting with PyAutoGUI and OpenCV.
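A rough sketch of the PyAutoGUI side, under stated assumptions: the action dict shape (`{"action": "double_click", "x": ..., "y": ...}` or `{"action": "type", "text": ...}`) and the function names are made up for illustration. The PyAutoGUI calls used are `screenshot`, `doubleClick`, and `write`. The `gui` parameter exists only so the logic can be exercised without a display; by default the real `pyautogui` module is imported.

```python
def _default_gui():
    # Imported lazily: pyautogui is a third-party package and needs a display.
    import pyautogui
    return pyautogui


def capture_desktop(path: str = "screenshot.png", gui=None) -> str:
    """Save a full-desktop screenshot to `path` and return the path."""
    gui = gui or _default_gui()
    gui.screenshot(path)
    return path


def run_action(action: dict, gui=None) -> None:
    """Perform one mouse or keyboard action described by `action`.

    The dict layout is an assumption for this sketch, not an AutoGPT format.
    """
    gui = gui or _default_gui()
    if action["action"] == "double_click":
        gui.doubleClick(x=action["x"], y=action["y"])
    elif action["action"] == "type":
        # A small delay between keystrokes helps slower applications keep up.
        gui.write(action["text"], interval=0.05)
    else:
        raise ValueError(f"unsupported action: {action['action']!r}")
```

Usage would look like `capture_desktop()` followed by `run_action({"action": "double_click", "x": 120, "y": 340})`.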

AutoGPT with this functionality will be able to achieve literally anything 👌

Reno-Codes avatar Apr 19 '23 20:04 Reno-Codes

> OpenCV for desktop object detection in real-time

I don't think it matters where the 'desktop info' comes from - anything that can produce a description of the screen in JSON format should work, as long as it contains a list of icons, apps, menus, and text with their coordinates or bounding boxes. Any decent LLM should be able to take this JSON + prompt and produce JSON containing our list of keyboard/mouse actions.
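To make the idea of a source-independent screen description concrete, here is one possible shape for it. The `ScreenElement` fields and the `describe_screen` helper are hypothetical; any detector (SAM, OpenCV, a screen reader) would need its own adapter to fill them in.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ScreenElement:
    """One detected item on screen; the field names are illustrative only."""
    kind: str    # e.g. "icon", "menu", "text"
    label: str   # human-readable name or recognised text
    left: int    # bounding box, in screen pixels
    top: int
    width: int
    height: int

    def center(self) -> tuple:
        """Point to click: the middle of the bounding box."""
        return (self.left + self.width // 2, self.top + self.height // 2)


def describe_screen(elements: list) -> str:
    """Serialise the detected elements to the JSON handed to the LLM."""
    return json.dumps({"elements": [asdict(e) for e in elements]}, indent=2)
```

For example, `describe_screen([ScreenElement("icon", "Firefox", 10, 20, 40, 60)])` yields a JSON object with a single entry under `"elements"`.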

I was looking into screen readers like NVDA

Tonylib avatar Apr 21 '23 05:04 Tonylib

This is a mass message from the AutoGPT core team. Our apologies for the ongoing delay in processing PRs. This is because we are re-architecting the AutoGPT core!

For more details (and for info on joining our Discord), please refer to: https://github.com/Significant-Gravitas/Auto-GPT/wiki/Architecting

p-i- avatar May 05 '23 00:05 p-i-

This could be very useful for debugging purposes

Pwuts avatar Jun 14 '23 23:06 Pwuts

Conflicts have been resolved! 🎉 A maintainer will review the pull request shortly.

github-actions[bot] avatar Jul 07 '23 16:07 github-actions[bot]

Deploy Preview for auto-gpt-docs ready!

Name Link
Latest commit d74f5c87644d6a1d6c7a9abe92405dc11153d600
Latest deploy log https://app.netlify.com/sites/auto-gpt-docs/deploys/64a841be4686bd0008a2d8ce
Deploy Preview https://deploy-preview-2454--auto-gpt-docs.netlify.app/

netlify[bot] avatar Jul 07 '23 16:07 netlify[bot]

This pull request has conflicts with the base branch, please resolve those so we can evaluate the pull request.

github-actions[bot] avatar Aug 01 '23 18:08 github-actions[bot]

Commenting for personal reference, may pick up this task and work to resolve this PR when I have time.

LHamnett avatar Mar 07 '24 15:03 LHamnett