✨ Refined Vision Prompt: Integration of Keyboard Shortcuts over Search Function
This PR proposes a methodological enhancement to the `VISION_PROMPT` framework: a `PRESS` action as a more efficient alternative to the existing `SEARCH` function. The update refines the interaction model to better align with intuitive, real-world navigation methods. The proposed changes offer several key benefits:
- **💻 Enhanced Shortcut Utilization:** The `PRESS` action can be a single key press or a combination of key presses, applicable in any active window the `operator` engages with. This approach streamlines navigation across different operating systems by using well-known shortcuts and offers a more precise and efficient means of interaction:

  ```python
  VISION_PROMPT = """
  ...
  1. PRESS - Recommend keyboard shortcuts for efficient navigation and interaction.
  Response Format: PRESS "{{key combination}}", detailing the specific action's utility.
  ...
  Example Response:
  For tasks in Google Chrome:
  - PRESS 'Ctrl+T' or 'Cmd+T' to open a new tab.
  - PRESS 'Ctrl+L' or 'Cmd+L' to focus the address bar.
  - TYPE "{{search term}}".
  - PRESS 'Enter' to initiate the search.
  ...
  """
  ```

  The response is then parsed (a sketch of the parsing step follows this list), and the `operator` executes:

  ```python
  import pyautogui

  def press_keys(key_sequence):
      """
      Simulates pressing a sequence of keys, e.g. "Ctrl+T".
      """
      keys = key_sequence.lower().split("+")
      # TODO: pyautogui.hotkey(*keys) didn't work as expected; it should be fixed.
      for key in keys:
          pyautogui.keyDown(key)
      for key in reversed(keys):
          pyautogui.keyUp(key)
      return f"Pressed keys: {key_sequence}"
  ```

- **🤷 OS-Agnostic Approach:** The enhanced `VISION_PROMPT` significantly improves context awareness. It can pick out the operating system and application in use from the provided screenshots, enabling more adaptive, OS-agnostic functionality. This adaptability ensures that the tool remains universally applicable, irrespective of the user's operating environment:

  ```python
  VISION_PROMPT = """
  You are a hypothetical, OS-agnostic Self-Operating Computer, designed to simulate interaction with any graphical user interface. Your role is to analyze visual input, provided as a screenshot with a grid overlay, and suggest a series of simulated actions to accomplish the user's task. While you do not directly execute these actions, your suggestions aim to demonstrate precision and efficiency, favoring keyboard shortcuts over mouse interactions wherever possible.

  Consider the current screen and the user's objective. Your task is to determine the most efficient simulated actions, utilizing application-specific shortcuts that are universally applicable across various operating systems.
  ...
  """
  ```

- **🧠 More Defined Contextual Understanding:** The revised `VISION_PROMPT` now offers a more nuanced and defined approach to understanding and responding to user tasks. This includes a heightened ability to discern specific requirements based on the visual cues within the user interface, leading to more accurate and contextually relevant suggestions.

- **⚠️ Stricter Role-Play:** The revised `VISION_PROMPT` now enforces stricter role-play to ensure GPT-4V's response aligns with the context:

  ```python
  VISION_PROMPT = """
  ...
  Your ONLY SIMULATED ACTIONS are:
  ...
  """
  ```
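For clarity, here is a minimal sketch of the parsing step referenced in the first bullet. The `parse_press_actions` helper and the regex are hypothetical illustrations, not part of this PR; the idea is simply that the `operator` extracts each `PRESS "{{key combination}}"` line from the model's response and hands the key combination to `press_keys`:

```python
import re

# Hypothetical sketch: extract PRESS actions from a GPT-4V response.
# The pattern accepts single- or double-quoted key combinations.
PRESS_PATTERN = re.compile(r"""PRESS\s+["']([^"']+)["']""")

def parse_press_actions(response_text):
    """
    Returns the key combinations suggested by the model,
    e.g. ["Ctrl+T", "Ctrl+L", "Enter"].
    """
    return PRESS_PATTERN.findall(response_text)

# Example usage against the sample response above:
response = """
- PRESS 'Ctrl+T' or 'Cmd+T' to open a new tab.
- PRESS 'Ctrl+L' or 'Cmd+L' to focus the address bar.
- PRESS 'Enter' to initiate the search.
"""
for combo in parse_press_actions(response):
    print(combo)  # in the operator, each combo would be passed to press_keys(combo)
```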
I believe this change delivers the benefits outlined above; let me know what you think.
@sirlolcat this looks awesome. I agree with the approach. I'll take a look at this PR a little closer and let you know if I have any questions. If you'd like to discuss further, feel free to reach out to me at [email protected]
@joshbickett I appreciate the time you're putting into this. I'll adjust this PR accordingly since it's a draft; let me know if you want to discuss or change anything.
@sirlolcat this may have some impact on the vision of the project, so we're still discussing this PR. Thank you again for the PR. Hope to have an update soon!
@joshbickett thanks for the update; I appreciate the time and effort you're putting into this. I'm open to further discussion and would be more than happy to help with it.
@sirlolcat wanted to let you know I haven't forgotten this PR! This project got a lot more attention than expected, so we're sorting through priorities, but I'll get to this eventually! :)
@sirlolcat spoke to the team about this and they agreed it is a priority. I'll try to take a closer look at it this week. Since it is a bit older we'll need to resolve some merge conflicts. I'll also want to do a lot of testing on it to ensure there's no change in performance related to other prompt elements.
Amazing, I can help with benchmarking. Would you like to discuss the approach further first? I'm flexible about putting more time into this. Let me know.
Help me
https://github.com/OthersideAI/self-operating-computer/assets/42594239/7692eeff-ec2b-4bcc-97bc-23228395df8c
@sirlolcat I reviewed this yesterday but I think I forgot to respond. I think this PR is on the right track, but I have a few additional thoughts.
I typically run two basic test cases against PRs:
- "Go to YouTube and play Holiday music"
- "Go to Google Docs and write a poem"
PRESS occasionally produced a more optimal action, which was great to see, but it often failed the test cases (see video). The main issue appeared to be issuing a key command when the right window was not active (see the sketch below for one possible mitigation). Long-term I think this PRESS method could be key to the project. The goal is to emulate how a human interacts with a modern computer via inputs and outputs, and this PR progresses us towards that goal.
Additional thoughts.. right now gpt-4-vision-preview is pretty bad at following instructions in my experimenting. I had to play a lot with thinning down the prompt to make it as basic as possible for the model. If you can get this PR to pass these test cases and a few others, I think we could merge it in, but you may find that gpt-4-vision-preview is too basic. With that said, I think as models get better this type of prompt you built will work. Let's keep this conversation going and see what we can do!
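One possible mitigation for that failure mode, offered here only as a rough sketch and not as part of this PR (the `focus_window` helper and the use of `pygetwindow` are assumptions), would be to bring the intended window to the foreground before any PRESS action is executed:

```python
import pygetwindow as gw  # assumed dependency (pip install pygetwindow); activation support varies by OS

def focus_window(title_substring):
    """
    Best-effort activation of a window whose title contains the given text,
    so that subsequent key presses reach the intended application.
    Returns True if a matching window was activated.
    """
    matches = gw.getWindowsWithTitle(title_substring)
    if not matches:
        return False
    matches[0].activate()
    return True

# e.g. focus_window("Google Chrome") before press_keys("Ctrl+L")
```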
I am testing out different prompts and evaluating the results. A few more changes to the vision prompt could improve them.
Hi @sirlolcat, I am curious if you have any new findings / updates on this PR. Let me know, thanks!
@sirlolcat after hacking around with a lot of methods to get key commands to work, it appears re-architecting the project to use the system_prompt did it. Not sure why I didn't design it originally that way, I think I was sleep deprived lol.
Anyway, I am going to close this PR now that key commands are integrated in the project.