✨ Refined Vision Prompt: Integration of Keyboard Shortcuts over Search Function
This PR proposes a methodological enhancement to the `VISION_PROMPT` framework: a `PRESS` action as a more efficient alternative to the existing `SEARCH` function. The update refines the interaction model to better align with intuitive, real-world navigation methods. The proposed changes offer several key benefits:
- **💻 Enhanced Shortcut Utilization:** The `PRESS` action can be a single key press or a combination of key presses, applicable in any active window the `operator` engages with. This approach streamlines navigation across different operating systems by using well-known shortcuts and offers a more precise and efficient means of interaction:

  ```python
  VISION_PROMPT = """
  ...
  1. PRESS - Recommend keyboard shortcuts for efficient navigation and interaction.
  Response Format: PRESS "{{key combination}}", detailing the specific action's utility.
  ...
  Example Response:
  For tasks in Google Chrome:
  - PRESS 'Ctrl+T' or 'Cmd+T' to open a new tab.
  - PRESS 'Ctrl+L' or 'Cmd+L' to focus the address bar.
  - TYPE "{{search term}}".
  - PRESS 'Enter' to initiate the search.
  ...
  """
  ```

  The response is then parsed (a sketch of the parsing step follows this list), and the `operator` executes:

  ```python
  import pyautogui

  def press_keys(key_sequence):
      """
      Simulates pressing a sequence of keys, e.g. "Ctrl+T".
      """
      keys = key_sequence.lower().split("+")
      # TODO: pyautogui.hotkey(*keys) didn't work as expected; it should be fixed.
      for key in keys:
          pyautogui.keyDown(key)
      for key in reversed(keys):
          pyautogui.keyUp(key)
      return f"Pressed keys: {key_sequence}"
  ```

- **🤷 OS-Agnostic Approach:** The enhanced `VISION_PROMPT` significantly improves context awareness. It can pick out the operating system and application in use from the provided screenshots, enabling more adaptive, OS-agnostic functionality. This adaptability ensures that the tool remains universally applicable, irrespective of the user's operating environment:

  ```python
  VISION_PROMPT = """
  You are a hypothetical, OS-agnostic Self-Operating Computer, designed to simulate interaction with any graphical user interface. Your role is to analyze visual input, provided as a screenshot with a grid overlay, and suggest a series of simulated actions to accomplish the user's task. While you do not directly execute these actions, your suggestions aim to demonstrate precision and efficiency, favoring keyboard shortcuts over mouse interactions wherever possible.

  Consider the current screen and the user's objective. Your task is to determine the most efficient simulated actions, utilizing application-specific shortcuts that are universally applicable across various operating systems.
  ...
  """
  ```

- **🧠 More Defined Contextual Understanding:** The revised `VISION_PROMPT` now offers a more nuanced and defined approach to understanding and responding to user tasks. This includes a heightened ability to discern specific requirements based on the visual cues within the user interface, leading to more accurate and contextually relevant suggestions.

- **⚠️ Stricter Role-Play:** The revised `VISION_PROMPT` now enforces stricter role-play to ensure GPT-4V's response aligns with the context:

  ```python
  VISION_PROMPT = """
  ...
  Your ONLY SIMULATED ACTIONS are:
  ...
  """
  ```
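For clarity, here is a minimal sketch of the parsing step referenced in the first bullet. The `parse_press_actions` helper and the regex are hypothetical illustrations, not part of this PR; the idea is simply that the `operator` extracts each `PRESS "{{key combination}}"` line from the model's response and hands the key combination to `press_keys`:

```python
import re

# Hypothetical sketch: extract PRESS actions from a GPT-4V response.
# The pattern accepts single- or double-quoted key combinations.
PRESS_PATTERN = re.compile(r"""PRESS\s+["']([^"']+)["']""")

def parse_press_actions(response_text):
    """
    Returns the key combinations suggested by the model,
    e.g. ["Ctrl+T", "Ctrl+L", "Enter"].
    """
    return PRESS_PATTERN.findall(response_text)

# Example usage against the sample response above:
response = """
- PRESS 'Ctrl+T' or 'Cmd+T' to open a new tab.
- PRESS 'Ctrl+L' or 'Cmd+L' to focus the address bar.
- PRESS 'Enter' to initiate the search.
"""
for combo in parse_press_actions(response):
    print(combo)  # in the operator, each combo would be passed to press_keys(combo)
```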
I believe this change delivers the benefits outlined above; let me know what you think.
@sirlolcat this looks awesome. I agree with the approach. I'll take a look at this PR a little closer and let you know if I have any questions. If you'd like to discuss further, feel free to reach out to me at [email protected]
@joshbickett I appreciate the time you're putting into this. I'll adjust this PR accordingly since it's a draft; let me know if you want to discuss or change anything.
@sirlolcat this may have some impact on the vision of the project, so we're still discussing this PR. Thank you again for the PR. Hope to have an update soon!
@joshbickett thanks for the update; I appreciate the time and effort you're putting into this. I'm open to further discussion and would be more than happy to help with it.
@sirlolcat wanted to let you know I haven't forgotten this PR! This project got a lot more attention than expected, so we're sorting through priorities, but I'll get to this eventually! :)
@sirlolcat spoke to the team about this and they agreed it is a priority. I'll try to take a closer look at it this week. Since it is a bit older we'll need to resolve some merge conflicts. I'll also want to do a lot of testing on it to ensure there's no change in performance related to other prompt elements.
Amazing, I can help with benchmarking. Would you like to discuss the approach further first? I'm flexible about putting more time into this. Let me know.
Help me
https://github.com/OthersideAI/self-operating-computer/assets/42594239/7692eeff-ec2b-4bcc-97bc-23228395df8c
@sirlolcat I reviewed this yesterday but I think I forgot to respond. I think this PR is on the right track, but I have a few additional thoughts.
I typically run two basic test cases against PRs:
- "Go to YouTube and play Holiday music"
- "Go to Google Docs and write a poem"
PRESS occasionally produced a more optimal action, which was great to see, but it often failed the test cases (see video). The main issue appeared to be issuing a key command when the right window was not active (see the sketch below for one possible mitigation). Long-term I think this PRESS method could be key to the project. The goal is to emulate how a human interacts with a modern computer via inputs and outputs, and this PR progresses us towards that goal.
Additional thoughts.. right now gpt-4-vision-preview is pretty bad at following instructions in my experimenting. I had to play a lot with thinning down the prompt to make it as basic as possible for the model. If you can get this PR to pass these test cases and a few others, I think we could merge it in, but you may find that gpt-4-vision-preview is too basic. With that said, I think as models get better this type of prompt you built will work. Let's keep this conversation going and see what we can do!
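One possible mitigation for that failure mode, offered here only as a rough sketch and not as part of this PR (the `focus_window` helper and the use of `pygetwindow` are assumptions), would be to bring the intended window to the foreground before any PRESS action is executed:

```python
import pygetwindow as gw  # assumed dependency (pip install pygetwindow); activation support varies by OS

def focus_window(title_substring):
    """
    Best-effort activation of a window whose title contains the given text,
    so that subsequent key presses reach the intended application.
    Returns True if a matching window was activated.
    """
    matches = gw.getWindowsWithTitle(title_substring)
    if not matches:
        return False
    matches[0].activate()
    return True

# e.g. focus_window("Google Chrome") before press_keys("Ctrl+L")
```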
I am testing out different prompts and evaluating the results. A few more changes to the vision prompt could improve them.
Hi @sirlolcat, I am curious if you have any new findings / updates on this PR. Let me know, thanks!
@sirlolcat after hacking around with a lot of methods to get key commands to work, it appears re-architecting the project to use the system_prompt did it. Not sure why I didn't design it originally that way, I think I was sleep deprived lol.
Anyway, I am going to close this PR now that key commands are integrated in the project.