Add scrolling support and replace CLICK action
This PR aims to achieve two primary objectives:
- Support vertical mouse-wheel scrolling to let the model access UI elements which currently aren't on the screen.
- Replace the CLICK action with a MOUSE action. The MOUSE action will open up broader support for what the model can do with the user's cursor.
Currently, a CLICK action moves the cursor to (X, Y) on screen then clicks on whatever UI element is below it. The MOUSE action will give the model the following possible capabilities:
- Move the cursor to (X, Y) without clicking [useful when the model needs to hover, or to give it a chance to first confirm it's on the correct target UI element].
- Move the cursor to (X, Y) and do a left-click.
- Move the cursor to (X, Y) and do a positive or negative mouse-wheel scroll.
This more generalized MOUSE action opens the possibility to let the model do more things with the mouse in the future, such as a right-click or drag.
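The capabilities above could all be dispatched from a single action type. A minimal sketch of that dispatch, where the operation names and dict keys are my own illustrative assumptions, not necessarily what this PR implements:

```python
# Hypothetical sketch (not the PR's actual code): translating a MOUSE action
# emitted by the model into a concrete cursor operation.
def parse_mouse_action(action: dict) -> tuple:
    """Return (operation, x, y, scroll_amount) for a MOUSE action dict."""
    op = action.get("operation", "move")
    x, y = action["x"], action["y"]
    if op == "move":
        # Hover only: lets the model confirm the target before committing.
        return ("move", x, y, 0)
    if op == "left_click":
        return ("click", x, y, 0)
    if op == "scroll":
        # Positive amount scrolls up, negative scrolls down.
        return ("scroll", x, y, action.get("amount", 0))
    raise ValueError(f"unsupported mouse operation: {op}")
```

A future right-click or drag would just be another `op` branch here, which is the main reason the generalized action seems preferable to a bare CLICK.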
Merging this PR will close #74.
I am open to any suggestions or feedback as I make these changes!
Okay I think that's a wrap. The model can now hover the cursor at a chosen location and scroll up or down. I also added a bit to the vision prompt and it seems to have drastically reduced repetition for me.
@michaelhhogue thanks for this PR. Still need to review it. I'll let you know if I have any questions!
@michaelhhogue had a chance to take a closer look. I think this approach is very interesting. It makes sense to expand this action from click to mouse and to closer emulate human actions. Since this is an architecture change I am doing testing and looking to understand it further. A few notes below.
- One of my common test cases is failing: "Go to youtube and play holiday music". Does this one work for you?
- I am curious to see how the scroll is performing. Do you have some test cases in mind that I could run to learn more?
- On another note, we're currently building agent-1 under the old architecture. For this reason, we may not be able to integrate your new architecture yet. I'm going to review this new architecture while building out the API and will see what we can do. Long term we'll want this architecture in our API, it just may not be right away.
https://github.com/OthersideAI/self-operating-computer/assets/42594239/f4dc70ea-a86d-4b49-b2e0-f0708a68566e
Here's a video of the "Go to youtube and play holiday music" test case and how it failed. It did a double-click.
@joshbickett Glad you like the concept! I totally understand if you can't merge this due to the architectural changes. I'm going to put this back to draft for now until I figure out what's causing the double clicks in your test case. It could be that I added too much to the vision prompt. I'll keep you posted!
Sounds good. Another PR that I think would be great would be a test case PR. We could develop test cases to run over all PRs to see how they're doing.
@michaelhhogue was thinking about this more. It'd be great to get this architecture in and I'm happy to pair on it. Let me know if there's anything I can help with. I am still not sure why that one test case was failing.
@joshbickett Great to hear! I've got this on hold at the moment. Currently I'm working on a PR that replaces platform-specific screenshot methods with one platform-agnostic screenshot function using MSS. This will also enable support for multi-monitor setups.
I'll let you know when I'm back to working on this PR. In the mean time, feel free to add any commits here improving the architecture or fixing any issues I missed.
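For reference, a minimal platform-agnostic capture with MSS might look like this. This is just a sketch, not the final code for that PR; the function name and monitor handling are my assumptions:

```python
def capture_screen(file_path: str, monitor_index: int = 1) -> str:
    """Capture a single monitor to a PNG using MSS (cross-platform).

    Sketch only: the signature and monitor handling are assumptions,
    not necessarily what the screenshot PR will ship.
    """
    import mss        # imported lazily so this module loads without MSS installed
    import mss.tools

    with mss.mss() as sct:
        # sct.monitors[0] is the full virtual screen; 1..n are individual displays,
        # which is what enables the multi-monitor support.
        monitor = sct.monitors[monitor_index]
        shot = sct.grab(monitor)
        mss.tools.to_png(shot.rgb, shot.size, output=file_path)
    return file_path
```

One caveat worth checking: as far as I know MSS grabs the raw framebuffer and does not draw the cursor, so the mouse-visibility question would need separate handling.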
@michaelhhogue sounds good. On the screenshot PR, I had some kind of issue on Mac trying to use ImageGrab, I think, but I can't recall exactly what it was. After having that issue I elected for `subprocess.run(["screencapture", "-C", file_path])`. I think maybe it didn't show the mouse or wasn't compatible with showing the mouse. Hmm, I'll try to remember and let you know.
@joshbickett I've tested the common case of "go to youtube and play holiday music" multiple times and I haven't had any issues with a double click. I actually notice reduced repetition with how I modified the vision prompt. However, it never takes the straightforward path to finding the video by just submitting the search query directly into the youtube url. Instead, it always googles holiday music and tries to click on the first video link. Sometimes it can click it, sometimes it can't. I've noticed this on the main branch as well.
If you want, you could try changing the vision prompt to see if it stops doing the double click for you.
@Daisuke134 To prevent the model from scrolling down to some point in the page and then clicking on some random UI element without seeing it first. It should stop after scrolling, look at the screen, and determine where to actually do the click.
If scrolling and a click are done in the same action you'll have issues in sites like Wikipedia where it'll scroll down, accidentally click a link, and go to the wrong article.
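Concretely, the action handler could enforce that separation up front. An illustrative guard, with key names assumed rather than taken from the PR:

```python
def validate_mouse_action(action: dict) -> dict:
    """Reject MOUSE actions that try to scroll and click in one step.

    Illustrative sketch: the dict keys are assumptions. Forcing a fresh
    screenshot between a scroll and a click avoids clicking blind on
    whatever happens to land under the cursor (e.g. a Wikipedia link).
    """
    if action.get("operation") == "scroll" and action.get("click"):
        raise ValueError(
            "scroll and click are mutually exclusive: scroll first, "
            "re-capture the screen, then decide where to click"
        )
    return action
```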
@michaelhhogue Thank you so much. I will check out the scrolling action 🙇‍♂️
Scrolling and clicking are mutually exclusive. You shouldn't click in the same action as a scroll, since you won't yet know what you'd be clicking on.
This makes sense now that I think of it.
@michaelhhogue I'll take another look at this PR and let you know if I have any questions.
I'm going to go ahead and close this PR for now since the architecture has changed a lot since I opened it. I'll possibly revisit this in the future.