Good job but seems to be missing some things
Is it possible that a Mac's Retina display, or a 4K screen in general, confuses the algorithm?
The mouse could not find the elements that the text output described, so it got confused and clicked erratically.
Note that this was tested on the latest version of macOS.
@pligor This is likely, as I'm experiencing similar issues with a four-monitor setup in a 2x2 grid (1920x1080x4). Try switching to 1920x1080 and running it.
@pligor Interested to see the effects of using a 4K Retina display. Screenshot image scaling definitely needs to be looked into. Do you have any more info on how SOC is performing for you with your display?
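To make the scaling concern concrete, here is a minimal sketch of the kind of mismatch that could cause off-target clicks. It assumes (not confirmed anywhere in this thread) that the model returns a click position as a fraction of the screenshot, that the screenshot is captured in physical pixels, and that the mouse API expects logical points, as is typically the case on Retina displays. The function name `percent_to_logical_point` and the `scale_factor` parameter are hypothetical and are not taken from the SOC codebase.

```python
from typing import Tuple


def percent_to_logical_point(
    x_percent: float,
    y_percent: float,
    screenshot_size: Tuple[int, int],
    scale_factor: float,
) -> Tuple[int, int]:
    """Map a position given as fractions of the screenshot (0.0 to 1.0)
    to logical screen coordinates for a mouse API.

    Assumption: screenshot_size is in physical pixels, and scale_factor is
    physical pixels per logical point (e.g. 2.0 on many Retina displays).
    """
    if scale_factor <= 0:
        raise ValueError("scale_factor must be positive")

    width_px, height_px = screenshot_size

    # Position in physical pixels within the screenshot.
    x_px = x_percent * width_px
    y_px = y_percent * height_px

    # Convert physical pixels to the logical points the mouse API uses.
    return round(x_px / scale_factor), round(y_px / scale_factor)


# Hypothetical example: a 3024x1964 pixel screenshot with a 2x scale factor.
print(percent_to_logical_point(0.5, 0.5, (3024, 1964), 2.0))  # (756, 491)
```

If the scale factor were ignored in a setup like this, the click would land at twice the intended offset from the top-left corner, which would look like the erratic clicking described above. Whether SOC actually has this problem would need to be checked against its code.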
Hi all. I had set the resolution to 1582x982 on the MacBook Pro laptop, without any other screen connected to it. I tried a couple more times, but the performance was very poor. Can GPT vision provide accurate boxes or points on the screen? It is important that the mouse heads to the correct location on the screen, otherwise this will never work properly. :) I can provide a video if necessary.
Same here, on Windows. I then tried using only one screen, with the same result. I believe Vision is probably not providing the coordinates well...
From my manual tests with GPTV, it cannot give instructions effectively.
@pligor This sounds very much like the known low click accuracy. Even though error rates are high, do you notice any higher accuracy at lower resolutions compared to 4K?
I (complete noob here) think this needs a model fine-tuned to control a computer solely with keyboard shortcuts. Everything should be achievable that way too. No? I think that would be much simpler than trying to scan what's on the screen and then moving the mouse... I don't know... it sounds easier and more accurate. I mean, I'm a filmmaker, and even as a human I edit the fastest if I only use the keys :)
If anyone knows a good dataset we could use, I have a Petals swarm we can use for fine-tuning something like llama2... Am I completely in the dark here or? Also, I just love localized open source solutions. This is what will save us all.