Add a grid of coordinates
Since it tends to misclick a lot, you could either train a model to do image segmentation, or, with some clever prompt engineering, overlay a barebones grid and ask the model to solve the puzzle of "in which coordinate can the search button be found?" That should make it more robust, right? (A rough sketch of what I mean follows the links below.)
See also:
- https://github.com/OthersideAI/self-operating-computer/issues/3
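To make the idea concrete, here is a minimal sketch (not the project's actual code) of overlaying a labeled coordinate grid on a screenshot with Pillow, so the prompt can ask which cell contains the target element. The cell size, colors, and label format are all assumptions.

```python
# Hypothetical sketch: overlay a labeled coordinate grid on a screenshot so the
# vision model can answer "which cell contains the search button?".
# Cell size and label style are assumptions, not the project's actual settings.
from PIL import Image, ImageDraw


def add_coordinate_grid(image_path: str, out_path: str, cell: int = 100) -> None:
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    width, height = img.size

    # Vertical and horizontal grid lines every `cell` pixels.
    for x in range(0, width, cell):
        draw.line([(x, 0), (x, height)], fill=(255, 0, 0), width=1)
    for y in range(0, height, cell):
        draw.line([(0, y), (width, y)], fill=(255, 0, 0), width=1)

    # Label each cell with a "column,row" coordinate the model can cite back.
    for x in range(0, width, cell):
        for y in range(0, height, cell):
            draw.text((x + 2, y + 2), f"{x // cell},{y // cell}", fill=(255, 0, 0))

    img.save(out_path)


# Usage: add_coordinate_grid("screenshot.png", "screenshot_grid.png")
```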
> add a barebones grid

I noticed that you currently seem to apply a grid to the images to assist the vision model:

_Originally posted by @0xdevalias in https://github.com/OthersideAI/self-operating-computer/issues/3_
How about applying a dynamic grid approach to enhance click accuracy?
For example, we could adjust the grid density based on the proximity to the cursor. The areas closer to the cursor would have a denser grid, allowing for more accurate click predictions.
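As a rough illustration of that idea (purely a sketch, with made-up cell sizes and window radius), one option is to draw a coarse grid over the whole screenshot and a much finer grid only inside a window around the current cursor position:

```python
# Hypothetical sketch of the "denser grid near the cursor" idea: a coarse grid
# everywhere, plus a fine grid inside a window around the cursor. The cell
# sizes and radius below are assumptions, not tuned values.
from PIL import Image, ImageDraw


def add_dynamic_grid(
    image_path: str,
    out_path: str,
    cursor_xy: tuple[int, int],
    coarse: int = 200,
    fine: int = 25,
    radius: int = 300,
) -> None:
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    width, height = img.size
    cx, cy = cursor_xy

    # Coarse grid over the entire image.
    for x in range(0, width, coarse):
        draw.line([(x, 0), (x, height)], fill=(0, 0, 255), width=1)
    for y in range(0, height, coarse):
        draw.line([(0, y), (width, y)], fill=(0, 0, 255), width=1)

    # Fine grid only inside the window around the cursor, where precise
    # click predictions matter most.
    left, top = max(0, cx - radius), max(0, cy - radius)
    right, bottom = min(width, cx + radius), min(height, cy + radius)
    for x in range(left, right, fine):
        draw.line([(x, top), (x, bottom)], fill=(255, 0, 0), width=1)
    for y in range(top, bottom, fine):
        draw.line([(left, y), (right, y)], fill=(255, 0, 0), width=1)

    img.save(out_path)


# Usage: add_dynamic_grid("screenshot.png", "screenshot_dyn.png", cursor_xy=(640, 400))
```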
Set-of-Mark prompting is now available. Swap in the `best.pt` from your best YOLOv8 run and see how it does.
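For anyone trying this, here is a hedged sketch (not the repository's actual pipeline) of what Set-of-Mark-style preprocessing could look like with an Ultralytics YOLOv8 checkpoint: detect UI elements, draw a numbered mark on each box, and let the model answer with a mark index instead of raw pixel coordinates. The function name, output path, and drawing details are assumptions.

```python
# Hypothetical Set-of-Mark-style preprocessing with a custom YOLOv8 checkpoint.
# `best.pt` is assumed to be the weights you trained for UI-element detection.
from PIL import Image, ImageDraw
from ultralytics import YOLO


def mark_ui_elements(image_path: str, weights: str = "best.pt") -> list[tuple[int, int]]:
    model = YOLO(weights)
    result = model(image_path)[0]  # single image -> single Results object

    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    centers = []

    for idx, box in enumerate(result.boxes.xyxy.tolist()):
        x1, y1, x2, y2 = box
        # Draw the box plus a numeric "mark" the prompt can refer to.
        draw.rectangle([x1, y1, x2, y2], outline=(0, 255, 0), width=2)
        draw.text((x1 + 2, y1 + 2), str(idx), fill=(0, 255, 0))
        centers.append((int((x1 + x2) / 2), int((y1 + y2) / 2)))

    img.save("screenshot_marked.png")
    return centers  # index i = click target for mark i


# Usage: centers = mark_ui_elements("screenshot.png"); click at centers[chosen_mark]
```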