self-operating-computer
Poor accuracy of pointer X/Y location inference
The X/Y coordinates inferred by the model are always off. It can't even select the address bar correctly.
@ahsin-s From README.md:
Note: GPT-4V's error rate in estimating XY mouse click locations is currently quite high. This framework aims to track the progress of multimodal models over time, aspiring to achieve human-level performance in computer operation.
There are demo videos showing this working with GPT-4, and they seemed to at least get the model to click the address bar, though I'm not sure whether that was achieved by placing the browser where the model expects it to be. Is the state-of-the-art multimodal model still unable to manage a simple address-bar click?
@ahsin-s this is a known limitation of GPT-4V at the moment. From OpenAI Documentation, "Limitations: [...] Spatial reasoning: The model struggles with tasks requiring precise spatial localization, such as identifying chess positions."
On my machines (Linux, macOS, and Windows), I cannot get a single address-bar click correct. I worked interactively with GPT-4, uploading a screenshot and asking "give me the coordinates of the center of the 'New Mail' button in Outlook", and I received absolute coordinates that were pretty accurate. Would it help to shift the model to use absolute coordinates rather than percentages?
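For context on the percentage question: whatever the model returns, a percentage has to be mapped onto the actual screen resolution before the click happens, so even a small estimation error misses a small target. A minimal sketch of that mapping, assuming a pyautogui-style click (the helper below is illustrative, not the project's code):

import pyautogui

def click_at_percent(x_percent, y_percent):
    # Convert percentage-based coordinates (0-100) into absolute pixels and click.
    # Sketch only: the project has its own click helper; this just shows how the
    # percentages map onto the screen and why small errors scale with resolution.
    screen_width, screen_height = pyautogui.size()
    x_pixel = int(screen_width * x_percent / 100)
    y_pixel = int(screen_height * y_percent / 100)
    pyautogui.click(x_pixel, y_pixel)

# Example: a 2% error in x on a 2560-pixel-wide screen is roughly 51 pixels,
# easily enough to miss a browser address bar.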
It seems GPT-4 estimates the coordinates based on its own data and assumptions, not from your screenshot.
That is the crux of the issue. Project maintainers don't seem to point out that GPT-4V is NOT leveraging the grid overlay to estimate coordinates and is instead relying on heuristics from its training data. It does not use the image at all to infer where to place the mouse cursor.
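One way to sanity-check that claim is to send the same question twice, once with the real gridded screenshot and once with a blank image of the same size, and compare the answers. A rough sketch of such a probe (the helper and prompt below are hypothetical, not part of the project):

import base64
import io

from PIL import Image
from openai import OpenAI

client = OpenAI()

def coords_from_image(img, prompt):
    # Encode a PIL image as a base64 PNG data URL for the vision API.
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# If the answers for the real screenshot and a blank canvas come back similar,
# the model is guessing from priors rather than reading the image.
real = Image.open("screenshots/screenshot_with_grid.png")
blank = Image.new("RGB", real.size, "white")
prompt = "Give the x,y percentage of the center of the browser address bar."
print(coords_from_image(real, prompt))
print(coords_from_image(blank, prompt))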
Hi,
Until OpenAI improves the accuracy of pointer location, wouldn't it be wise to utilize keyboard shortcuts in Windows for easier navigation?
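A minimal sketch of what that could look like, assuming pyautogui (which the project already uses for input control); Ctrl+L focuses the address bar in most browsers on Windows and Linux:

import pyautogui

# Focus the address bar with a keyboard shortcut instead of clicking on it, then
# type the URL; no coordinate estimation involved. On macOS this would be command+l.
pyautogui.hotkey("ctrl", "l")
pyautogui.typewrite("https://github.com\n", interval=0.02)

Shortcuts sidestep the localization problem entirely, although they only cover actions that actually have a shortcut.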
@ahsin-s this is a known limitation of GPT-4V at the moment. From OpenAI Documentation, "Limitations: [...] Spatial reasoning: The model struggles with tasks requiring precise spatial localization, such as identifying chess positions."
Hadn't seen this in their docs, thanks for pointing out
Project maintainers don't seem to point out that GPT-4V is NOT leveraging the grid overlay to estimate coordinates and is instead relying on heuristics from its training data.
GPT-4V does in fact leverage the grid. Every screenshot sent to the API has a coordinate grid overlaid on it. When you use the project a lot, you'll notice GPT-4V sometimes decides to click the exact intersections of the grid (which is poor logic).
Here's the section of the code where the gridded screenshot is sent to GPT-4V:
def get_next_action_from_openai(messages, objective, accurate_mode):
    ...
    try:
        screenshots_dir = "screenshots"
        if not os.path.exists(screenshots_dir):
            os.makedirs(screenshots_dir)
        screenshot_filename = os.path.join(screenshots_dir, "screenshot.png")
        # Call the function to capture the screen with the cursor
        capture_screen_with_cursor(screenshot_filename)
        new_screenshot_filename = os.path.join(
            "screenshots", "screenshot_with_grid.png"
        )
        add_grid_to_image(screenshot_filename, new_screenshot_filename, 500)
        # sleep for a second
        time.sleep(1)
        with open(new_screenshot_filename, "rb") as img_file:
            img_base64 = base64.b64encode(img_file.read()).decode("utf-8")
        previous_action = get_last_assistant_message(messages)
        vision_prompt = format_vision_prompt(objective, previous_action)
        vision_message = {
            "role": "user",
            "content": [
                {"type": "text", "text": vision_prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"},
                },
            ],
        }
        ...
        response = client.chat.completions.create(
            model="gpt-4-vision-preview",
Ultimately the Grid allows this project to be a proof of concept until spacial reasoning or object detention is better
‘Ultimately the Grid allows this project to be a proof of concept until spacial reasoning or object detention is better’ you mean object detection right?
Hello @joshbickett,
I have some exciting news to share about the development of your self-operating-computer application. Apple has recently introduced an innovative AI model called Apple ML FERRET, which could be an ideal solution for the challenges you're facing with the GPT-4 Vision model, especially regarding the localization of buttons in a screenshot.
The solution is open source but very GPU intensive: https://github.com/apple/ml-ferret
@khalidovicGPT yes, I saw that. It is certainly an exciting update. Would you be interested in attempting a PR of Ferret into the Self-Operating Computer?
Hello @joshbickett, Happy New Year 2024! I am very honored to have the opportunity to contribute to the development of this remarkable application. The beginning of this year is quite busy for me in terms of availability, but I will do my best to help. I am going to inquire about the possibility of obtaining a server equipped with the ML FERRET model, with configured API access. I will keep you updated on my progress.
@khalidovicGPT sounds good. I briefly tried to run Ferret yesterday morning but realized it required more time than I had. From briefly reading the paper, it sounds like it is accurate at detecting objects by X and Y pixel coordinates, so it could be compatible with the project. Interested to see if you can get it working! I wonder if anyone has an inference endpoint available for Ferret yet? We could use an established endpoint and keep the key in an environment variable.
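Purely as a sketch of how an established endpoint plus a key variable could slot in; the endpoint URL, the FERRET_API_KEY name, and the response format below are all hypothetical, since no hosted Ferret API exists in the project:

import os
import requests

# Both values below are hypothetical; nothing like this is wired up yet.
FERRET_ENDPOINT = os.environ.get("FERRET_ENDPOINT", "https://example.com/ferret/ground")
FERRET_API_KEY = os.environ["FERRET_API_KEY"]

def locate_element(image_base64, description):
    # Ask a hosted grounding model where a described UI element is.
    # The request/response shape here is made up purely for illustration.
    resp = requests.post(
        FERRET_ENDPOINT,
        headers={"Authorization": f"Bearer {FERRET_API_KEY}"},
        json={"image": image_base64, "query": description},
        timeout=60,
    )
    resp.raise_for_status()
    x1, y1, x2, y2 = resp.json()["box"]  # assumed bounding-box format
    return (x1 + x2) // 2, (y1 + y2) // 2  # click the center of the box

The details would of course depend on how such a model ends up being served.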
Hi @joshbickett, good! You have done more than me. I will update you if I make progress. Good luck!
Hi Josh !
Hope you're doing well. Have you made any progress on the Ferret model?
On my end, I've been exploring a lighter and more efficient model: Vstar.
Source: https://github.com/penghao-wu/vstar
You can test it here: https://craigwu-vstar.hf.space/
The documentation: https://vstar-seal.github.io/
There's even a YouTube video on this model if you understand Spanish: https://youtu.be/4goxXl2Lk_M?si=gcthpZt8Y5DjxOkZ
I tried installing it locally, but it seems it doesn't work on PowerShell, only on a Linux OS with CUDA version lower than 2.
I'll let you see if you can make more progress than me.
I'll keep you updated if I manage to make more progress on my end. Good luck, Josh :)
Here are some tests that were conducted: [test screenshots]
I wanted to mention the OCR approach for those who have not seen it; it goes part way toward solving this issue.
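For anyone who hasn't seen it: the OCR idea is to read the words in the screenshot, find the label you want to click, and click the center of its bounding box, so no coordinate estimation by the model is needed. A minimal sketch with pytesseract (not the project's exact implementation):

import pyautogui
import pytesseract
from PIL import Image

def click_text(target, screenshot_path="screenshots/screenshot.png"):
    # Read the words in the screenshot, find the one matching `target`, and click
    # the center of its bounding box. Assumes the screenshot is captured at the
    # native screen resolution so OCR pixel coordinates match screen coordinates.
    img = Image.open(screenshot_path)
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    for i, word in enumerate(data["text"]):
        if word.strip().lower() == target.lower():
            x = data["left"][i] + data["width"][i] // 2
            y = data["top"][i] + data["height"][i] // 2
            pyautogui.click(x, y)
            return True
    return False

The obvious limitation is that OCR only finds visible text, so icons and unlabeled controls still need some other form of visual grounding, which is why it only goes part way.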
I'll keep you updated if I manage to make more progress on my end. Good luck, Josh :)
@khalidovicGPT curious if you were able to make any more progress?
I'll close this ticket for now since we have the OCR approach!