self-operating-computer
Poor accuracy of pointer X/Y location inference
The X/Y coordinates inferred by the model are always off. It can't even select the address bar correctly.
@ahsin-s From README.md:
Note: GPT-4V's error rate in estimating XY mouse click locations is currently quite high. This framework aims to track the progress of multimodal models over time, aspiring to achieve human-level performance in computer operation.
There are demo videos showing this working with GPT-4, and they seemed to at least get the model to click the address bar, though I'm not sure whether that was achieved by placing the browser where the model expects it to be. Is the state-of-the-art multimodal model still unable to manage a simple address-bar click?
@ahsin-s this is a known limitation of GPT-4V at the moment. From OpenAI Documentation, "Limitations: [...] Spatial reasoning: The model struggles with tasks requiring precise spatial localization, such as identifying chess positions."
On my machines (Linux, macOS, and Windows), I cannot get a single address-bar click correct. I worked interactively with GPT-4, uploading a screenshot and asking "give me the coordinates of the center of the 'New Mail' button in Outlook", and I received absolute coordinates that were pretty accurate. Would it help to shift the model to use absolute coordinates rather than percentages?
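For context on the percentage question: whatever the model returns, a percentage has to be mapped onto the actual screen resolution before the click happens, so even a small estimation error misses a small target. A minimal sketch of that mapping, assuming a pyautogui-style click (the helper below is illustrative, not the project's code):

import pyautogui

def click_at_percent(x_percent, y_percent):
    # Convert percentage-based coordinates (0-100) into absolute pixels and click.
    # Sketch only: the project has its own click helper; this just shows how the
    # percentages map onto the screen and why small errors scale with resolution.
    screen_width, screen_height = pyautogui.size()
    x_pixel = int(screen_width * x_percent / 100)
    y_pixel = int(screen_height * y_percent / 100)
    pyautogui.click(x_pixel, y_pixel)

# Example: a 2% error in x on a 2560-pixel-wide screen is roughly 51 pixels,
# easily enough to miss a browser address bar.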
It seems GPT-4 estimates the coordinates based on its own data and assumptions, not from your screenshot.
That is the crux of the issue. Project maintainers don't seem to point out that GPT-4V is NOT leveraging the grid overlay to estimate coordinates and is instead relying on heuristics from its training data. It does not use the image at all to infer where to place the mouse cursor.
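One way to sanity-check that claim is to send the same question twice, once with the real gridded screenshot and once with a blank image of the same size, and compare the answers. A rough sketch of such a probe (the helper and prompt below are hypothetical, not part of the project):

import base64
import io

from PIL import Image
from openai import OpenAI

client = OpenAI()

def coords_from_image(img, prompt):
    # Encode a PIL image as a base64 PNG data URL for the vision API.
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# If the answers for the real screenshot and a blank canvas come back similar,
# the model is guessing from priors rather than reading the image.
real = Image.open("screenshots/screenshot_with_grid.png")
blank = Image.new("RGB", real.size, "white")
prompt = "Give the x,y percentage of the center of the browser address bar."
print(coords_from_image(real, prompt))
print(coords_from_image(blank, prompt))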
Hi,
Until OpenAI improves the accuracy of pointer location, wouldn't it be wise to utilize keyboard shortcuts in Windows for easier navigation?
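A minimal sketch of what that could look like, assuming pyautogui (which the project already uses for input control); Ctrl+L focuses the address bar in most browsers on Windows and Linux:

import pyautogui

# Focus the address bar with a keyboard shortcut instead of clicking on it, then
# type the URL; no coordinate estimation involved. On macOS this would be command+l.
pyautogui.hotkey("ctrl", "l")
pyautogui.typewrite("https://github.com\n", interval=0.02)

Shortcuts sidestep the localization problem entirely, although they only cover actions that actually have a shortcut.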
@ahsin-s this is a known limitation of GPT-4V at the moment. From OpenAI Documentation, "Limitations: [...] Spatial reasoning: The model struggles with tasks requiring precise spatial localization, such as identifying chess positions."
Hadn't seen this in their docs, thanks for pointing out
Project maintainers don't seem to point out that GPT-4V is NOT leveraging the grid overlay to estimate coordinates and is instead relying on heuristics from its training data.
GPT-4V does in fact leverage the grid. Every screenshot sent to the API has a coordinate grid overlaid on it. When you use the project a lot, you'll notice GPT-4V sometimes decides to click the exact intersections of the grid (which is poor logic).
Here's the section of the code where the gridded screenshot is sent to GPT-4V:
def get_next_action_from_openai(messages, objective, accurate_mode):
    ...
    try:
        screenshots_dir = "screenshots"
        if not os.path.exists(screenshots_dir):
            os.makedirs(screenshots_dir)
        screenshot_filename = os.path.join(screenshots_dir, "screenshot.png")
        # Call the function to capture the screen with the cursor
        capture_screen_with_cursor(screenshot_filename)
        new_screenshot_filename = os.path.join(
            "screenshots", "screenshot_with_grid.png"
        )
        add_grid_to_image(screenshot_filename, new_screenshot_filename, 500)
        # sleep for a second
        time.sleep(1)
        with open(new_screenshot_filename, "rb") as img_file:
            img_base64 = base64.b64encode(img_file.read()).decode("utf-8")
        previous_action = get_last_assistant_message(messages)
        vision_prompt = format_vision_prompt(objective, previous_action)
        vision_message = {
            "role": "user",
            "content": [
                {"type": "text", "text": vision_prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"},
                },
            ],
        }
        ...
        response = client.chat.completions.create(
            model="gpt-4-vision-preview",
Ultimately the Grid allows this project to be a proof of concept until spacial reasoning or object detention is better
‘Ultimately the Grid allows this project to be a proof of concept until spacial reasoning or object detention is better’ you mean object detection right?
Hello @joshbickett,
I have some exciting news to share about the development of your self-operating-computer application. Apple has recently introduced an innovative AI model called Apple ML FERRET, which could be an ideal solution for the challenges you're facing with the GPT-4 Vision model, especially regarding the localization of buttons in a screenshot.
The solution is open source but very GPU intensive: https://github.com/apple/ml-ferret
@khalidovicGPT yes, I saw that. It is certainly an exciting update. Would you be interested in attempting a PR of Ferret into the Self-Operating Computer?
Hello @joshbickett, Happy New Year 2024! I am very honored to have the opportunity to contribute to the development of this remarkable application. The beginning of this year is quite busy for me in terms of availability, but I will do my best to help. I am going to inquire about the possibility of obtaining a server equipped with the ML FERRET model, with configured API access. I will keep you updated on my progress.
@khalidovicGPT sounds good. I briefly tried to run Ferret yesterday morning but realized it required more time than I had. From briefly reading the paper, it sounds like it is accurate at detecting objects by X and Y pixel coordinates, so it could be compatible with the project. Interested to see if you can get it working! I wonder if anyone has an inference endpoint available for Ferret yet? We could use an established endpoint and keep the key in an environment variable.
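Purely as a sketch of how an established endpoint plus a key variable could slot in; the endpoint URL, the FERRET_API_KEY name, and the response format below are all hypothetical, since no hosted Ferret API exists in the project:

import os
import requests

# Both values below are hypothetical; nothing like this is wired up yet.
FERRET_ENDPOINT = os.environ.get("FERRET_ENDPOINT", "https://example.com/ferret/ground")
FERRET_API_KEY = os.environ["FERRET_API_KEY"]

def locate_element(image_base64, description):
    # Ask a hosted grounding model where a described UI element is.
    # The request/response shape here is made up purely for illustration.
    resp = requests.post(
        FERRET_ENDPOINT,
        headers={"Authorization": f"Bearer {FERRET_API_KEY}"},
        json={"image": image_base64, "query": description},
        timeout=60,
    )
    resp.raise_for_status()
    x1, y1, x2, y2 = resp.json()["box"]  # assumed bounding-box format
    return (x1 + x2) // 2, (y1 + y2) // 2  # click the center of the box

The details would of course depend on how such a model ends up being served.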
Hi @joshbickett, good! You have done more than me. I will update you if I make progress. Good luck!
Hi Josh !
Hope you're doing well. Have you made any progress on the Ferret model?
On my end, I've been exploring a lighter and more efficient model: Vstar.
Source: https://github.com/penghao-wu/vstar
You can test it here: https://craigwu-vstar.hf.space/
The documentation: https://vstar-seal.github.io/
There's even a YouTube video on this model if you understand Spanish: https://youtu.be/4goxXl2Lk_M?si=gcthpZt8Y5DjxOkZ
I tried installing it locally, but it seems it doesn't work on PowerShell, only on a Linux OS with CUDA version lower than 2.
I'll let you see if you can make more progress than me.
I'll keep you updated if I manage to make more progress on my end. Good luck, Josh :)
Here are some tests that were conducted: [test screenshots]
I wanted to mention the OCR approach for those who have not seen it; it goes part way toward solving this issue.
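For anyone who hasn't seen it: the OCR idea is to read the words in the screenshot, find the label you want to click, and click the center of its bounding box, so no coordinate estimation by the model is needed. A minimal sketch with pytesseract (not the project's exact implementation):

import pyautogui
import pytesseract
from PIL import Image

def click_text(target, screenshot_path="screenshots/screenshot.png"):
    # Read the words in the screenshot, find the one matching `target`, and click
    # the center of its bounding box. Assumes the screenshot is captured at the
    # native screen resolution so OCR pixel coordinates match screen coordinates.
    img = Image.open(screenshot_path)
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    for i, word in enumerate(data["text"]):
        if word.strip().lower() == target.lower():
            x = data["left"][i] + data["width"][i] // 2
            y = data["top"][i] + data["height"][i] // 2
            pyautogui.click(x, y)
            return True
    return False

The obvious limitation is that OCR only finds visible text, so icons and unlabeled controls still need some other form of visual grounding, which is why it only goes part way.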
I'll keep you updated if I manage to make more progress on my end. Good luck, Josh :)
@khalidovicGPT curious if you were able to make any more progress?
I'll close this ticket for now since we have the OCR approach!