[Bug]: Missing clicks
Version
v0.1.1
Model
UI-TARS-1.5-7B
Deployment Method
Cloud
Issue Description
OS: Windows 11
I gave a seemingly simple command to open a web page and switch tenants (and later to try to log in). In the screenshot below you can see that it notices the option and tries to click it, but it uses the wrong coordinates every time.
Click coordinates from the screenshot: click(start_box: [0.07375776397515528,0.034482758620689655,0.07375776397515528,0.034482758620689655])
My screen resolution is 1920*1080 so it's nothing unusual. Also, the app opened Chrome in windowed mode by itself, and I did not resize it at all.
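For reference, here is roughly where that start_box would land in pixels. This is only a sketch: it assumes the four values are fractions of the full 1920x1080 screen, and `boxCenterToPixels` is a made-up helper name, not something from UI-TARS.

```typescript
// Sketch: convert a normalized start_box into pixel coordinates.
// Assumption: box values are fractions of the screen width/height.
type Box = [number, number, number, number];

// Hypothetical helper (not part of UI-TARS): returns the centre of the box in pixels.
function boxCenterToPixels(box: Box, width: number, height: number): [number, number] {
  const [x1, y1, x2, y2] = box;
  return [
    Math.round(((x1 + x2) / 2) * width),
    Math.round(((y1 + y2) / 2) * height),
  ];
}

// The box reported in this issue, mapped onto a 1920x1080 screen.
const reported: Box = [
  0.07375776397515528,
  0.034482758620689655,
  0.07375776397515528,
  0.034482758620689655,
];
const [px, py] = boxCenterToPixels(reported, 1920, 1080);
console.log(px, py); // 142 37
```

If the model produced those fractions against a differently sized screenshot than the one the click is applied to, the resulting pixel position would differ from the intended target.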
Error Logs
No response
What is the VLM Provider you select?
Ah, sorry for not mentioning it. It's Hugging Face for UI-TARS 1.5
I am having the same issue.
Running locally with LM Studio; model is ui-tars-1.5-7b-mlx
Provider: Hugging Face for UI-TARS-1.5
Device: 16-inch MacBook
I'm having the same issue. UI-TARS Desktop latest version using LM Studio Server on Windows 11. Using the model "llm hack337/ui-tars-1.5-7b" as it has vision and tool use support. UI-TARS Desktop connects successfully, sometimes types in the Google search box if it is already selected, and reasons correctly and logically. The only problem is that it ALWAYS misclicks, no matter where on the screen it is attempting to click. Its coordinates are off.
If it is attempting to click close to the top-left corner of the screen, it's only off by a little bit, but the further towards the centre and bottom right you go, the worse it gets. Something obviously isn't scaling right. I also tried Gemma 3 27B hosted through LM Studio; it connects and reasons correctly but clicks at the wrong coordinates.
See below, it attempted to click in the search box to search for something, but it missed:
A fix for this is URGENT as UI-TARS is unusable in its current state due to this bug.
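The "small error near the top left, large error near the bottom right" pattern is what a pure scale mismatch would produce. A minimal sketch, assuming the click is computed in one coordinate space and applied in another that differs by a constant factor (the 1.25 below is an arbitrary illustration, not a measured value):

```typescript
// Sketch: distance between the intended target and the actual click
// when every coordinate is multiplied by a constant mismatch factor.
function clickError(targetX: number, targetY: number, scaleMismatch: number): number {
  const clickedX = targetX * scaleMismatch;
  const clickedY = targetY * scaleMismatch;
  return Math.hypot(clickedX - targetX, clickedY - targetY);
}

const mismatch = 1.25; // assumed value, purely for illustration
const targets: Array<[number, number]> = [
  [100, 50],    // near the top-left corner
  [960, 540],   // centre of a 1920x1080 screen
  [1800, 1000], // near the bottom-right corner
];

for (const [x, y] of targets) {
  console.log(`target (${x}, ${y}) -> off by ${clickError(x, y, mismatch).toFixed(1)} px`);
}
```

The error is zero at the origin and grows linearly with the distance from it, which matches the behaviour described above. It does not say where in the pipeline the mismatch comes from.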
Hi there, please see https://github.com/bytedance/UI-TARS/issues/150 to check whether you are running into a resolution-related problem.
Yes, my resolution is: 3840 x 2160. I've attempted lowering my screen resolution to 1080p but this did not fix the issue sadly.
Same issue: M1 MBA, UI-TARS 1.5 7B MLX, running via LM Studio.
Same issue on m4pro + lmstudio + 1.5 7b mlx ... the click location is always incorrect.
@Fraer Are you using the latest release (v1.2.0)? If you are still on v1.1.0, I recommend updating.
It works in my environment, at least:
- Windows 11
- Primary monitor 1920x1080, scale factor: 1
- Local source code deployment with pnpm. https://github.com/bytedance/UI-TARS-desktop/blob/main/CONTRIBUTING.md#run-the-application
[!Warning] UI-TARS-desktop currently supports only one screen. Even if you have multiple displays, the UI-TARS model can only see the primary monitor, and the operator only works inside it.
Information for SDK users:
If you omit the uiTarsVersion parameter when you create a GUIAgent instance, it defaults to v1.0.
Therefore, UI-TARS-1.5 users must set the version explicitly.
Even if you pass ByteDance-Seed/UI-TARS-1.5-7B as the model parameter, no parser sets UITarsVersion automatically.
The click coordinate bug then occurs.
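To illustrate the pitfall without depending on the SDK itself, here is a self-contained sketch. Everything in it is a local stand-in: `ModelVersion`, `AgentOptions`, `effectiveVersion` and `versionLooksMismatched` are hypothetical names, not the SDK's real types or functions. It only models the behaviour described above (an omitted version silently becomes v1.0) plus a guard one could add in calling code.

```typescript
// Local stand-in for the SDK's version enum; the values are illustrative only.
enum ModelVersion {
  V1_0 = '1.0',
  V1_5 = '1.5',
}

// Hypothetical options shape: the version is optional, as described above.
interface AgentOptions {
  model: string;
  version?: ModelVersion;
}

// Models the described default: an omitted version becomes V1_0,
// regardless of what the model name says.
function effectiveVersion(opts: AgentOptions): ModelVersion {
  return opts.version ?? ModelVersion.V1_0;
}

// Hypothetical guard: the model name mentions 1.5 but the effective version is not 1.5.
function versionLooksMismatched(opts: AgentOptions): boolean {
  const nameSays15 = /ui-tars-1\.5/i.test(opts.model);
  return nameSays15 && effectiveVersion(opts) !== ModelVersion.V1_5;
}

const implicit: AgentOptions = { model: 'ByteDance-Seed/UI-TARS-1.5-7B' };
const explicit: AgentOptions = { ...implicit, version: ModelVersion.V1_5 };

console.log(versionLooksMismatched(implicit)); // true
console.log(versionLooksMismatched(explicit)); // false
```

A check like this in your own setup code would surface the mismatch before any click is attempted, rather than relying on the model name to select the parser behaviour.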
@meme-dayo
macOS: Sequoia 15.5 (24F74)
UI-TARS: Version 0.1.2 (installed via brew install --cask ui-tars)
Still missing every single click for me. Win11, LM Studio, ui-tars-desktop 0.1.2; it can't even click the search box. In settings, I have set the VLM provider to "Hugging Face for UI-TARS-1.5" and the VLM Model Name to "ByteDance-Seed/UI-TARS-1.5-7B". I also tried llama.cpp-server and vLLM, in both browser and computer operator modes.
Could you tell me about your primary monitor?
- Screen size: 1920 x 1080p
- Scale factor : 1, 1.25, 1.5 ...
- Retina display?
@meme-dayo My current screen size: 1920 x 1080, no scale factor, external screen S277HK (non-Retina).
Tried with a lower resolution, 1600 x 900: same incorrect click position.
Tried with the native screen of the MacBook Pro (Retina): same incorrect click position. System report:
Displays:
  Color LCD:
    Display Type: Built-in Liquid Retina XDR Display
    Resolution: 3024 x 1964 Retina
    Main Display: Yes
    Mirror: Off
    Online: Yes
    Automatically Adjust Brightness: Yes
    Connection Type: Internal
Tried setting the max resolution, 3024 x 1964: same incorrect click position.
However, just once, after changing from the max resolution to the "default" resolution of 1512 x 982 without restarting the app, the click position became correct. I could not reproduce it by switching to max resolution and back to default again, so it's incorrect again. Really weird. It looks like there is a serious problem with how the screen resolution is detected.
@Fraer Thank you for sharing this valuable information!
"once after changing from max resolution to "default" resolution 1512 x 982 without restarting the app, the click position became correct"
Perhaps the DPR (device pixel ratio) logic for Mac or Retina displays needs reconsideration.
I wonder why the following value is hard-coded on Mac:
scaleFactor = 1
at apps\ui-tars\src\main\utils\screen.ts
import { screen } from 'electron';
import * as env from '@main/env';

export const getScreenSize = () => {
  const primaryDisplay = screen.getPrimaryDisplay();
  const logicalSize = primaryDisplay.size; // Logical = Physical / scaleX
  // Mac retina display scaleFactor = 1
  const scaleFactor = env.isMacOS ? 1 : primaryDisplay.scaleFactor;
  const physicalSize = {
    width: Math.round(logicalSize.width * scaleFactor),
    height: Math.round(logicalSize.height * scaleFactor),
  };
  //...
};
Then scaleFactor is multiplied in by the ActionParser when it maps coordinates from the UI-TARS output to operator functions like click(x, y).
packages\ui-tars\action-parser\src\actionParser.ts
export function parseActionVlm(
  text: string,
  factors: [number, number] = [1000, 1000],
  mode: 'bc' | 'o1' = 'bc',
  screenContext?: {
    width: number;
    height: number;
  },
  scaleFactor?: number,
  modelVer: UITarsModelVersion = UITarsModelVersion.V1_0,
): PredictionParsed[] {
  //...
  if (screenContext?.width && screenContext?.height) {
    const boxKey = paramName.includes('start_box')
      ? 'start_coords'
      : 'end_coords';
    const [x1, y1, x2 = x1, y2 = y1] = floatNumbers;
    const [widthFactor, heightFactor] = factors;
    actionInputs[boxKey] = [x1, y1, x2, y2].every(isNumber)
      ? [
          (Math.round(
            ((x1 + x2) / 2) * screenContext?.width * widthFactor,
          ) /
            widthFactor) *
            (scaleFactor ?? 1), // ☆Multiplied here☆
          (Math.round(
            ((y1 + y2) / 2) * screenContext?.height * heightFactor,
          ) /
            heightFactor) *
            (scaleFactor ?? 1), // ☆Multiplied here☆
        ]
      : [];
  }
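To make the effect of that multiplication concrete, here is a standalone sketch of the same arithmetic. `mapBoxToScreen` is a made-up name that only reproduces the expression from the excerpt above; it is not the real parser. The 1512 x 982 size is the "default" Retina resolution reported earlier in this thread, and the scale factor of 2 is an assumption based on that report's 3024 x 1964 physical resolution.

```typescript
// Sketch: the coordinate mapping from the parser excerpt, isolated.
// Hypothetical helper; same arithmetic, nothing else from the real parser.
function mapBoxToScreen(
  box: number[],
  screenContext: { width: number; height: number },
  scaleFactor: number = 1,
  factors: [number, number] = [1000, 1000],
): [number, number] {
  // A two-value box is treated as a point: x2 and y2 fall back to x1 and y1.
  const [x1, y1, x2 = x1, y2 = y1] = box;
  const [widthFactor, heightFactor] = factors;
  return [
    (Math.round(((x1 + x2) / 2) * screenContext.width * widthFactor) / widthFactor) * scaleFactor,
    (Math.round(((y1 + y2) / 2) * screenContext.height * heightFactor) / heightFactor) * scaleFactor,
  ];
}

// Same model output (centre of the screen), same logical screen size,
// two different scale factors.
const logical = { width: 1512, height: 982 };
console.log(mapBoxToScreen([0.5, 0.5], logical, 1)); // [ 756, 491 ]
console.log(mapBoxToScreen([0.5, 0.5], logical, 2)); // [ 1512, 982 ]
```

The two results differ by exactly the scale factor, so whichever coordinate space the operator expects (logical or physical pixels), passing the wrong combination of screen size and scaleFactor would shift every click proportionally. Which combination is the correct one on macOS is not established in this thread.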
Sorry, but I just tried v0.1.3 and the problem is still there on macOS. I'm not sure this bug should be closed.
Mode: "Browser" Prompt: "Go to lemonde.fr"
Example of invalid click location:
Yes, it happens to me on Windows too. I've tried every quantized GGUF model, from small to very large, and tried all possible resolutions, but it always clicks incorrectly. Only if you're really lucky will it click correctly once. If there were a reasonable guide I would try it with vLLM, but I already tried to install it with Docker and only got errors.