[Bug]: Missing clicks

Open NiksaSprinting opened this issue 7 months ago • 7 comments

Version

v0.1.1

Model

UI-TARS-1.5-7B

Deployment Method

Cloud

Issue Description

OS Windows 11

I gave a seemingly simple command to open a web page and switch tenants (and later to try to log in). You can see on the screenshot below that it notices the option and is trying to click it - but it's using wrong coordinates every time.

Click coordinates from the screenshot: click(start_box: [0.07375776397515528,0.034482758620689655,0.07375776397515528,0.034482758620689655])

Image

My screen resolution is 1920*1080 so it's nothing unusual. Also, the app opened Chrome in windowed mode by itself, and I did not resize it at all.
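For reference, here is a minimal sketch of where the reported `start_box` would land on a 1920x1080 screen. It assumes the four values are fractions of the full screen (0 to 1) and that the click goes to the centre of the box; `boxCenterToPixels` is a hypothetical helper for illustration, not a function from the UI-TARS-desktop source.

```typescript
// Hypothetical helper, not taken from the UI-TARS-desktop source.
// Assumption: each box value is a fraction (0..1) of the full screen size.
type Box = [number, number, number, number]; // [x1, y1, x2, y2]

function boxCenterToPixels(
  box: Box,
  width: number,
  height: number,
): [number, number] {
  const [x1, y1, x2, y2] = box;
  // Click target is the centre of the box, scaled to pixels.
  return [
    Math.round(((x1 + x2) / 2) * width),
    Math.round(((y1 + y2) / 2) * height),
  ];
}

// The start_box reported in the screenshot above.
const reported: Box = [
  0.07375776397515528,
  0.034482758620689655,
  0.07375776397515528,
  0.034482758620689655,
];

const [px, py] = boxCenterToPixels(reported, 1920, 1080);
console.log(px, py); // 142 37
```

Under those assumptions the click lands at roughly (142, 37), which can be compared against the position of the intended element in the screenshot.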

Error Logs

No response

NiksaSprinting avatar May 13 '25 13:05 NiksaSprinting

What is the VLM Provider you select?

ZhaoHeh avatar May 14 '25 08:05 ZhaoHeh

Ah, sorry for not mentioning it. It's Hugging Face for UI-TARS 1.5

NiksaSprinting avatar May 14 '25 08:05 NiksaSprinting

I am having the same issue. Running locally with LM Studio; the model is ui-tars-1.5-7b-mlx and the provider is Hugging Face for UI-TARS-1.5.

Device: 16 inch Macbook

adntgv avatar May 16 '25 11:05 adntgv

I'm having the same issue. UI Tars Desktop latest version, using LM Studio Server on Windows 11, with the model "llm hack337/ui-tars-1.5-7b" as it has Vision and Tool use support. UI Tars Desktop connects successfully, sometimes types in the Google search box if it is already selected, and reasons correctly and logically. The only problem is that it ALWAYS misclicks, no matter where on the screen it is attempting to click. Its coordinates are off.

If it is attempting to click closer to the top left corner of the screen, it's only off by a little bit, but the further towards the centre and bottom right you go, the worse it gets. Something isn't scaling right obviously. Also tried with Gemma 3 27B hosted through LM Studio, it connects and thinks correctly but clicks in the wrong coordinates.
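An error that grows with distance from the top-left corner is what a multiplicative scale mismatch would produce. The sketch below illustrates that pattern only; `mismatch` is a hypothetical ratio between the scale the app assumes and the real one, and 1.25 is an arbitrary example value, not a measurement from this setup.

```typescript
// Illustration only. `mismatch` is a hypothetical ratio between the scale
// the app assumes and the scale actually in effect.
function clickError(target: [number, number], mismatch: number): number {
  const [x, y] = target;
  // A wrong scale moves the click from (x, y) to (x * mismatch, y * mismatch).
  const dx = x * mismatch - x;
  const dy = y * mismatch - y;
  return Math.hypot(dx, dy); // distance in pixels between intended and actual click
}

const mismatch = 1.25; // arbitrary example value
const nearOrigin = clickError([100, 50], mismatch);
const nearCentre = clickError([960, 540], mismatch);
const nearCorner = clickError([1800, 1000], mismatch);

// The error is proportional to the distance from the origin (top-left).
console.log(nearOrigin < nearCentre && nearCentre < nearCorner); // true
```

A constant offset, by contrast, would give the same error everywhere on the screen, which is not what is described here.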

See below, it attempted to click in the search box to search for something, but it missed: Image

A fix for this is URGENT, as UI Tars is unusable in its current state due to this bug.

AstralPhaser avatar May 19 '25 04:05 AstralPhaser

Hi there, you can refer to this issue https://github.com/bytedance/UI-TARS/issues/150 to check if you are meeting with problems of resolutions.

Taoran-Lu avatar May 19 '25 10:05 Taoran-Lu

Yes, my resolution is: 3840 x 2160. I've attempted lowering my screen resolution to 1080p but this did not fix the issue sadly.

AstralPhaser avatar May 19 '25 11:05 AstralPhaser

Same issue: M1 MBA, UI-Tars 1.5 7b mlx, running via lmstudio.

madeleinelmuller avatar May 21 '25 04:05 madeleinelmuller

Same issue on m4pro + lmstudio + 1.5 7b mlx ... the click location is always incorrect.

Fraer avatar Jun 01 '25 01:06 Fraer

@Fraer Are you using the latest release (v1.2.0)? If you are still on v1.1.0, I recommend updating.

At least my environment works:

  • Windows11
  • Primary monitor 1920x1080p, scale factor:1
  • Local source code deploy with pnpm. https://github.com/bytedance/UI-TARS-desktop/blob/main/CONTRIBUTING.md#run-the-application

[!Warning] UI-TARS-desktop currently supports only one screen. Even if you have multiple displays, the UI-TARS model can only see the primary monitor, and the operator works only inside it.

Information for SDK Users:

If you omit the uiTarsVersion parameter when creating a GUIAgent instance, it defaults to v1.0. UI-TARS-1.5 users must therefore set the version explicitly. Even if you pass ByteDance-Seed/UI-TARS-1.5-7B as the model parameter, no parser sets UITarsVersion automatically from the model name, and the click coordinate bug then occurs.
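The defaulting behaviour described above can be sketched as follows. Everything in this block (`ModelVersion`, `AgentOptions`, `effectiveVersion`) is a hypothetical stand-in written for illustration; it is not the SDK's real API, so check the SDK documentation for the actual constructor options.

```typescript
// Hypothetical stand-ins for illustration; these are NOT the SDK's real types.
enum ModelVersion {
  V1_0 = '1.0',
  V1_5 = '1.5',
}

interface AgentOptions {
  model: string;
  uiTarsVersion?: ModelVersion;
}

// Mirrors the behaviour described above: the model name is never inspected,
// so an omitted version silently falls back to v1.0.
function effectiveVersion(options: AgentOptions): ModelVersion {
  return options.uiTarsVersion ?? ModelVersion.V1_0;
}

// Version omitted: falls back to v1.0 even though the model name says 1.5.
const implicit = effectiveVersion({ model: 'ByteDance-Seed/UI-TARS-1.5-7B' });

// Version set explicitly: v1.5 is used.
const explicit = effectiveVersion({
  model: 'ByteDance-Seed/UI-TARS-1.5-7B',
  uiTarsVersion: ModelVersion.V1_5,
});

console.log(implicit, explicit); // 1.0 1.5
```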

meme-dayo avatar Jun 01 '25 04:06 meme-dayo

@meme-dayo

  • macOS: Sequoia 15.5 (24F74)
  • UI-Tars: Version 0.1.2 (installed via brew install --cask ui-tars)

Fraer avatar Jun 02 '25 08:06 Fraer

Still missing every single click for me. Win11, LM Studio, ui-tars-desktop 0.1.2; it can't even click on the search box. In settings, I have set the VLM provider to "Hugging Face for UI-TARS-1.5" and the VLM Model Name to "ByteDance-Seed/UI-TARS-1.5-7B". I also tried llama.cpp-server and vllm, in both browser and computer operator mode.

Mimocro avatar Jun 02 '25 14:06 Mimocro

I want to know about your primary screen monitor.

  • Screen size: 1920 x 1080p
  • Scale factor : 1, 1.25, 1.5 ...
  • Retina display?

meme-dayo avatar Jun 02 '25 14:06 meme-dayo

@meme-dayo my current setup:

  • Screen size: 1920 x 1080
  • No scale factor
  • External screen S277HK (non-Retina)

Tried with a lower resolution, 1600 x 900: same incorrect click position.

Tried with the native screen of the MacBook Pro (Retina): same incorrect click position. System report:

Displays:
Color LCD:
  Display Type:	Built-in Liquid Retina XDR Display
  Resolution:	3024 x 1964 Retina
  Main Display:	Yes
  Mirror:	Off
  Online:	Yes
  Automatically Adjust Brightness:	Yes
  Connection Type:	Internal

Tried setting the max resolution, 3024 x 1964: same incorrect click position.

However, just once, after changing from max resolution to the "default" resolution of 1512 x 982 without restarting the app, the click position became correct. I could not reproduce it by switching to max resolution and back to default again, so it's incorrect again ... really weird. It looks like there is a serious problem with the detection of the screen resolution.

Fraer avatar Jun 02 '25 23:06 Fraer

@Fraer Thank you for sharing this valuable information!

once after changing from max resolution to "default" resolution 1512 x 982 without restarting the app, the click position became correct

Perhaps, the DPR (Device Pixel Ratio) logic for Mac or Retina display needs reconsideration.

I wonder why the following value is hardcoded on Mac:

scaleFactor = 1

at apps\ui-tars\src\main\utils\screen.ts

import { screen } from 'electron';

import * as env from '@main/env';

export const getScreenSize = () => {
  const primaryDisplay = screen.getPrimaryDisplay();

  const logicalSize = primaryDisplay.size; // Logical = Physical / scaleX
  // Mac retina display scaleFactor = 1
  const scaleFactor = env.isMacOS ? 1 : primaryDisplay.scaleFactor;

  const physicalSize = {
    width: Math.round(logicalSize.width * scaleFactor),
    height: Math.round(logicalSize.height * scaleFactor),
  };
  //...
};
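To make the effect of that hardcoded value concrete, here is a standalone sketch of the same size calculation with no Electron dependency. `toPhysicalSize` is a hypothetical helper, and the scale factor of 2 is an assumption based on the system report above (1512 x 982 default versus 3024 x 1964 Retina), not a value read from the app.

```typescript
interface Size {
  width: number;
  height: number;
}

// Hypothetical standalone variant of the calculation in getScreenSize.
function toPhysicalSize(logical: Size, scaleFactor: number): Size {
  return {
    width: Math.round(logical.width * scaleFactor),
    height: Math.round(logical.height * scaleFactor),
  };
}

// "Default" logical resolution reported for the built-in display.
const logical: Size = { width: 1512, height: 982 };

// What the isMacOS branch yields: physical size equals logical size.
const forced = toPhysicalSize(logical, 1);

// Assumption: the Retina display would otherwise report scaleFactor 2.
const retina = toPhysicalSize(logical, 2);

console.log(forced); // { width: 1512, height: 982 }
console.log(retina); // { width: 3024, height: 1964 }
```

If the screenshot sent to the model and the coordinates sent to the operator use different ones of these two sizes, every click would be off by the same ratio.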

Then scaleFactor is multiplied in at the ActionParser when mapping coordinates from the UI-TARS output to operator functions such as click(x, y).

packages\ui-tars\action-parser\src\actionParser.ts
export function parseActionVlm(
  text: string,
  factors: [number, number] = [1000, 1000],
  mode: 'bc' | 'o1' = 'bc',
  screenContext?: {
    width: number;
    height: number;
  },
  scaleFactor?: number,
  modelVer: UITarsModelVersion = UITarsModelVersion.V1_0,
): PredictionParsed[] {
//...
          if (screenContext?.width && screenContext?.height) {
            const boxKey = paramName.includes('start_box')
              ? 'start_coords'
              : 'end_coords';
            const [x1, y1, x2 = x1, y2 = y1] = floatNumbers;
            const [widthFactor, heightFactor] = factors;

            actionInputs[boxKey] = [x1, y1, x2, y2].every(isNumber)
              ? [
                  (Math.round(
                    ((x1 + x2) / 2) * screenContext?.width * widthFactor,
                  ) /
                    widthFactor) *
                    (scaleFactor ?? 1),        // ☆Multiplied here☆
                  (Math.round(
                    ((y1 + y2) / 2) * screenContext?.height * heightFactor,
                  ) /
                    heightFactor) *
                    (scaleFactor ?? 1),       // ☆Multiplied here☆
                ]
              : [];
          }
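The mapping in that excerpt can be isolated as a small function to see how scaleFactor changes the result. This is a simplified sketch, not the parser itself: `mapBoxToScreen` is a hypothetical name, and it assumes the box values have already been normalized to the 0..1 range before this step.

```typescript
// Simplified, hypothetical restatement of the mapping in the excerpt above.
// Assumption: box values are already normalized to 0..1.
function mapBoxToScreen(
  box: [number, number, number, number],
  screen: { width: number; height: number },
  scaleFactor: number = 1,
  factors: [number, number] = [1000, 1000],
): [number, number] {
  const [x1, y1, x2, y2] = box;
  const [widthFactor, heightFactor] = factors;
  return [
    // Box centre -> pixels, rounded to 1/widthFactor, then scaled.
    (Math.round(((x1 + x2) / 2) * screen.width * widthFactor) / widthFactor) *
      scaleFactor,
    (Math.round(((y1 + y2) / 2) * screen.height * heightFactor) /
      heightFactor) *
      scaleFactor,
  ];
}

const centreBox: [number, number, number, number] = [0.5, 0.5, 0.5, 0.5];
const screenSize = { width: 1512, height: 982 };

const unscaled = mapBoxToScreen(centreBox, screenSize, 1);
const doubled = mapBoxToScreen(centreBox, screenSize, 2);

console.log(unscaled); // [ 756, 491 ]
console.log(doubled); // [ 1512, 982 ]
```

With the same box and screen size, changing only scaleFactor moves the click from the centre of the screen to its bottom-right corner, so the value passed in must match the coordinate space the operator expects.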

meme-dayo avatar Jun 03 '25 09:06 meme-dayo

Sorry, but I just tried v0.1.3 and the problem is still there on macOS. I'm not sure this bug should be closed.

  • Mode: "Browser"
  • Prompt: "Go to lemonde.fr"

Example of invalid click location:

Image

Fraer avatar Jun 09 '25 15:06 Fraer

Yes, it happens to me on Windows too. I've tried every quantized GGUF model, from small to very large, and tried all possible resolutions, but it always clicks incorrectly. Only if you're really lucky will it click correctly once. If there were a reasonable guide I would try it with vllm; I already tried to install it with Docker but only got errors.

AndyZocker avatar Jun 11 '25 08:06 AndyZocker