feat(SDK): expose uiTarsVersion and update Basic Usage to click(w,h) correctly

Open meme-dayo opened this issue 6 months ago • 3 comments

Summary

Currently, the default version of the UI-TARS model is 1.0! Therefore, invoking UI-TARS-1.5 without uiTarsVersion in GUIAgent causes issues #590 and #591.

I will provide a test case soon.

Checklist

  • [x] Added or updated necessary tests (Optional).
  • [x] Updated documentation to align with changes (Optional).
  • [x] Verified no breaking changes, or prepared solutions for any occurring breaking changes (Optional).
  • [ ] My change does not involve the above items.

meme-dayo avatar May 30 '25 08:05 meme-dayo

UI-TARS-1.5 Coordinate Parsing Default Version Issue

Summary

Currently, the default version of the UI-TARS model is 1.0! Therefore, invoking UI-TARS-1.5 without uiTarsVersion in GUIAgent causes coordinate parsing issues. This PR addresses the coordinate calculation discrepancies between the V1.0 and V1.5 implementations.

Problem Description

When users specify a HuggingFace UI-TARS-1.5 model name and expect V1.5 behavior but omit the explicit uiTarsVersion parameter, the system defaults to V1.0 behavior. This causes significant coordinate calculation errors because:

  • V1.0 (default): Uses fixed DEFAULT_FACTORS=[1000, 1000] regardless of actual screen dimensions
  • V1.5 (explicit): Uses smartResizeFactors calculated from actual screen dimensions

The coordinate errors scale with:

  1. Screen size deviation from 1000x1000: The error grows in proportion to how far the screen dimension exceeds 1000 pixels
  2. Click position: The error grows with the coordinate value, so bottom-right positions show larger errors than top-left positions
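
The V1.0 arithmetic described above can be checked by hand. The following is a minimal sketch, not the parser's actual code: v10Pixel and errorPercent are hypothetical helper names, and the only behavior assumed is what this description states (divide by the fixed factor 1000, then scale by the screen dimension).

```typescript
// Fixed factor used by the V1.0 path, per DEFAULT_FACTORS=[1000, 1000].
const FIXED_FACTOR = 1000;

// Hypothetical helper: map a model coordinate to a screen pixel the way the
// V1.0 path is described to (normalize by 1000, then scale by screen size).
function v10Pixel(modelCoord: number, screenSize: number): number {
  return (modelCoord / FIXED_FACTOR) * screenSize;
}

// Error as a percentage of the screen dimension (same definition as the test).
function errorPercent(actual: number, expected: number, screenSize: number): number {
  return (Math.abs(actual - expected) / screenSize) * 100;
}

// FHD center click: the model answers x=960 on a 1920-pixel-wide screen.
const fhdX = v10Pixel(960, 1920);
console.log(fhdX.toFixed(1), errorPercent(fhdX, 960, 1920).toFixed(1)); // 1843.2 46.0

// 4K center click: x=1920 on a 3840-pixel-wide screen.
const uhdX = v10Pixel(1920, 3840);
console.log(uhdX.toFixed(1), errorPercent(uhdX, 1920, 3840).toFixed(1)); // 7372.8 142.0
```

Since the error equals coord × (screenSize / 1000 − 1) / screenSize, it vanishes only when the screen dimension is exactly 1000 pixels.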

Test Implementation

How to Run the Test

  1. File Location: Place the test file in packages/ui-tars/action-parser/test/ (the import of '../src/actionParser' resolves from there)
  2. Test File Name: parseActionVlm.error.demo.test.ts
  3. Run Command:
pnpm run test parseActionVlm.error.demo.test.ts

Test Code

The following code demonstrates the gap between the coordinates output by UI-TARS-1.5 and the coordinates actually produced by the parser:

parseActionVlm.error.demo.test.ts
/*
 * Comprehensive test for UI-TARS-1.5 coordinate parsing verification
 * This test demonstrates the coordinate calculation differences between V1.0 and V1.5
 * 
 * V1.0 (default): Uses fixed DEFAULT_FACTORS=[1000,1000] causing screen-size dependent errors
 * V1.5 (explicit): Uses smartResizeFactors calculated for actual screen dimensions - the correct approach
 */
import { describe, it, expect } from 'vitest';
import { parseActionVlm } from '../src/actionParser';
import { UITarsModelVersion } from '@ui-tars/shared/types';

describe('parseActionVlm - UI-TARS-1.5 Coordinate Fix Verification', () => {
  /**
   * Helper function to parse and log coordinate results for both versions
   */
  const parseAndLogBothVersions = (
    input: string,
    description: string,
    physicalScreen: { width: number; height: number },
    scaleFactor = 1.0,
  ) => {
    // Calculate virtual (logical) screen dimensions
    const virtualScreen = {
      width: physicalScreen.width / scaleFactor,
      height: physicalScreen.height / scaleFactor
    };

    console.log(`\n\n=== ${description} ===`);
    console.log('Input:', input);
    console.log('Physical:', physicalScreen, 'Virtual:', virtualScreen, 'Scale:', scaleFactor);

    // Parse with V1.0 (default behavior - uses fixed DEFAULT_FACTORS=[1000,1000])
    const resultV10 = parseActionVlm(
      input,
      [1000, 1000],
      'bc',
      virtualScreen,
      scaleFactor,
      UITarsModelVersion.V1_0
    );

    // Parse with V1.5 (explicit version - uses smartResizeFactors for accurate coordinates)
    const resultV15 = parseActionVlm(
      input,
      [1000, 1000],
      'bc',
      virtualScreen,
      scaleFactor,
      UITarsModelVersion.V1_5
    );

    // Helper function to add color coding for error rates
    const formatErrorRate = (errorRate: number) => {
      if (errorRate >= 5) return `\x1b[91m${errorRate.toFixed(1)}%\x1b[0m`; // Bright red for >=5%
      return `\x1b[92m${errorRate.toFixed(1)}%\x1b[0m`; // Green for <5%
    };

    // Extract input coordinates and convert to physical screen coordinates
    const inputX = parseInt(input.match(/\[(\d+)/)?.[1] || '0');
    const inputY = parseInt(input.match(/,\s*(\d+)/)?.[1] || '0');
    const physicalInputX = inputX * scaleFactor;
    const physicalInputY = inputY * scaleFactor;

    // Extract and log V1.0 results (shows screen-size dependent errors)
    const actionV10 = resultV10[0];
    const boxV10 = JSON.parse(actionV10.action_inputs.start_box);
    const coordsV10 = actionV10.action_inputs.start_coords;
    const errorXV10 = ((Math.abs(coordsV10[0] - physicalInputX) / physicalScreen.width) * 100);
    const errorYV10 = ((Math.abs(coordsV10[1] - physicalInputY) / physicalScreen.height) * 100);
    console.log('\nV1.0 Results (DEFAULT_FACTORS=[1000,1000] - causes screen-size dependent errors):');
    console.log('  Normalized box:', boxV10.map(v => parseFloat(v.toFixed(4))));
    console.log('  Screen coords:', coordsV10, `<- w: ${formatErrorRate(errorXV10)}, h: ${formatErrorRate(errorYV10)} error`);

    // Extract and log V1.5 results (correct approach)
    const actionV15 = resultV15[0];
    const boxV15 = JSON.parse(actionV15.action_inputs.start_box);
    const coordsV15 = actionV15.action_inputs.start_coords;
    const errorXV15 = ((Math.abs(coordsV15[0] - physicalInputX) / physicalScreen.width) * 100);
    const errorYV15 = ((Math.abs(coordsV15[1] - physicalInputY) / physicalScreen.height) * 100);
    console.log('\nV1.5 Results (smartResizeFactors - correct coordinate calculation):');
    console.log('  Normalized box:', boxV15.map(v => parseFloat(v.toFixed(4))));
    console.log('  Screen coords:', coordsV15, `<- w: ${formatErrorRate(errorXV15)}, h: ${formatErrorRate(errorYV15)} error`);

    return { actionV10, actionV15, boxV10, boxV15 };
  };

  it('should demonstrate coordinate calculation differences on 1920x1080 display', () => {
    const screenContext = { width: 1920, height: 1080 };
    const scaleFactor = 1.0;
    
    // Test center position click
    const centerInput = "Action: click(start_box='[960, 540, 960, 540]')";
    const { actionV10, actionV15, boxV10, boxV15 } = parseAndLogBothVersions(
      centerInput, 
      'FHD 1920x1080 Center Position Test',
      screenContext,
      scaleFactor
    );

    // Verify both parsed successfully
    expect(actionV10.action_type).toBe('click');
    expect(actionV15.action_type).toBe('click');

    // V1.0 shows incorrect normalization due to fixed factors
    expect(boxV10[0]).toBeCloseTo(0.96, 3); // Should be 0.5 but shows 0.96
    expect(boxV10[1]).toBeCloseTo(0.54, 3); // Should be 0.5 but shows 0.54

    // V1.5 provides correct normalization
    expect(boxV15[0]).toBeCloseTo(0.5, 1); // Correctly normalized to center
    expect(boxV15[1]).toBeCloseTo(0.5, 1); // Correctly normalized to center

    // Test corner positions to show the extent of the error
    const positions = [
      { name: 'Top-left corner', coords: '[200, 100, 200, 100]' },
      { name: 'Bottom-right corner', coords: '[1800, 900, 1800, 900]' }
    ];

    positions.forEach(({ name, coords }) => {
      const input = `Action: click(start_box='${coords}')`;
      const { actionV10, actionV15 } = parseAndLogBothVersions(
        input, 
        `FHD 1920x1080 ${name}`,
        screenContext,
        scaleFactor
      );
      
      // Verify parsing succeeded for both versions
      expect(actionV10.action_type).toBe('click');
      expect(actionV15.action_type).toBe('click');
    });
  });

  it('should demonstrate coordinate calculation differences on WQHD 2560x1440 with 1.25 scale', () => {
    const physicalScreen = { width: 2560, height: 1440 };
    const scaleFactor = 1.25;
    
    // Test center position click - input coordinates are for virtual screen (2048x1152)
    const centerInput = "Action: click(start_box='[1024, 576, 1024, 576]')";
    const { actionV10, actionV15, boxV10, boxV15 } = parseAndLogBothVersions(
      centerInput, 
      'WQHD 2560x1440 Center Position Test (Scale 1.25)',
      physicalScreen,
      scaleFactor
    );

    // Verify both parsed successfully
    expect(actionV10.action_type).toBe('click');
    expect(actionV15.action_type).toBe('click');

    // V1.0 will show different errors compared to FHD due to screen size dependency
    expect(boxV10[0]).toBeCloseTo(1.024, 2); // Error on virtual screen size
    expect(boxV10[1]).toBeCloseTo(0.576, 2); // Different error pattern

    // V1.5 provides correct normalization regardless of screen size
    expect(boxV15[0]).toBeCloseTo(0.5, 1);
    expect(boxV15[1]).toBeCloseTo(0.5, 1);
  });

  it('should demonstrate coordinate calculation differences on 4K 3840x2160 display', () => {
    const screenContext = { width: 3840, height: 2160 };
    const scaleFactor = 1.0;
    
    // Test center position click - the error will be even more pronounced on 4K
    const centerInput = "Action: click(start_box='[1920, 1080, 1920, 1080]')";
    const { actionV10, actionV15, boxV10, boxV15 } = parseAndLogBothVersions(
      centerInput, 
      '4K 3840x2160 Center Position Test',
      screenContext,
      scaleFactor
    );

    // Verify both parsed successfully
    expect(actionV10.action_type).toBe('click');
    expect(actionV15.action_type).toBe('click');

    // V1.0 shows severe errors on 4K due to screen size dependency
    expect(boxV10[0]).toBeCloseTo(1.92, 2); // Massive error - nearly double!
    expect(boxV10[1]).toBeCloseTo(1.08, 2); // Also exceeds normalized range

    // V1.5 maintains correct normalization even on 4K
    expect(boxV15[0]).toBeCloseTo(0.5, 1);
    expect(boxV15[1]).toBeCloseTo(0.5, 1);
  });

  it('should demonstrate the root cause and solution', () => {
    console.log('\n\n=== Root Cause Analysis ===');
    console.log('Problem: UI-TARS model 1.5 without explicit version defaults to V1.0 behavior');
    console.log('V1.0 Issue: Uses fixed DEFAULT_FACTORS=[1000,1000] regardless of actual screen size');
    console.log('V1.5 Solution: Uses smartResizeFactors calculated from actual screen dimensions');
    console.log('');
    console.log('Impact: The larger the screen deviates from 1000x1000, the worse the coordinate errors become');

    // This test always passes - it's just for documentation
    expect(true).toBe(true);
  });
});

Test Results (Partial)

The test demonstrates how coordinate errors scale with screen size and position. Here, "V1.0" corresponds to creating a GUIAgent instance without uiTarsVersion.

|                           | FHD Center         | FHD Bottom-right  | 4K Center            |
|---------------------------|--------------------|-------------------|----------------------|
| Resolution                | 1920×1080          | 1920×1080         | 3840×2160            |
| Position                  | Center             | Corner            | Center               |
| Input Coords              | [960, 540]         | [1800, 900]       | [1920, 1080]         |
| V1.0 Output               | [1843.2, 583.2]    | [3456, 972]       | [7372.8, 2332.8]     |
| V1.0 Error (Width/Height) | 46.0% / 4.0%       | 86.3% / 6.7%      | 142.0% / 58.0%       |
| V1.5 Output               | [954.037, 534.066] | [1788.82, 890.11] | [1922.002, 1082.004] |
| V1.5 Error (Width/Height) | 0.3% / 0.5%        | 0.6% / 0.9%       | 0.1% / 0.1%          |
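
The V1.5 outputs in the table are consistent with factors obtained by rounding each screen dimension to the nearest multiple of 28. The sketch below rests on that assumption, inferred from the numbers alone: the real smartResizeFactors implementation is not shown in this thread and may also clamp to a pixel budget, which is omitted here. IMAGE_FACTOR, resizedDimension and v15Pixel are hypothetical names.

```typescript
// Assumed granularity of the resized image; chosen because it reproduces
// the table values, not because it was read from the parser source.
const IMAGE_FACTOR = 28;

// Hypothetical helper: round a screen dimension to the nearest multiple of
// IMAGE_FACTOR. Min/max pixel clamping is deliberately left out.
function resizedDimension(screenSize: number): number {
  return Math.round(screenSize / IMAGE_FACTOR) * IMAGE_FACTOR;
}

// Map a model coordinate (in resized-image pixels) back to a screen pixel.
function v15Pixel(modelCoord: number, screenSize: number): number {
  return (modelCoord / resizedDimension(screenSize)) * screenSize;
}

// FHD: 1920 -> 1932 and 1080 -> 1092 under the rounding assumption.
console.log(resizedDimension(1920), resizedDimension(1080)); // 1932 1092
console.log(v15Pixel(960, 1920).toFixed(3), v15Pixel(540, 1080).toFixed(3)); // 954.037 534.066

// 4K: 3840 -> 3836 and 2160 -> 2156.
console.log(v15Pixel(1920, 3840).toFixed(3), v15Pixel(1080, 2160).toFixed(3)); // 1922.002 1082.004
```

Because the factor tracks the real screen size, the residual error stays below 1% in every column instead of growing with resolution.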

Key Findings

  • Position matters in V1.0: Bottom-right corners show worse errors than center positions
  • 4K displays are severely affected: V1.0 produces coordinates that are off by >140% horizontally

This behavior is described in the issues. It is exactly the problem I faced when calling UI-TARS-SDK as shown in Basic Usage: https://github.com/bytedance/UI-TARS-desktop/tree/main/packages/ui-tars/sdk#basic-usage

ToDo

SDK users like me will no longer suffer from this problem if this PR is merged.

However, the issues were reported by UI-TARS-desktop users, so apps\ui-tars\src\main\services\runAgent.ts should be fixed as well.

meme-dayo avatar May 30 '25 16:05 meme-dayo

I traced how UI-TARS-desktop calls GUIAgent.

What I thought

My guess is that UI-TARS-1.5 users forget, or fail, to set the environment variable

VLM_PROVIDER='Hugging Face for UI-TARS-1.5'

so the default UI-TARS version, 1.0, is applied, which causes the issues:

default:
      return UITarsModelVersion.V1_0;

The trace

apps\ui-tars\src\main\services\runAgent.ts

import { StatusEnum, UITarsModelVersion } from '@ui-tars/shared/types';
import { GUIAgent, type GUIAgentConfig } from '@ui-tars/sdk';
import { SettingStore } from '@main/store/setting';

import {
  AppState,
  SearchEngineForSettings,
  VLMProviderV2,
} from '@main/store/types';

const getModelVersion = (
  provider: VLMProviderV2 | undefined,
): UITarsModelVersion => {
  switch (provider) {
    case VLMProviderV2.ui_tars_1_5:
      return UITarsModelVersion.V1_5;
    case VLMProviderV2.ui_tars_1_0:
      return UITarsModelVersion.V1_0;
    case VLMProviderV2.doubao_1_5:
      return UITarsModelVersion.DOUBAO_1_5_15B;
    case VLMProviderV2.doubao_1_5_vl:
      return UITarsModelVersion.DOUBAO_1_5_20B;
    default:
      // The default version is V1.0, so UI-TARS-1.5 users must set UITarsModelVersion.V1_5 explicitly.
      return UITarsModelVersion.V1_0;
  }
};

export const runAgent = async (
  setState: (state: AppState) => void,
  getState: () => AppState,
) => {
  const settings = SettingStore.getStore();      // modelVersion comes from settings
  const modelVersion = getModelVersion(settings.vlmProvider); 

  //...
  const guiAgent = new GUIAgent({
    model: {
      baseURL: settings.vlmBaseUrl,
      apiKey: settings.vlmApiKey,
      model: settings.vlmModelName,
    },
    systemPrompt: getSpByModelVersion(modelVersion),
    logger,
    signal: abortController?.signal,
    operator: operator,
    onData: handleData,
    onError: (params) => {
      const { error } = params;
      logger.error('[onGUIAgentError]', settings, error);
      //...
    },
    retry: {
      //...
    },
    maxLoopCount: settings.maxLoopCount,
    loopIntervalInMs: settings.loopIntervalInMs,
    uiTarsVersion: modelVersion,       //  if UI-TARS-1.5 ⇒ must set: UITarsModelVersion.V1_5
  });
};

apps/ui-tars/src/main/store/types.ts

export enum VLMProviderV2 {
  ui_tars_1_0 = 'Hugging Face for UI-TARS-1.0',
  ui_tars_1_5 = 'Hugging Face for UI-TARS-1.5',
  doubao_1_5 = 'VolcEngine Ark for Doubao-1.5-UI-TARS',
  doubao_1_5_vl = 'VolcEngine Ark for Doubao-1.5-thinking-vision-pro',
}

apps\ui-tars\src\main\store\setting.ts

import * as env from '@main/env';
import { LocalStore, SearchEngineForSettings, VLMProviderV2 } from './types';

export const DEFAULT_SETTING: LocalStore = {
  language: 'en',
  vlmProvider: (env.vlmProvider as VLMProviderV2) || '',     // a VLMProviderV2 value if VLM_PROVIDER is set; otherwise an empty string
  vlmBaseUrl: env.vlmBaseUrl || '',
  vlmApiKey: env.vlmApiKey || '',
  vlmModelName: env.vlmModelName || '',
  //...
};

export class SettingStore {
  private static instance: ElectronStore<LocalStore>;

  public static getInstance(): ElectronStore<LocalStore> {
    if (!SettingStore.instance) {
      SettingStore.instance = new ElectronStore<LocalStore>({
        name: 'ui_tars.setting',
        defaults: DEFAULT_SETTING,
      });

      SettingStore.instance.onDidAnyChange((newValue, oldValue) => {
        //...
      });
    }
    return SettingStore.instance;
  }

  public static get<K extends keyof LocalStore>(key: K): LocalStore[K] {
    return SettingStore.getInstance().get(key);
  }

  public static getStore(): LocalStore {
    return SettingStore.getInstance().store;
  }
  //... other methods
}

apps/ui-tars/src/main/env.ts

import os from 'node:os';

import dotenv from 'dotenv';

dotenv.config();
//...
export const vlmProvider = process.env.VLM_PROVIDER;    // UI-TARS-1.5 user must set: 'Hugging Face for UI-TARS-1.5'
export const vlmBaseUrl = process.env.VLM_BASE_URL;
export const vlmApiKey = process.env.VLM_API_KEY;
export const vlmModelName = process.env.VLM_MODEL_NAME;
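
The fallback traced above can be reproduced in isolation. This is a sketch with local stand-ins rather than the app's real modules: Provider, Version and resolveVersion are hypothetical names, the provider strings are copied from the VLMProviderV2 enum quoted above, and the Version string values are placeholders, not the real UITarsModelVersion values.

```typescript
// Stand-in for VLMProviderV2 (only the two Hugging Face entries).
enum Provider {
  UiTars10 = 'Hugging Face for UI-TARS-1.0',
  UiTars15 = 'Hugging Face for UI-TARS-1.5',
}

// Stand-in for UITarsModelVersion; the string values are placeholders.
enum Version {
  V1_0 = 'v1.0',
  V1_5 = 'v1.5',
}

// Same shape as getModelVersion in runAgent.ts: anything unrecognized,
// including an empty string, falls through to V1_0.
function resolveVersion(provider: string | undefined): Version {
  switch (provider) {
    case Provider.UiTars15:
      return Version.V1_5;
    case Provider.UiTars10:
      return Version.V1_0;
    default:
      return Version.V1_0;
  }
}

// Mirrors `env.vlmProvider || ''` in setting.ts when VLM_PROVIDER is unset.
const unsetProvider: string | undefined = undefined;
console.log(resolveVersion(unsetProvider || '')); // v1.0
console.log(resolveVersion('Hugging Face for UI-TARS-1.5')); // v1.5
```

With VLM_PROVIDER unset, the empty string reaches the default branch, so a 1.5 model is parsed with V1.0 rules unless the provider is chosen in the settings UI.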

ToDo

  • [x] Understand env settings via UI-TARS-desktop

meme-dayo avatar May 31 '25 02:05 meme-dayo

New finding💡

Today I tested the latest version (v0.1.2+) of UI-TARS-desktop, built from source. I opened the settings and selected Hugging Face for UI-TARS-1.5 in the drop-down list.

However, the coordinate issues (reported in v0.1.1) did not occur!

Conclusion

It seems no fix is needed in UI-TARS-desktop for the click(w,h) coordinate issues. This means only UI-TARS-SDK users still face this problem, because uiTarsVersion is missing from the Basic Usage example for GUIAgent. I hope this PR will be approved soon.

meme-dayo avatar May 31 '25 14:05 meme-dayo

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 6.85%. Comparing base (60fa69e) to head (a474a86). Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff            @@
##            main    #645      +/-   ##
========================================
+ Coverage   6.84%   6.85%   +0.01%     
========================================
  Files        303     303              
  Lines       9838    9838              
  Branches    1921    1921              
========================================
+ Hits         673     674       +1     
+ Misses      9060    9059       -1     
  Partials     105     105              

codecov[bot] avatar Jun 03 '25 06:06 codecov[bot]

A very detailed analysis and explanation, thank you for your contribution.

ycjcl868 avatar Jun 05 '25 16:06 ycjcl868