feat(SDK): expose uiTarsVersion and update Basic Usage to click(w,h) correctly
Summary
Currently, the default UI-TARS model version is 1.0. Therefore, invoking UI-TARS-1.5 in GUIAgent without uiTarsVersion causes issues #590 and #591.
I will add a test case soon.
Checklist
- [x] Added or updated necessary tests (Optional).
- [x] Updated documentation to align with changes (Optional).
- [x] Verified no breaking changes, or prepared solutions for any occurring breaking changes (Optional).
- [ ] My change does not involve the above items.
UI-TARS-1.5 Coordinate Parsing Default Version Issue
Summary
Currently, the default UI-TARS model version is 1.0. Therefore, invoking UI-TARS-1.5 in GUIAgent without uiTarsVersion causes coordinate parsing issues. This PR addresses the coordinate calculation discrepancies between the V1.0 and V1.5 implementations.
Problem Description
When users specify a HuggingFace UI-TARS-1.5 model name and expect V1.5 behavior but omit the explicit uiTarsVersion parameter, the system defaults to V1.0 behavior. This causes significant coordinate calculation errors because:
- V1.0 (default): Uses fixed DEFAULT_FACTORS=[1000, 1000] regardless of actual screen dimensions
- V1.5 (explicit): Uses smartResizeFactors calculated from actual screen dimensions
The coordinate errors scale with:
- Screen size deviation from 1000x1000: The further the screen dimensions are from 1000x1000, the larger the errors
- Click position: Bottom-right positions show larger errors than top-left positions
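A minimal sketch of why the V1.0 path drifts, assuming it normalizes each coordinate by the fixed factor 1000 and then scales by the screen size. That assumption matches the V1.0 outputs reported in this PR, but the helper names `mapWithFixedFactors` and `errorPercent` are hypothetical and not part of the SDK. Under that assumption the horizontal offset is x · |width/1000 − 1|, so it grows in proportion to both the coordinate and the screen's deviation from 1000:

```typescript
// Sketch only: assumes V1.0 divides by fixed factors and multiplies by screen size.
// All names here are hypothetical stand-ins, not SDK exports.
type ScreenSize = { width: number; height: number };

const FIXED_FACTORS: [number, number] = [1000, 1000];

function mapWithFixedFactors(x: number, y: number, size: ScreenSize): [number, number] {
  // Normalize by the fixed factor, then scale to the actual screen dimensions.
  return [
    (x / FIXED_FACTORS[0]) * size.width,
    (y / FIXED_FACTORS[1]) * size.height,
  ];
}

function errorPercent(actual: number, expected: number, extent: number): number {
  // Same error metric as the demo test: absolute offset relative to the screen extent.
  return (Math.abs(actual - expected) / extent) * 100;
}

const fhdSize: ScreenSize = { width: 1920, height: 1080 };
const [fixedX, fixedY] = mapWithFixedFactors(960, 540, fhdSize);

console.log(fixedX.toFixed(1), fixedY.toFixed(1)); // 1843.2 583.2
console.log(errorPercent(fixedX, 960, fhdSize.width).toFixed(1)); // 46.0
console.log(errorPercent(fixedY, 540, fhdSize.height).toFixed(1)); // 4.0
```

The width error is much larger than the height error here only because 1920 is further from 1000 than 1080 is.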
Test Implementation
How to Run the Test
- File Location: Place the test file at the same level as packages/ui-tars/action-parser/test/
- Test File Name: parseActionVlm.error.demo.test.ts
- Run Command:
pnpm run test parseActionVlm.error.demo.test.ts
Test Code
The following code demonstrates the gap between the coordinates output by UI-TARS-1.5 and the actual screen coordinates:
parseActionVlm.error.demo.test.ts
/*
* Comprehensive test for UI-TARS-1.5 coordinate parsing verification
* This test demonstrates the coordinate calculation differences between V1.0 and V1.5
*
* V1.0 (default): Uses fixed DEFAULT_FACTORS=[1000,1000] causing screen-size dependent errors
* V1.5 (explicit): Uses smartResizeFactors calculated for actual screen dimensions - the correct approach
*/
import { describe, it, expect } from 'vitest';
import { parseActionVlm } from '../src/actionParser';
import { UITarsModelVersion } from '@ui-tars/shared/types';
describe('parseActionVlm - UI-TARS-1.5 Coordinate Fix Verification', () => {
/**
* Helper function to parse and log coordinate results for both versions
*/
const parseAndLogBothVersions = (input, description, physicalScreen, scaleFactor = 1.0) => {
// Calculate virtual (logical) screen dimensions
const virtualScreen = {
width: physicalScreen.width / scaleFactor,
height: physicalScreen.height / scaleFactor
};
console.log(`\n\n=== ${description} ===`);
console.log('Input:', input);
console.log('Physical:', physicalScreen, 'Virtual:', virtualScreen, 'Scale:', scaleFactor);
// Parse with V1.0 (default behavior - uses fixed DEFAULT_FACTORS=[1000,1000])
const resultV10 = parseActionVlm(
input,
[1000, 1000],
'bc',
virtualScreen,
scaleFactor,
UITarsModelVersion.V1_0
);
// Parse with V1.5 (explicit version - uses smartResizeFactors for accurate coordinates)
const resultV15 = parseActionVlm(
input,
[1000, 1000],
'bc',
virtualScreen,
scaleFactor,
UITarsModelVersion.V1_5
);
// Helper function to add color coding for error rates
const formatErrorRate = (errorRate) => {
if (errorRate >= 5) return `\x1b[91m${errorRate.toFixed(1)}%\x1b[0m`; // Bright red for >=5%
return `\x1b[92m${errorRate.toFixed(1)}%\x1b[0m`; // Green for <5%
};
// Extract input coordinates and convert to physical screen coordinates
const inputX = parseInt(input.match(/\[(\d+)/)?.[1] || '0');
const inputY = parseInt(input.match(/,\s*(\d+)/)?.[1] || '0');
const physicalInputX = inputX * scaleFactor;
const physicalInputY = inputY * scaleFactor;
// Extract and log V1.0 results (shows screen-size dependent errors)
const actionV10 = resultV10[0];
const boxV10 = JSON.parse(actionV10.action_inputs.start_box);
const coordsV10 = actionV10.action_inputs.start_coords;
const errorXV10 = ((Math.abs(coordsV10[0] - physicalInputX) / physicalScreen.width) * 100);
const errorYV10 = ((Math.abs(coordsV10[1] - physicalInputY) / physicalScreen.height) * 100);
console.log('\nV1.0 Results (DEFAULT_FACTORS=[1000,1000] - causes screen-size dependent errors):');
console.log(' Normalized box:', boxV10.map(v => parseFloat(v.toFixed(4))));
console.log(' Screen coords:', coordsV10, `<- w: ${formatErrorRate(errorXV10)}, h: ${formatErrorRate(errorYV10)} error`);
// Extract and log V1.5 results (correct approach)
const actionV15 = resultV15[0];
const boxV15 = JSON.parse(actionV15.action_inputs.start_box);
const coordsV15 = actionV15.action_inputs.start_coords;
const errorXV15 = ((Math.abs(coordsV15[0] - physicalInputX) / physicalScreen.width) * 100);
const errorYV15 = ((Math.abs(coordsV15[1] - physicalInputY) / physicalScreen.height) * 100);
console.log('\nV1.5 Results (smartResizeFactors - correct coordinate calculation):');
console.log(' Normalized box:', boxV15.map(v => parseFloat(v.toFixed(4))));
console.log(' Screen coords:', coordsV15, `<- w: ${formatErrorRate(errorXV15)}, h: ${formatErrorRate(errorYV15)} error`);
return { actionV10, actionV15, boxV10, boxV15 };
};
it('should demonstrate coordinate calculation differences on 1920x1080 display', () => {
const screenContext = { width: 1920, height: 1080 };
const scaleFactor = 1.0;
// Test center position click
const centerInput = "Action: click(start_box='[960, 540, 960, 540]')";
const { actionV10, actionV15, boxV10, boxV15 } = parseAndLogBothVersions(
centerInput,
'FHD 1920x1080 Center Position Test',
screenContext,
scaleFactor
);
// Verify both parsed successfully
expect(actionV10.action_type).toBe('click');
expect(actionV15.action_type).toBe('click');
// V1.0 shows incorrect normalization due to fixed factors
expect(boxV10[0]).toBeCloseTo(0.96, 3); // Should be 0.5 but shows 0.96
expect(boxV10[1]).toBeCloseTo(0.54, 3); // Should be 0.5 but shows 0.54
// V1.5 provides correct normalization
expect(boxV15[0]).toBeCloseTo(0.5, 1); // Correctly normalized to center
expect(boxV15[1]).toBeCloseTo(0.5, 1); // Correctly normalized to center
// Test corner positions to show the extent of the error
const positions = [
{ name: 'Top-left corner', coords: '[200, 100, 200, 100]' },
{ name: 'Bottom-right corner', coords: '[1800, 900, 1800, 900]' }
];
positions.forEach(({ name, coords }) => {
const input = `Action: click(start_box='${coords}')`;
const { actionV10, actionV15 } = parseAndLogBothVersions(
input,
`FHD 1920x1080 ${name}`,
screenContext,
scaleFactor
);
// Verify parsing succeeded for both versions
expect(actionV10.action_type).toBe('click');
expect(actionV15.action_type).toBe('click');
});
});
it('should demonstrate coordinate calculation differences on WQHD 2560x1440 with 1.25 scale', () => {
const physicalScreen = { width: 2560, height: 1440 };
const scaleFactor = 1.25;
// Test center position click - input coordinates are for virtual screen (2048x1152)
const centerInput = "Action: click(start_box='[1024, 576, 1024, 576]')";
const { actionV10, actionV15, boxV10, boxV15 } = parseAndLogBothVersions(
centerInput,
'WQHD 2560x1440 Center Position Test (Scale 1.25)',
physicalScreen,
scaleFactor
);
// Verify both parsed successfully
expect(actionV10.action_type).toBe('click');
expect(actionV15.action_type).toBe('click');
// V1.0 will show different errors compared to FHD due to screen size dependency
expect(boxV10[0]).toBeCloseTo(1.024, 2); // Error on virtual screen size
expect(boxV10[1]).toBeCloseTo(0.576, 2); // Different error pattern
// V1.5 provides correct normalization regardless of screen size
expect(boxV15[0]).toBeCloseTo(0.5, 1);
expect(boxV15[1]).toBeCloseTo(0.5, 1);
});
it('should demonstrate coordinate calculation differences on 4K 3840x2160 display', () => {
const screenContext = { width: 3840, height: 2160 };
const scaleFactor = 1.0;
// Test center position click - the error will be even more pronounced on 4K
const centerInput = "Action: click(start_box='[1920, 1080, 1920, 1080]')";
const { actionV10, actionV15, boxV10, boxV15 } = parseAndLogBothVersions(
centerInput,
'4K 3840x2160 Center Position Test',
screenContext,
scaleFactor
);
// Verify both parsed successfully
expect(actionV10.action_type).toBe('click');
expect(actionV15.action_type).toBe('click');
// V1.0 shows severe errors on 4K due to screen size dependency
expect(boxV10[0]).toBeCloseTo(1.92, 2); // Massive error - nearly double!
expect(boxV10[1]).toBeCloseTo(1.08, 2); // Also exceeds normalized range
// V1.5 maintains correct normalization even on 4K
expect(boxV15[0]).toBeCloseTo(0.5, 1);
expect(boxV15[1]).toBeCloseTo(0.5, 1);
});
it('should demonstrate the root cause and solution', () => {
console.log('\n\n=== Root Cause Analysis ===');
console.log('Problem: UI-TARS model 1.5 without explicit version defaults to V1.0 behavior');
console.log('V1.0 Issue: Uses fixed DEFAULT_FACTORS=[1000,1000] regardless of actual screen size');
console.log('V1.5 Solution: Uses smartResizeFactors calculated from actual screen dimensions');
console.log('');
console.log('Impact: The larger the screen deviates from 1000x1000, the worse the coordinate errors become');
// This test always passes - it's just for documentation
expect(true).toBe(true);
});
});
Test Results (Partial)
The test demonstrates how coordinate errors scale with screen size and position:
"V1.0" means that no uiTarsVersion was passed when creating the GUIAgent instance.
| | FHD Center | FHD Bottom-right | 4K Center |
|---|---|---|---|
| Resolution | 1920×1080 | 1920×1080 | 3840×2160 |
| Position | Center | Corner | Center |
| Input Coords | [960, 540] | [1800, 900] | [1920, 1080] |
| V1.0 Output | [1843.2, 583.2] | [3456, 972] | [7372.8, 2332.8] |
| V1.0 Error (Width/Height) | 46.0% / 4.0% | 86.3% / 6.7% | 142.0% / 58.0% |
| V1.5 Output | [954.037, 534.066] | [1788.82, 890.11] | [1922.002, 1082.004] |
| V1.5 Error (Width/Height) | 0.3% / 0.5% | 0.6% / 0.9% | 0.1% / 0.1% |
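The V1.5 outputs in the table are consistent with factors obtained by rounding each screen dimension to the nearest multiple of 28. The sketch below reproduces the numbers under that assumption. The multiple of 28 is inferred from the reported values rather than taken from the library source, the names are hypothetical, and any pixel-count limits the real smartResizeFactors may apply are left out:

```typescript
// Sketch only: the rounding rule is an assumption inferred from the reported
// V1.5 outputs. It is not copied from the smartResizeFactors implementation.
type DisplaySize = { width: number; height: number };

const ASSUMED_MULTIPLE = 28; // hypothetical constant name

function inferResizeFactors(size: DisplaySize): [number, number] {
  // Round each dimension to the nearest multiple of ASSUMED_MULTIPLE.
  const roundToMultiple = (v: number): number =>
    Math.round(v / ASSUMED_MULTIPLE) * ASSUMED_MULTIPLE;
  return [roundToMultiple(size.width), roundToMultiple(size.height)];
}

function mapWithInferredFactors(x: number, y: number, size: DisplaySize): [number, number] {
  // Normalize by the per-screen factors, then scale back to screen pixels.
  const [factorW, factorH] = inferResizeFactors(size);
  return [(x / factorW) * size.width, (y / factorH) * size.height];
}

const fhdDisplay: DisplaySize = { width: 1920, height: 1080 };
console.log(inferResizeFactors(fhdDisplay)); // [ 1932, 1092 ]

const [smartX, smartY] = mapWithInferredFactors(960, 540, fhdDisplay);
console.log(smartX.toFixed(3), smartY.toFixed(3)); // 954.037 534.066
```

Because the factors track the screen size, the remaining error is only the rounding difference between the screen dimension and its nearest multiple, which stays below 1% in the cases measured above.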
Key Findings
- Position matters in V1.0: Bottom-right corners show worse errors than center positions
- 4K displays are severely affected: V1.0 produces coordinates that are off by >140% horizontally
This behavior is described in the issues. It is exactly the problem I ran into when calling UI-TARS-SDK as shown in Basic Usage: https://github.com/bytedance/UI-TARS-desktop/tree/main/packages/ui-tars/sdk#basic-usage
ToDo
SDK users like me will no longer suffer from this problem once this PR is merged.
However, the issues were reported by UI-TARS-desktop users, so apps\ui-tars\src\main\services\runAgent.ts should be fixed as well.
I traced how UI-TARS-desktop calls GUIAgent.
What I thought
I suspect that UI-TARS-1.5 users forget or fail to set the environment variable
VLM_PROVIDER='Hugging Face for UI-TARS-1.5'
so the default UI-TARS version 1.0 is used, which causes the issues:
default:
return UITarsModelVersion.V1_0;
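The fallback can be reproduced in isolation. This is a sketch with local stand-in enums: the provider strings are copied from apps/ui-tars/src/main/store/types.ts, while `ProviderSketch`, `VersionSketch`, `normalizeProvider`, `pickVersion`, and the version string values are hypothetical placeholders rather than the real definitions:

```typescript
// Local stand-ins so the sketch runs without the UI-TARS packages.
enum ProviderSketch {
  ui_tars_1_0 = 'Hugging Face for UI-TARS-1.0',
  ui_tars_1_5 = 'Hugging Face for UI-TARS-1.5',
}

// Placeholder values: the real UITarsModelVersion values are not shown here.
enum VersionSketch {
  V1_0 = 'placeholder-1.0',
  V1_5 = 'placeholder-1.5',
}

// Mirrors `(env.vlmProvider as VLMProviderV2) || ''` from the default settings.
function normalizeProvider(raw: string | undefined): ProviderSketch | '' {
  return (raw as ProviderSketch) || '';
}

// Mirrors the switch in getModelVersion: anything unrecognized falls back to V1_0.
function pickVersion(provider: ProviderSketch | '' | undefined): VersionSketch {
  switch (provider) {
    case ProviderSketch.ui_tars_1_5:
      return VersionSketch.V1_5;
    case ProviderSketch.ui_tars_1_0:
      return VersionSketch.V1_0;
    default:
      return VersionSketch.V1_0;
  }
}

// VLM_PROVIDER unset: the provider becomes '' and the version silently becomes V1_0.
console.log(pickVersion(normalizeProvider(undefined)) === VersionSketch.V1_0); // true
// VLM_PROVIDER set as required for UI-TARS-1.5.
console.log(
  pickVersion(normalizeProvider('Hugging Face for UI-TARS-1.5')) === VersionSketch.V1_5,
); // true
```

The fallback is silent in this sketch, which is why a missing provider setting would surface only as wrong click coordinates.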
The trace
apps\ui-tars\src\main\services\runAgent.ts
import { StatusEnum, UITarsModelVersion } from '@ui-tars/shared/types';
import { GUIAgent, type GUIAgentConfig } from '@ui-tars/sdk';
import { SettingStore } from '@main/store/setting';
import {
AppState,
SearchEngineForSettings,
VLMProviderV2,
} from '@main/store/types';
const getModelVersion = (
provider: VLMProviderV2 | undefined,
): UITarsModelVersion => {
switch (provider) {
case VLMProviderV2.ui_tars_1_5:
return UITarsModelVersion.V1_5;
case VLMProviderV2.ui_tars_1_0:
return UITarsModelVersion.V1_0;
case VLMProviderV2.doubao_1_5:
return UITarsModelVersion.DOUBAO_1_5_15B;
case VLMProviderV2.doubao_1_5_vl:
return UITarsModelVersion.DOUBAO_1_5_20B;
default:
return UITarsModelVersion.V1_0;
// default version is V1.0, so UI-TARS-1.5 users must set UITarsModelVersion.V1_5 explicitly
}
};
export const runAgent = async (
setState: (state: AppState) => void,
getState: () => AppState,
) => {
const settings = SettingStore.getStore(); // modelVersion comes from settings
const modelVersion = getModelVersion(settings.vlmProvider);
//...
const guiAgent = new GUIAgent({
model: {
baseURL: settings.vlmBaseUrl,
apiKey: settings.vlmApiKey,
model: settings.vlmModelName,
},
systemPrompt: getSpByModelVersion(modelVersion),
logger,
signal: abortController?.signal,
operator: operator,
onData: handleData,
onError: (params) => {
const { error } = params;
logger.error('[onGUIAgentError]', settings, error);
//...
},
retry: {
//...
},
maxLoopCount: settings.maxLoopCount,
loopIntervalInMs: settings.loopIntervalInMs,
uiTarsVersion: modelVersion, // if UI-TARS-1.5 ⇒ must set: UITarsModelVersion.V1_5
});
};
apps/ui-tars/src/main/store/types.ts
export enum VLMProviderV2 {
ui_tars_1_0 = 'Hugging Face for UI-TARS-1.0',
ui_tars_1_5 = 'Hugging Face for UI-TARS-1.5',
doubao_1_5 = 'VolcEngine Ark for Doubao-1.5-UI-TARS',
doubao_1_5_vl = 'VolcEngine Ark for Doubao-1.5-thinking-vision-pro',
}
apps\ui-tars\src\main\store\setting.ts
import * as env from '@main/env';
import { LocalStore, SearchEngineForSettings, VLMProviderV2 } from './types';
export const DEFAULT_SETTING: LocalStore = {
language: 'en',
vlmProvider: (env.vlmProvider as VLMProviderV2) || '', // a VLMProviderV2 enum value, otherwise a blank string
vlmBaseUrl: env.vlmBaseUrl || '',
vlmApiKey: env.vlmApiKey || '',
vlmModelName: env.vlmModelName || '',
//...
};
export class SettingStore {
private static instance: ElectronStore<LocalStore>;
public static getInstance(): ElectronStore<LocalStore> {
if (!SettingStore.instance) {
SettingStore.instance = new ElectronStore<LocalStore>({
name: 'ui_tars.setting',
defaults: DEFAULT_SETTING,
});
SettingStore.instance.onDidAnyChange((newValue, oldValue) => {
//...
});
}
return SettingStore.instance;
}
public static get<K extends keyof LocalStore>(key: K): LocalStore[K] {
return SettingStore.getInstance().get(key);
}
public static getStore(): LocalStore {
return SettingStore.getInstance().store;
}
//... other methods
}
apps/ui-tars/src/main/env.ts
import os from 'node:os';
import dotenv from 'dotenv';
dotenv.config();
//...
export const vlmProvider = process.env.VLM_PROVIDER; // UI-TARS-1.5 user must set: 'Hugging Face for UI-TARS-1.5'
export const vlmBaseUrl = process.env.VLM_BASE_URL;
export const vlmApiKey = process.env.VLM_API_KEY;
export const vlmModelName = process.env.VLM_MODEL_NAME;
ToDo
- [x] Understand env settings via UI-TARS-desktop
New finding💡
Today I tested the latest version (v0.1.2+) of UI-TARS-desktop, deployed from source.
I opened the settings and selected Hugging Face for UI-TARS-1.5 in the pulldown list.
The coordinate issues (reported in v0.1.1) did not occur.
Conclusion
It seems no fix is needed in UI-TARS-desktop for the click(w,h) coordinate issues. This means only UI-TARS-SDK users still face this problem, because uiTarsVersion is missing from the Basic Usage example for GUIAgent. I hope this PR will be approved soon.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 6.85%. Comparing base (60fa69e) to head (a474a86). Report is 1 commit behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #645 +/- ##
========================================
+ Coverage 6.84% 6.85% +0.01%
========================================
Files 303 303
Lines 9838 9838
Branches 1921 1921
========================================
+ Hits 673 674 +1
+ Misses 9060 9059 -1
Partials 105 105
A very detailed analysis and explanation, thank you for your contribution.