Confused about ActionSpace for MobileUse

Open nordysu opened this issue 8 months ago • 7 comments

The mobile action space in README_v1.md is:

click(start_box='<|box_start|>(x1,y1)<|box_end|>')
long_press(start_box='<|box_start|>(x1,y1)<|box_end|>', time='')
type(content='')
scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
press_home()
press_back()
finished(content='') # Submit the task regardless of whether it succeeds or fails.

And the mobile action space in prompts.py is:

click(start_box='<|box_start|>(x1,y1)<|box_end|>')
long_press(start_box='<|box_start|>(x1,y1)<|box_end|>')
type(content='') #If you want to submit your input, use "\\n" at the end of `content`.
scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left')
open_app(app_name=\'\')
drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
press_home()
press_back()
finished(content='xxx') # Use escape characters \\', \\", and \\n in content part to ensure we can parse the content in normal python string format.
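
For reference, the differences between the two lists can be checked mechanically. This is only a sketch: the action templates are copied from the two lists above (inline comments and backslash escapes dropped), and `signature` is a hypothetical helper, not something taken from the UI-TARS repo.

```python
import re

# Action templates copied from the two lists above (inline comments dropped).
README_V1_MOBILE = [
    "click(start_box='<|box_start|>(x1,y1)<|box_end|>')",
    "long_press(start_box='<|box_start|>(x1,y1)<|box_end|>', time='')",
    "type(content='')",
    "scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')",
    "press_home()",
    "press_back()",
    "finished(content='')",
]
PROMPTS_PY_MOBILE = [
    "click(start_box='<|box_start|>(x1,y1)<|box_end|>')",
    "long_press(start_box='<|box_start|>(x1,y1)<|box_end|>')",
    "type(content='')",
    "scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left')",
    "open_app(app_name='')",
    "drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')",
    "press_home()",
    "press_back()",
    "finished(content='xxx')",
]


def signature(template):
    """Hypothetical helper: return (action name, tuple of parameter names)."""
    name = re.match(r"(\w+)\(", template).group(1)
    params = tuple(re.findall(r"(\w+)=", template))
    return name, params


old = dict(signature(t) for t in README_V1_MOBILE)
new = dict(signature(t) for t in PROMPTS_PY_MOBILE)

added = sorted(new.keys() - old.keys())
removed = sorted(old.keys() - new.keys())
changed = {n: (old[n], new[n]) for n in old.keys() & new.keys() if old[n] != new[n]}

print("added:", added)
print("removed:", removed)
for name, (before, after) in sorted(changed.items()):
    print(f"changed: {name} {before} -> {after}")
```

On these two lists it reports drag and open_app as new actions, and long_press and scroll as actions whose parameters differ.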

I'm confused about the scroll and drag actions in prompts.py, because in the previous README_v1.md these two actions belonged to the computer action space.

Is that a mistake?

nordysu avatar Apr 18 '25 09:04 nordysu

That is correct. In the training of UI-TARS-1.5, we have optimized the action space for mobile scenarios, and you can directly use the latest prompt.

JjjFangg avatar Apr 21 '25 05:04 JjjFangg

> That is correct. In the training of UI-TARS-1.5, we have optimized the action space for mobile scenarios, and you can directly use the latest prompt.

What's the difference between scroll and drag, and how should the drag action be implemented?

Coke-2 avatar Apr 21 '25 12:04 Coke-2

> That is correct. In the training of UI-TARS-1.5, we have optimized the action space for mobile scenarios, and you can directly use the latest prompt.

So, in which scenarios can we expect scroll, and in which can we expect drag?

After all, a drag action can do the same thing as a scroll action.

nordysu avatar Apr 22 '25 08:04 nordysu

Also, for open_app(), what should we expect app_name to be? Is it the Android package name?

nordysu avatar Apr 22 '25 09:04 nordysu

> > That is correct. In the training of UI-TARS-1.5, we have optimized the action space for mobile scenarios, and you can directly use the latest prompt.

> So, in which scenarios can we expect scroll, and in which can we expect drag?

> After all, a drag action can do the same thing as a scroll action.

@nordysu you can check the source code here for the definitions of drag and scroll for mobile phones

JH-ninjatech avatar May 06 '25 01:05 JH-ninjatech

In this reply, scrolling on a mobile phone is defined the opposite way: when the direction is up, it means scrolling down, and the y-value increases. So which definition should we follow? https://github.com/bytedance/UI-TARS/issues/129#issuecomment-2817688473
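
To make the ambiguity concrete, here is a minimal sketch of how a direction-based scroll could be turned into swipe endpoints under either convention. Everything in it is an assumption for illustration: `scroll_to_swipe`, the `distance` default, and the `invert` flag are hypothetical and not taken from the UI-TARS source, and the thread does not settle which convention the model was trained with.

```python
from typing import Tuple

Point = Tuple[int, int]

# Finger movement per direction in screen coordinates
# (origin at the top-left corner, y grows downward).
_FINGER_DELTAS = {
    "up": (0, -1),
    "down": (0, 1),
    "left": (-1, 0),
    "right": (1, 0),
}


def scroll_to_swipe(
    start: Point,
    direction: str,
    distance: int = 300,  # assumed swipe length in pixels
    invert: bool = False,
) -> Tuple[Point, Point]:
    """Return (start, end) points for a swipe that realizes a scroll.

    invert=False: `direction` is read as the direction the finger moves.
    invert=True:  `direction` is read the opposite way, so "up" moves the
                  finger toward larger y values.
    """
    dx, dy = _FINGER_DELTAS[direction]
    if invert:
        dx, dy = -dx, -dy
    end = (start[0] + dx * distance, start[1] + dy * distance)
    return start, end


print(scroll_to_swipe((500, 1000), "up"))
print(scroll_to_swipe((500, 1000), "up", invert=True))
```

With the same input, the two conventions produce end points on opposite sides of the start point, which is why the definition needs to be pinned down before executing the model's scroll output. A drag, by contrast, already carries both end points, so it needs no such convention.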

Coke-2 avatar May 06 '25 02:05 Coke-2

> > > That is correct. In the training of UI-TARS-1.5, we have optimized the action space for mobile scenarios, and you can directly use the latest prompt.

> > So, in which scenarios can we expect scroll, and in which can we expect drag? After all, a drag action can do the same thing as a scroll action.

> @nordysu you can check the source code here for the definitions of drag and scroll for mobile phones

Got it. drag and scroll can do the same thing in different ways.

For the open_app action, I sometimes get a Chinese app name like ‘淘宝’ (Taobao) rather than an Android package name and activity. Did you train the model to return Android package names?
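
Until that is answered, one possible workaround on the caller's side is to resolve whatever app_name comes back into a package name before launching. The sketch below is an assumption, not UI-TARS behavior: `resolve_app_name` and the lookup table are hypothetical, and the table holds a single illustrative entry that would have to be replaced with one built for the target device.

```python
import re
from typing import Optional

# Illustrative table only: display name -> Android package name.
# A real setup would need its own table, for example built from the
# apps installed on the device.
APP_NAME_TO_PACKAGE = {
    "淘宝": "com.taobao.taobao",
    "Taobao": "com.taobao.taobao",
}

# Rough shape of a package name: dot-separated identifiers, at least two parts.
_PACKAGE_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]*(\.[A-Za-z][A-Za-z0-9_]*)+")


def resolve_app_name(app_name: str) -> Optional[str]:
    """Return a package name for the model's app_name, or None if unknown."""
    name = app_name.strip()
    if _PACKAGE_RE.fullmatch(name):
        # The model already returned something that looks like a package name.
        return name
    return APP_NAME_TO_PACKAGE.get(name)


print(resolve_app_name("淘宝"))
print(resolve_app_name("com.example.app"))
print(resolve_app_name("Unknown App"))
```

Returning None for unknown names lets the caller fall back to another strategy, such as opening the launcher and tapping the app icon.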

nordysu avatar May 11 '25 10:05 nordysu