Confused about the action space for Mobile Use
The action space for Mobile in README_v1.md is:
click(start_box='<|box_start|>(x1,y1)<|box_end|>')
long_press(start_box='<|box_start|>(x1,y1)<|box_end|>', time='')
type(content='')
scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
press_home()
press_back()
finished(content='') # Submit the task regardless of whether it succeeds or fails.
And the action space for Mobile in prompts.py is:
click(start_box='<|box_start|>(x1,y1)<|box_end|>')
long_press(start_box='<|box_start|>(x1,y1)<|box_end|>')
type(content='') #If you want to submit your input, use "\\n" at the end of `content`.
scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left')
open_app(app_name=\'\')
drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
press_home()
press_back()
finished(content='xxx') # Use escape characters \\', \\", and \\n in content part to ensure we can parse the content in normal python string format.
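For anyone comparing the two lists, here is a minimal sketch of how an action string in this format could be split into a name and arguments. It is not the parser shipped with UI-TARS; `parse_action` and the regular expressions are hypothetical names introduced for illustration, and the sketch assumes every argument is a single-quoted string and every box holds one integer `(x,y)` pair.

```python
import re

# Hypothetical helper, not part of the UI-TARS code base.
ACTION_RE = re.compile(r"^\s*(\w+)\((.*)\)\s*$", re.DOTALL)
ARG_RE = re.compile(r"(\w+)='((?:\\.|[^'\\])*)'")
BOX_RE = re.compile(r"\((\d+)\s*,\s*(\d+)\)")


def parse_action(text):
    """Split an action string into (name, args).

    Arguments whose name ends in "_box" become (x, y) integer tuples;
    all other arguments are kept as raw strings, escapes included.
    """
    match = ACTION_RE.match(text)
    if match is None:
        raise ValueError(f"not an action call: {text!r}")
    name, arg_text = match.groups()
    args = {}
    for key, raw in ARG_RE.findall(arg_text):
        if key.endswith("_box"):
            box = BOX_RE.search(raw)
            if box is None:
                raise ValueError(f"no (x,y) pair in {key}: {raw!r}")
            args[key] = (int(box.group(1)), int(box.group(2)))
        else:
            args[key] = raw
    return name, args


name, args = parse_action(
    "scroll(start_box='<|box_start|>(100,200)<|box_end|>', direction='down')"
)
print(name, args)
```

With a parser like this, the difference between the two action spaces shows up as different argument names for `scroll` (`end_box` in the README, `direction` in the prompt file).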
I am confused about the scroll and drag actions in prompt.py, because in the previous README_v1.md these two actions belonged to the computer action space.
Is that a mistake?
That is correct. In the training of UI-TARS-1.5, we have optimized the action space for mobile scenarios, and you can directly use the latest prompt.
What's the difference between scroll and drag, and how is the drag action performed?
So, in which scenarios should we expect scroll, and in which should we expect drag?
I ask because a drag action can do the same thing as a scroll action.
One more question: for open_app(), what kind of app_name should we expect? Is app_name the Android package name?
@nordysu you can check the source code here for the definitions of drag and scroll for mobile phones
In this reply, scrolling on a mobile phone is defined the opposite way: when the direction is up, it means scrolling down, and the y-value increases. Which definition should we follow? https://github.com/bytedance/UI-TARS/issues/129#issuecomment-2817688473
Got it. drag and scroll can do the same thing, in different ways.
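To make the relationship concrete, here is a rough sketch of how a direction-based scroll could be rewritten as an equivalent drag with explicit start and end points. This is not the UI-TARS implementation: `scroll_to_drag`, the `distance` default of 300 pixels, and the `invert` flag are all assumptions made for illustration. The sketch assumes screen coordinates where y grows downward, and it leaves the direction convention as a switch because this thread has not settled which one the model uses.

```python
# Hypothetical helper, not part of the UI-TARS code base.
# Unit offsets in screen coordinates (y grows downward).
OFFSETS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


def scroll_to_drag(start, direction, distance=300, invert=False):
    """Return (start, end) points of a drag that mimics a scroll.

    With invert=False the end point lies `distance` pixels away from
    `start` in the named direction. With invert=True it lies in the
    opposite direction, which covers the other convention discussed
    in this thread.
    """
    dx, dy = OFFSETS[direction]
    if invert:
        dx, dy = -dx, -dy
    x, y = start
    return (x, y), (x + dx * distance, y + dy * distance)


print(scroll_to_drag((500, 800), "up"))
print(scroll_to_drag((500, 800), "up", invert=True))
```

Which value of `invert` matches the model's output should be checked against the linked source code and real predictions rather than taken from this sketch.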
For the open_app action, I sometimes get a Chinese name such as "淘宝" (Taobao) instead of an Android package name and activity. Did you train the model to return the Android package name?
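Until that is answered, one possible workaround is to resolve whatever open_app returns on the client side. The sketch below is only an illustration: `resolve_app_name` and the `APP_PACKAGES` table are hypothetical, and the package name in the table is an example entry that should be verified on the target device.

```python
# Hypothetical lookup table; entries are examples and must be verified
# against the packages actually installed on the device.
APP_PACKAGES = {
    "淘宝": "com.taobao.taobao",
    "taobao": "com.taobao.taobao",
}


def resolve_app_name(app_name):
    """Map the model's app_name to a package name, or return None.

    A value that already looks like a package name (contains a dot and
    no spaces) is passed through unchanged.
    """
    cleaned = app_name.strip()
    if "." in cleaned and " " not in cleaned:
        return cleaned
    return APP_PACKAGES.get(cleaned.lower())


print(resolve_app_name("淘宝"))
print(resolve_app_name("com.example.app"))
```

Returning None for unknown names lets the caller fall back to another strategy, for example opening the app from the launcher by its displayed label.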