
Why is the predicted scroll direction always opposite to the instruction when testing on android_control?

manmushanhe opened this issue 9 months ago • 4 comments

instructions

["Click on the Romanticism art", "Swipe up and learn more about Romanticism art", "Swipe up and learn more about Romanticism art", "Swipe up and learn more about Romanticism art", "Swipe up and learn more about Romanticism art"]

result

[["Action: click(start_box='(259,314)')"], ["Action: scroll(direction='down')"], ["Action: scroll(direction='down')"], ["Action: scroll(direction='down')"], ["Action: scroll(direction='down')"]]

prompts

"""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task. ## Output Format

Action: ...

## Action Space {action_space}

## User Instruction {instruction} """

action_space

""" click(start_box='<|box_start|>(x1,y1)<|box_end|>') long_press(start_box='<|box_start|>(x1,y1)<|box_end|>', time='') type(content='') scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left') press_back() wait() #Sleep for 5s and take a screenshot to check for any changes. """

manmushanhe · Mar 03 '25 07:03

We recommend trying the following prompt format. When providing the Thought, use the format shown in the prompt to guide the model in predicting the Action (e.g., Thought: Click on the Romanticism art\nAction: ...). Additionally, we suggest conducting the mobile-scenario experiments on the SFT version of the model.

You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.

Output Format

Thought: ...
Action: ...

Action Space

click(start_box='[x1, y1, x2, y2]')
long_press(start_box='[x1, y1, x2, y2]', time='')
type(content='')
scroll(direction='down or up or right or left')
open_app(app_name='')
press_back()
press_home()
wait()
finished() # Submit the task regardless of whether it succeeds or fails.

Note

  • Use English in Thought part.
  • Summarize your next action (with its target element) in one sentence in Thought part.

User Instruction

Make the Copy of Office Pic in the Drive app

By structuring the prompt in this way, the model can better understand the formatting requirements and predict actions more effectively. Let us know if you have any further questions! 🚀
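
For illustration only (not our official parsing code), a reply in this format can be split back into its Thought and Action parts along these lines:

```python
import re
from typing import Optional, Tuple

def split_thought_action(reply: str) -> Tuple[Optional[str], Optional[str]]:
    """Split a reply of the form "Thought: ...\nAction: ..." into its two fields.
    Minimal sketch; a production parser may be stricter about the format."""
    thought = re.search(r"Thought:\s*(.*?)\s*(?=Action:)", reply, re.S)
    action = re.search(r"Action:\s*(.*)", reply, re.S)
    return (
        thought.group(1).strip() if thought else None,
        action.group(1).strip() if action else None,
    )

reply = (
    "Thought: Scroll down to reveal more information about Romanticism art.\n"
    "Action: scroll(direction='down')"
)
print(split_thought_action(reply))
```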

JjjFangg · Mar 04 '25 02:03

I have a question: are the direction of the instruction and the direction of the label opposite during training? That is, if the instruction says to scroll up, does the model actually output the opposite direction?

manmushanhe · Mar 05 '25 03:03

@JjjFangg In the paper, it is mentioned that UI-TARS integrates multiple existing datasets (such as MM-Mind2Web, GUIAct, AITW, AITZ, AndroidControl, GUI-Odyssey, AMEX, etc.) and standardizes their action spaces into a unified format. I have two questions regarding this process:

  1. Action Space Unification Method:

    • How exactly are the action representations from these different datasets unified into UI-TARS's action space? For example, different datasets may have different action definitions and parameter formats. How does UI-TARS handle these differences?
    • Is there a detailed mapping rule or code snippet that can be referenced? (A rough sketch of the kind of thing I mean is included at the end of this comment.)
  2. Dataset Splitting:

    • During the training of UI-TARS, were the test set data from these open-source datasets also included in the training set? If so, this could potentially affect the model's evaluation results.
    • Can you clarify how the training and test sets of these datasets are split, and whether test set data was used during training?

Suggestions:

  • Provide a detailed document or code snippet explaining how the action spaces from different datasets are unified into UI-TARS's action space.
  • Clarify the dataset splitting situation, especially whether test set data was used in the training process.
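
To make the first question concrete, the kind of mapping rule I have in mind would look roughly like this (purely hypothetical; the field names and rules below are assumptions for illustration, not taken from the UI-TARS codebase):

```python
# Purely hypothetical sketch of a per-dataset mapping rule; the field names are
# assumed for illustration and are NOT from the UI-TARS codebase.
def unify_androidcontrol_action(raw: dict) -> str:
    """Map a simplified AndroidControl-style action dict to a UI-TARS-style string."""
    action_type = raw.get("action_type")
    if action_type == "click":
        return f"click(start_box='({raw['x']},{raw['y']})')"
    if action_type == "scroll":
        # One point such a rule must pin down: does "direction" mean the finger
        # gesture direction or the direction the content moves?
        return f"scroll(direction='{raw['direction']}')"
    if action_type == "input_text":
        return f"type(content='{raw['text']}')"
    if action_type == "navigate_back":
        return "press_back()"
    raise ValueError(f"Unhandled action type: {action_type}")

print(unify_androidcontrol_action({"action_type": "scroll", "direction": "down"}))
```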

lgy0404 · Mar 13 '25 03:03

@JjjFangg I met a similar issue: the model can't follow a "swipe right" instruction. I submitted it as https://github.com/bytedance/UI-TARS/issues/103. Following your comment "Additionally, we suggest conducting mobile scenario experiments on the SFT version of the model," I will try the SFT version. Could you please let us know the difference between the two models?

TSHunterY · Apr 02 '25 11:04