UI-TARS icon indicating copy to clipboard operation
UI-TARS copied to clipboard

android control test question

Open manmushanhe opened this issue 9 months ago • 7 comments

在使用7B和2B模型进行android control进行测试时,SR分别是82.7%和77.3%,与论文中的90.8%和89.3%相差较多。尝试按照提供的osworld脚本加入history,效果也没有达到论文中的指标,请问可以公开一下android control的具体测试过程吗。 When using the 7B and 2B models for android control testing, the SR is 82.7% and 77.3% respectively, which is much different from the 90.8% and 89.3% in the paper. I tried to add history according to the osworld script provided, but the effect did not meet the indicators in the paper. Could you please disclose the specific testing process of android control?

manmushanhe avatar Mar 06 '25 09:03 manmushanhe

We followed the settings from OS-Atlas, and a predicted point within 14% of the screen distance from the ground truth (GT) is considered correct. For low-level evaluation, we provide ground truth thought for model to predict action.

JjjFangg avatar Mar 07 '25 03:03 JjjFangg

Does the ground truth thought refer to low level instrcution and action labels? Do you add a low level directive and an action label to history?

manmushanhe avatar Mar 07 '25 03:03 manmushanhe

For high-level tasks, the model predicts both "Thought: ..." and "Action: ...". For low-level tasks, the "Thought: ..." part is provided, and the model only needs to predict "Action: ...".

JjjFangg avatar Mar 07 '25 04:03 JjjFangg

You're talking about adding thought to history as part of the model response, using the low level instruction as thought. But in the prompt format, output only action and not thought?

manmushanhe avatar Mar 07 '25 06:03 manmushanhe

Here are examples of both low-level and high-level cases (where the model predicts the answer for the last turn).

low-level: [ { "role": "system", "content": "You are a helpful assistant.", }, { "role": "user", "content": "You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task. \n\n## Output Format\n\nThought: ...\nAction: ...\n\n\n## Action Space\nclick(start_box='<|box_start|>(x1,y1)<|box_end|>')\nlong_press(start_box='<|box_start|>(x1,y1)<|box_end|>', time='')\ntype(content='')\nscroll(direction='down or up or right or left')\nopen_app(app_name='')\npress_back()\npress_home()\nwait()\nfinished() # Submit the task regardless of whether it succeeds or fails.\n\n## Note\n- Use English in Thought part.\n\n- Summarize your next action (with its target element) in one sentence in Thought part.\n\n## User Instruction\nMake the Copy of Office Pic in the Drive app" }, { "role": "user", "content": "<|vision_start|><|image_pad|>...<|image_pad|><|vision_end|>" }, { "role": "assistant", "content": "Thought: Click on the three dots button of the office pic\nAction: click(start_box='<|box_start|>(66,851),(66,851)<|box_end|>')" }, { "role": "user", "content": "<|vision_start|><|image_pad|>...<|image_pad|><|vision_end|>" }, { "role": "assistant", "content": "Thought: Click on Make a copy option\nAction:" } ]

high-level: [ { "role": "system", "content": "You are a helpful assistant.", }, { "role": "user", "content": "You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task. \n\n## Output Format\n\nThought: ...\nAction: ...\n\n\n## Action Space\nclick(start_box='<|box_start|>(x1,y1)<|box_end|>')\nlong_press(start_box='<|box_start|>(x1,y1)<|box_end|>', time='')\ntype(content='')\nscroll(direction='down or up or right or left')\nopen_app(app_name='')\npress_back()\npress_home()\nwait()\nfinished() # Submit the task regardless of whether it succeeds or fails.\n\n## Note\n- Use English in Thought part.\n\n- Summarize your next action (with its target element) in one sentence in Thought part.\n\n## User Instruction\nMake the Copy of Office Pic in the Drive app" }, { "role": "user", "content": "<|vision_start|><|image_pad|>...<|image_pad|><|vision_end|>" }, { "role": "assistant", "content": "Thought: Click on the three dots button of the office pic\nAction: click(start_box='<|box_start|>(66,851),(66,851)<|box_end|>')" }, { "role": "user", "content": "<|vision_start|><|image_pad|>...<|image_pad|><|vision_end|>" }, { "role": "assistant", "content": "" } ]

JjjFangg avatar Mar 09 '25 04:03 JjjFangg

For high-level tasks, the model predicts both "Thought: ..." and "Action: ...". For low-level tasks, the "Thought: ..." part is provided, and the model only needs to predict "Action: ...". 请问中间的这种轮空对话如何理解呢? { "role": "user", "content": "" }, { "role": "assistant", "content": "" } 这是数据构造的问题吗

MickeyFei avatar Mar 13 '25 12:03 MickeyFei

For high-level tasks, the model predicts both "Thought: ..." and "Action: ...". For low-level tasks, the "Thought: ..." part is provided, and the model only needs to predict "Action: ...". 请问中间的这种轮空对话如何理解呢? { "role": "user", "content": "" }, { "role": "assistant", "content": "" } 这是数据构造的问题吗

The plain content from the user consists of image tokens, which are not displayed in the comment.

JjjFangg avatar Mar 24 '25 12:03 JjjFangg

在使用7B和2B模型进行android control进行测试时,SR分别是82.7%和77.3%,与论文中的90.8%和89.3%相差较多。尝试按照提供的osworld脚本加入history,效果也没有达到论文中的指标,请问可以公开一下android control的具体测试过程吗。 When using the 7B and 2B models for android control testing, the SR is 82.7% and 77.3% respectively, which is much different from the 90.8% and 89.3% in the paper. I tried to add history according to the osworld script provided, but the effect did not meet the indicators in the paper. Could you please disclose the specific testing process of android control?

@manmushanhe Have you finally managed to reproduce the results in the paper? Actually, the metrics I obtained are even lower than yours 82.7% and 77.3%. Could you share your test script file? Thank you!

Shaokang-Agent avatar May 20 '25 07:05 Shaokang-Agent

@manmushanhe Hello, I also encountered some problems when reproducing the sales results of this paper. Can you share your test code? Thank you very much.

JYX1216 avatar Jul 14 '25 08:07 JYX1216

@JYX1216 Please wait a moment. I will make a repository for reproduction public.

manmushanhe avatar Jul 16 '25 01:07 manmushanhe

@Shaokang-Agent @JYX1216 Open-GUI-Agent

manmushanhe avatar Jul 16 '25 02:07 manmushanhe

@manmushanhe Thank you very much for sharing the test code. It will be very helpful for me.

JYX1216 avatar Jul 16 '25 02:07 JYX1216

@Shaokang-Agent @JYX1216 Open-GUI-Agent

Hello, your work is very helpful, but the repository link you shared is no longer valid. Could you please share it again?

little-pikachu avatar Aug 14 '25 13:08 little-pikachu