AndroidControl test question
When testing the 7B and 2B models on AndroidControl, we get SRs of 82.7% and 77.3%, respectively, which are well below the 90.8% and 89.3% reported in the paper. I also tried adding history following the provided OSWorld script, but the results still did not reach the paper's numbers. Could you please disclose the exact AndroidControl evaluation procedure?
We followed the settings from OS-Atlas: a predicted point within 14% of the screen distance from the ground truth (GT) is considered correct. For low-level evaluation, we provide the ground-truth thought for the model to predict the action.
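For concreteness, here is a minimal sketch of that matching rule (assuming coordinates normalized to [0, 1000] as in the examples below, and Euclidean distance as the screen distance; the official evaluation script may define the distance slightly differently):

```python
import math

def click_is_correct(pred, gt, tol=0.14, scale=1000.0):
    """Return True if the predicted point lies within `tol` of the screen
    distance from the ground truth, with coordinates in [0, scale]."""
    dx = (pred[0] - gt[0]) / scale
    dy = (pred[1] - gt[1]) / scale
    return math.hypot(dx, dy) <= tol

# A prediction 50 units off on one axis (5% of the screen) still counts.
assert click_is_correct((66, 851), (116, 851))
```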
Does the ground-truth thought refer to the low-level instruction and action labels? Do you add a low-level instruction and an action label to the history?
For high-level tasks, the model predicts both "Thought: ..." and "Action: ...". For low-level tasks, the "Thought: ..." part is provided, and the model only needs to predict "Action: ...".
So you are adding the thought to the history as part of the model response, using the low-level instruction as the thought. But in the prompt format, does the model output only the action and not the thought?
Here are examples of both low-level and high-level cases (where the model predicts the answer for the last turn).
low-level:
[
{
"role": "system",
"content": "You are a helpful assistant.",
},
{
"role": "user",
"content": "You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task. \n\n## Output Format\n\nThought: ...\nAction: ...\n\n\n## Action Space\nclick(start_box='<|box_start|>(x1,y1)<|box_end|>')\nlong_press(start_box='<|box_start|>(x1,y1)<|box_end|>', time='')\ntype(content='')\nscroll(direction='down or up or right or left')\nopen_app(app_name='')\npress_back()\npress_home()\nwait()\nfinished() # Submit the task regardless of whether it succeeds or fails.\n\n## Note\n- Use English in Thought part.\n\n- Summarize your next action (with its target element) in one sentence in Thought part.\n\n## User Instruction\nMake the Copy of Office Pic in the Drive app"
},
{
"role": "user",
"content": "<|vision_start|><|image_pad|>...<|image_pad|><|vision_end|>"
},
{
"role": "assistant",
"content": "Thought: Click on the three dots button of the office pic\nAction: click(start_box='<|box_start|>(66,851),(66,851)<|box_end|>')"
},
{
"role": "user",
"content": "<|vision_start|><|image_pad|>...<|image_pad|><|vision_end|>"
},
{
"role": "assistant",
"content": "Thought: Click on Make a copy option\nAction:"
}
]
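One way to evaluate this low-level case is to leave the final assistant turn open, so the provided "Thought: ...\nAction:" acts as a forced prefix and the model generates only the action. Here is a sketch with the transformers chat template, not the authors' confirmed script (the checkpoint path and example file name are placeholders; `continue_final_message` is available in recent transformers versions):

```python
import json
from transformers import AutoProcessor

# Placeholder path; substitute the actual 2B/7B checkpoint.
processor = AutoProcessor.from_pretrained("<model-path>")

# The low-level conversation shown above, loaded from a hypothetical file.
with open("low_level_example.json") as f:
    messages = json.load(f)

prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    continue_final_message=True,  # keep the final assistant turn open
)
# Generating from `prompt` makes the model emit only the "Action: ..."
# continuation, which is then compared against the ground-truth action.
```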
high-level:
[
{
"role": "system",
"content": "You are a helpful assistant.",
},
{
"role": "user",
"content": "You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task. \n\n## Output Format\n\nThought: ...\nAction: ...\n\n\n## Action Space\nclick(start_box='<|box_start|>(x1,y1)<|box_end|>')\nlong_press(start_box='<|box_start|>(x1,y1)<|box_end|>', time='')\ntype(content='')\nscroll(direction='down or up or right or left')\nopen_app(app_name='')\npress_back()\npress_home()\nwait()\nfinished() # Submit the task regardless of whether it succeeds or fails.\n\n## Note\n- Use English in Thought part.\n\n- Summarize your next action (with its target element) in one sentence in Thought part.\n\n## User Instruction\nMake the Copy of Office Pic in the Drive app"
},
{
"role": "user",
"content": "<|vision_start|><|image_pad|>...<|image_pad|><|vision_end|>"
},
{
"role": "assistant",
"content": "Thought: Click on the three dots button of the office pic\nAction: click(start_box='<|box_start|>(66,851),(66,851)<|box_end|>')"
},
{
"role": "user",
"content": "<|vision_start|><|image_pad|>...<|image_pad|><|vision_end|>"
},
{
"role": "assistant",
"content": ""
}
]
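For this high-level case, the final assistant turn is empty, so the model generates the full "Thought: ...\nAction: ..." response, which then has to be split before scoring. A small parsing sketch (the regex and helper name are my own, not from the official evaluation code):

```python
import re

def parse_response(text):
    """Split a 'Thought: ... / Action: ...' response into its two parts."""
    match = re.search(r"Thought:\s*(.*?)\s*Action:\s*(.*)", text, re.S)
    if match is None:
        return None, None
    return match.group(1).strip(), match.group(2).strip()

thought, action = parse_response(
    "Thought: Click on the three dots button of the office pic\n"
    "Action: click(start_box='<|box_start|>(66,851),(66,851)<|box_end|>')"
)
```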
> For high-level tasks, the model predicts both "Thought: ..." and "Action: ...". For low-level tasks, the "Thought: ..." part is provided, and the model only needs to predict "Action: ...".

How should I understand the empty turns in the middle, i.e. { "role": "user", "content": "" } followed by { "role": "assistant", "content": "" }? Is this a data-construction issue?
The apparently empty user content actually consists of image tokens, which are simply not displayed in the comment.
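In other words, each screenshot is passed to the processor as an image, and the chat template expands it into the <|vision_start|><|image_pad|>...<|vision_end|> sequence shown in the examples. A rough sketch of how such a user turn might be built, assuming a Qwen2-VL-style message schema (the file name is hypothetical):

```python
from PIL import Image

messages = []  # the running conversation built up so far
screenshot = Image.open("step_1.png")  # hypothetical screenshot path

# Each screenshot is appended as its own user turn; applying the chat
# template later expands the image entry into the
# <|vision_start|><|image_pad|>...<|vision_end|> token sequence.
messages.append({
    "role": "user",
    "content": [{"type": "image", "image": screenshot}],
})
```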
@manmushanhe Have you finally managed to reproduce the results in the paper? Actually, the metrics I obtained are even lower than your 82.7% and 77.3%. Could you share your test script file? Thank you!
@manmushanhe Hello, I also ran into some problems when reproducing the results of this paper. Could you share your test code? Thank you very much.
@JYX1216 Please wait a moment. I will make a reproduction repository public.
@Shaokang-Agent @JYX1216 Open-GUI-Agent
@manmushanhe Thank you very much for sharing the test code. It will be very helpful for me.
Hello, your work is very helpful, but the repository link you shared is no longer valid. Could you please share it again?