Is Android World evaluated the same way as OS World SHOWED IN osworld.py?
Is Android World evaluated the same way as OS World?
I have tested UI-TARS-7B-DPO and UI-TARS-72B-DPO on Android World, and they scored 26.7 and 35.7, respectively, compared with 33 and 46.6 in the paper.
my test tips: i. i use Huawei Ascend NPU ii. 4 historical images + 1 current image (all uncompressed) iii. All historical actions and thoughts
Moreover, i used the same way to test UI-TARS-1.5-7B, the score is only 16.9. Then I used two tricks to get the score to 28.4(ALSO FAR FROM THE LASTED 64.2 SCORE!!!): i. Modify the prompt to srocll down to guide the search for the app list. ii. Turn off the gear logo at the main page header.
Is this normal?
Hi, I also tested UI-TARS-1.5-7B on the Android World task set (android) and achieved a score of 7%. Using the same setup, UI-TARS-7B-SFT reached 30%.
My configuration:
- Resized input image to a maximum of 720×28×28.
- 15 steps of full history (actions, thoughts, and screenshots).
- Prompt
"""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.
## Output Format
Thought: ... Action: ...
## Action Space
click(start_box='<|box_start|>(x1,y1)<|box_end|>')
long_press(start_box='<|box_start|>(x1,y1)<|box_end|>', time='')
type(content='') #If you want to submit your input, use "\\n" at the end of `content`.
scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
press_home()
press_back()
open_app(content='') # Open an app specified by `content`.
finished(content='') # Submit the task regardless of whether it succeeds or fails.
## Note
- Use English in `Thought` and `Action` part.
- Write a small plan and finally summarize your next action (with its target element) in one sentence in `Thought` part.
## User Instruction
{instruction}
"""
Could you share the prompt details used in your setup?
Hello, I am preparing to conduct tests on Android World and would like to ask a question. For online benchmarks like Android World, if my GUI Agent model is deployed on a remote server that does not have a graphical user interface and can only be accessed via the command line, is it possible to run the tests? If not, could you please share what kind of devices you used for testing on Android World?
Hi, I also tested UI-TARS-1.5-7B on the Android World task set (android) and achieved a score of 7%. Using the same setup, UI-TARS-7B-SFT reached 30%.
My configuration:
- Resized input image to a maximum of 720×28×28.
- 15 steps of full history (actions, thoughts, and screenshots).
- Prompt
"""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task. ## Output FormatThought: ... Action: ...
## Action Space click(start_box='<|box_start|>(x1,y1)<|box_end|>') long_press(start_box='<|box_start|>(x1,y1)<|box_end|>', time='') type(content='') #If you want to submit your input, use "\\n" at the end of `content`. scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>') press_home() press_back() open_app(content='') # Open an app specified by `content`. finished(content='') # Submit the task regardless of whether it succeeds or fails. ## Note - Use English in `Thought` and `Action` part. - Write a small plan and finally summarize your next action (with its target element) in one sentence in `Thought` part. ## User Instruction {instruction} """Could you share the prompt details used in your setup?
My implementation also only produce 7% success rate using UI-TARS-1.5-7B following the provided prompt template
Hi @icarus0309, can you elaborate in this please? i. Modify the prompt to srocll down to guide the search for the app list. ii. Turn off the gear logo at the main page header.
Thank you!