
Prompt format for multi-step setup

Mayer123 opened this issue 11 months ago • 8 comments

Hi there,

Congratulations on the great work! I'm curious how one should format the prompt for agent evaluation, i.e. when there are multiple turns of user-provided observations and agent actions. I tried the format below and tested a few tasks on OSWorld, but the results don't look good. PROMPT_FOR_COMPUTER is just the prompt provided in the readme. Basically, I only used the most recent screenshot and condensed all history actions into the user turn.

previous_actions = "\n".join([f"Step {i+1}: {action}" for i, action in enumerate(self.actions)]) if self.actions else "None"
messages = []
messages.append({
    "role": "system",
    "content": [{"type": "text", "text": "You are a helpful assistant."}]
})
messages.append({
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": PROMPT_FOR_COMPUTER + f"{instruction}\nPrevious Actions:\n{previous_actions}" )
        },
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{encode_image(obs['screenshot'])}"}
        }
    ],
})

Could you please share some insights here? Thank you!

Mayer123 avatar Jan 23 '25 07:01 Mayer123

Congrats on the great work and thanks for the comments. When trying the prompt format above, the 72B DPO model complains that "More than 1 image is unsupported". Could you kindly comment on this?

llajan avatar Jan 23 '25 19:01 llajan

Hi @llajan

Did you try:

python -m vllm.entrypoints.openai.api_server --served-model-name ui-tars --model <path to your model> --limit-mm-per-prompt image=5 -tp <tp>

from: https://github.com/bytedance/UI-TARS?tab=readme-ov-file#start-an-openai-api-service
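
With that flag set, requests containing several screenshots should go through. A minimal client-side sketch, assuming the server above runs on localhost:8000 and that step0.png/step1.png are placeholder screenshot files (not from the README):

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="empty")  # assumed host/port

def encode_image(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

content = [{"type": "text", "text": "Task instruction goes here."}]
for path in ["step0.png", "step1.png"]:  # hypothetical screenshot files
    content.append({
        "type": "image_url",
        "image_url": {"url": f"data:image/png;base64,{encode_image(path)}"},
    })

response = client.chat.completions.create(
    model="ui-tars",  # must match --served-model-name
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)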

korbinian-hoermann avatar Jan 26 '25 17:01 korbinian-hoermann

That seems to do the job. Thank you!

llajan avatar Jan 27 '25 14:01 llajan

Hi @pooruss,

thank you for the pseudocode. I followed it to construct my messages. The first one is always PROMPT_FOR_COMPUTER + instruction. This is followed by the history (previous screenshots and their corresponding actions). The last entry is the most recent screenshot.
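
For reference, here is a minimal sketch of that ordering (PROMPT_FOR_COMPUTER, instruction, history, current_screenshot and encode_image are my own placeholder names):

messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {"role": "user", "content": [{"type": "text", "text": PROMPT_FOR_COMPUTER + instruction}]},
]
# history: list of (screenshot, action) pairs from previous steps, oldest first
for screenshot, action in history:
    messages.append({
        "role": "user",
        "content": [{"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{encode_image(screenshot)}"}}],
    })
    messages.append({"role": "assistant", "content": [{"type": "text", "text": action}]})
# the last entry is the most recent screenshot, still awaiting an action
messages.append({
    "role": "user",
    "content": [{"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(current_screenshot)}"}}],
})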

Here is a visualization of the outgoing message object at timestep 2. The task in this example is to search for cat images; the starting website is google.com.

[image]

Here is a visualization of the actions 0-2:

Step 0: [image]

Step 1: [image]

Step 2: [image]

In step 0, the action taken seems to be grounded in the most recent (and only) screenshot. In steps 1 and 2, on the other hand, the agent does not seem to treat the last screenshot as the most recent one; instead, it references GUI elements from previous time steps.

I'd be happy if you could answer the following questions:

  • Is the order of messages correct?
  • Did you experience similar behavior on your side?
  • Do you use a different prompt (see issue 32)?
  • Or am I missing something else?

As recommended, I was using UI-TARS-7B-DPO for inference. Switching to the SFT version improves the expected behavior and the task completion rate. Is there an explanation for that?

korbinian-hoermann avatar Feb 06 '25 18:02 korbinian-hoermann

If the multi-turn conversation has more than 5 turns and maxImage=5, do you take the most recent 5 screenshots and ignore the initial one?

mangoyuan avatar Feb 10 '25 09:02 mangoyuan

(Quoting @korbinian-hoermann's comment above.)

Hi there, thanks for raising this. There were some problems in the previous pseudocode; we have updated our inference code at https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/uitars_agent.py. Please check whether it resolves the problem.

pooruss avatar Feb 10 '25 09:02 pooruss

Yes. If the conversation has more than 5 turns and maxImage=5, only the most recent 5 screenshots are kept and the initial one is dropped.
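
In sketch form (illustrative only, not the actual uitars_agent.py code):

# Illustrative only, not the actual uitars_agent.py code: keep the most recent
# max_images screenshots and drop older ones, including the initial screenshot.
def truncate_screenshots(history, max_images=5):
    """history: list of (screenshot, action) pairs, oldest first."""
    return history[-max_images:]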

pooruss avatar Feb 10 '25 09:02 pooruss

@pooruss There are some possible bugs, or I might be misunderstanding the code in https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/uitars_agent.py

I tried to fix them on my side, but I'm not sure that's correct.

Fix 1: Line 636. Since history_response could be a string, I changed it to something like this:

if type(history_response) == str:
    messages.append({
        "role": "assistant",
        "content": [{"type": "text", "text": history_response}]
    })
else:
    messages.append({
        "role": "assistant",
        "content": [history_response]
    })

In line 677, I believe response.choices[0].message.content.strip() should be a string, and this is what I did:

prediction = response.choices[0].message.content.strip()
                
# prediction = response[0]["prediction"].strip()

After these modifications, the code runs fine. However, I found that starting from the second turn, the generated action frequently looks malformed.

response: 
为了移除Amazon保存的追踪信息,我需要进入Chrome的隐私设置页面。从当前菜单中可以看到"Settings"选项。应该点击菜单中的"Settings"选项,这将打开Chrome的设置页面,在那里我可以管理网站权限和清除浏览数据。
Action: click(start_box='='
             
)

The mistake is fairly consistent: the generated action parameters contain an equals sign instead of coordinates.

Here is an example of my prompt:

[
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are a helpful assistant."
            }
        ]
    },
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task. \n\n## Output Format\n```\nThought: ...\nAction: ...\n```\n\n## Action Space\n\nclick(start_box='[x1, y1, x2, y2]')\nleft_double(start_box='[x1, y1, x2, y2]')\nright_single(start_box='[x1, y1, x2, y2]')\ndrag(start_box='[x1, y1, x2, y2]', end_box='[x3, y3, x4, y4]')\nhotkey(key='')\ntype(content='') #If you want to submit your input, use \"\\n\" at the end of `content`.\nscroll(start_box='[x1, y1, x2, y2]', direction='down or up or right or left')\nwait() #Sleep for 5s and take a screenshot to check for any changes.\nfinished()\ncall_user() # Submit the task and call the user when the task is unsolvable, or when you need the user's help.\n\n\n## Note\n- Use Chinese in `Thought` part.\n- Write a small plan and finally summarize your next action (with its target element) in one sentence in `Thought` part.\n\n## User Instruction\nHey, I need a quick way back to this site. Could you whip up a shortcut on my desktop for me?\n"
            }
        ]
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": "<image>"
            }
        ]
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "Thought: 根据任务要求,我需要在桌面上创建一个快捷方式以便快速访问当前的mathisfun.com网站。在Chrome浏览器中,我可以通过右键点击地址栏中的网址,然后选择\"Copy link location\"选项来复制当前网址。这样就可以将网址复制到剪贴板,之后再通过右键点击桌面空白处创建快捷方式。\nAction: right_single(start_box='(324,218)')"
            }
        ]
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": "<image>"
            }
        ]
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "之前我已经右键点击了地址栏,现在我需要点击\"Copy link location\"选项来复制当前网址。从截图中可以看到该选项在菜单中已经显示出来了。我应该移动鼠标到\"Copy link location\"选项的位置,然后左键单击这个选项来复制网址。\nAction: click(start_box='(373,356)')"
            }
        ]
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": "<image>"
            }
        ]
    }
]

The second and third messages both have the role user, so their content could possibly be combined into a single message. That looks a little weird to me; see the sketch below.
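
If consecutive same-role messages are indeed an issue, merging them is straightforward; here is a minimal sketch (my own helper, not from uitars_agent.py):

# Minimal sketch (my own helper, not from uitars_agent.py): merge consecutive
# messages that share the same role by concatenating their content lists.
def merge_consecutive_roles(messages):
    merged = []
    for msg in messages:
        if merged and merged[-1]["role"] == msg["role"]:
            merged[-1]["content"].extend(msg["content"])
        else:
            merged.append({"role": msg["role"], "content": list(msg["content"])})
    return merged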

Could you please suggest if I'm using the code correctly?

bofei5675 avatar Feb 27 '25 04:02 bofei5675