Is it expected that the UI-TARS v1.5 7B model scores far below the v1 2B model on element_ocr?

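I evaluated both models with https://github.com/VisualWebBench/VisualWebBench and got the results below. On the element_ocr subtask the 2B model is actually much better. The paper does not report per-subtask accuracy, so I would like to confirm whether this is expected.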

2B model:
Model: , Task: web_caption, Scores: rouge_1: 4.54, rouge_2: 1.10, rouge_l: 4.15
Model: , Task: heading_ocr, Scores: rouge_1: 68.94, rouge_2: 67.12, rouge_l: 68.94
Model: , Task: element_ocr, Scores: rouge_1: 94.35, rouge_2: 93.10, rouge_l: 94.35
Model: , Task: action_prediction, Scores: accuracy: 5.34
Model: , Task: element_ground, Scores: accuracy: 93.70

7B model:
Model: , Task: web_caption, Scores: rouge_1: 25.71, rouge_2: 7.17, rouge_l: 23.22
Model: , Task: heading_ocr, Scores: rouge_1: 72.27, rouge_2: 68.17, rouge_l: 72.27
Model: , Task: element_ocr, Scores: rouge_1: 78.47, rouge_2: 75.70, rouge_l: 78.19
Model: , Task: action_prediction, Scores: accuracy: 16.73
Model: , Task: element_ground, Scores: accuracy: 93.70

May 09 '25 07:05 zhangyu68

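Hello. I have not tested on your [WebBench] yet, but when I tested on the ScreenSpot dataset, version 1.5 was also much worse. I also wanted to ask: have you tested the 2B model on ScreenSpot? If so, are your results close to the official ones? The official results are shown in the figure below.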

May 09 '25 07:05 chuheww

https://github.com/bytedance/UI-TARS/blob/main/README_v1.md#local-deployment-vllm

May 09 '25 07:05 chuheww

Could you share how you implemented deployment and coordinate post-processing?

May 10 '25 03:05 JjjFangg

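I deploy with the transformers library, and for coordinate post-processing I use the official post-processing method directly. Both have been mentioned in other issues, with links to the relevant code. At this point it looks like the post-processing for version 1.5 may have a problem.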

May 11 '25 06:05 chuheww

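Have you tried the official tutorial? In actual use, you need to make sure the resolution of the image fed to the model is exactly the same as the resolution used in post-processing. Version 1.5 outputs absolute coordinates, so a resolution mismatch has a large effect. This is a significant difference from 1.0.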
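To illustrate the resolution point, here is a minimal sketch of mapping an absolute-coordinate prediction back to the original screenshot. It is not the official post-processing code: the function name `to_original_pixels` and the image sizes are made up for the example, and it assumes the model's coordinates are expressed in the pixel space of the resized image the model actually saw.

```python
def to_original_pixels(x, y, model_size, original_size):
    """Map a point given in model-input pixels back to original-image pixels.

    model_size and original_size are (width, height) tuples.
    """
    model_w, model_h = model_size
    orig_w, orig_h = original_size
    return x * orig_w / model_w, y * orig_h / model_h


# Hypothetical sizes: a 1920x1080 screenshot that was resized to 1288x728
# before being fed to the model.
original_size = (1920, 1080)
model_size = (1288, 728)
pred = (644, 364)  # predicted point: the centre of the resized image

correct = to_original_pixels(*pred, model_size, original_size)
wrong = pred  # using the prediction as-is, as if it were in original pixels

print(correct)  # centre of the original screenshot
print(wrong)    # lands in the upper-left quadrant instead
```

If the size used in post-processing differs from the size the model saw, every prediction is shifted by the same ratio, which is enough to turn most grounding hits into misses.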

May 12 '25 01:05 JjjFangg

OK, thank you for the reply. I have solved that problem. My other question concerns the 2B model on the ScreenSpot dataset. The official results are:
Mobile-Text | Mobile-Icon/Widget | Desktop-Text | Desktop-Icon/Widget | Web-Text | Web-Icon/Widget | Avg
93.0 | 75.5 | 90.7 | 68.6 | 84.3 | 74.8 | 82.3
But my results are:
79.12% (216/273) | 45.98% (103/224) | 71.13% (138/194) | 35.51% (49/138) | 74.24% (170/229) | 50.00% (103/206) (I did not compute the final Avg.)
The prompt I used is GROUNDING from codes/prompt.py. Is there something else I am still doing wrong?
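For reference, the missing average can be computed from the hit counts reported above. The thread does not say whether the paper's Avg is weighted by sample count, so this sketch computes both variants; the dictionary layout and variable names are just for illustration.

```python
# (hits, total) per ScreenSpot category, copied from the counts reported above.
counts = {
    "Mobile-Text": (216, 273),
    "Mobile-Icon/Widget": (103, 224),
    "Desktop-Text": (138, 194),
    "Desktop-Icon/Widget": (49, 138),
    "Web-Text": (170, 229),
    "Web-Icon/Widget": (103, 206),
}

# Accuracy of each category, in percent.
per_category = {name: 100 * hits / total for name, (hits, total) in counts.items()}

# Macro average: every category weighs the same.
macro_avg = sum(per_category.values()) / len(per_category)

# Micro average: every sample weighs the same.
total_hits = sum(hits for hits, _ in counts.values())
total_samples = sum(total for _, total in counts.values())
micro_avg = 100 * total_hits / total_samples

print(f"macro: {macro_avg:.2f}  micro: {micro_avg:.2f}")
```

Either way the average is roughly 60, against the reported 82.3.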

May 12 '25 03:05 chuheww

If the 2B model here means UI-TARS-2B-SFT, you can try the following prompt:

<image> Output only the coordinate of one point in your response. What element matches the following task: User Instruction

⚠️ Note: this model is not version 1.5, so the coordinates it outputs are relative coordinates in the range 0–1000, not absolute pixel positions in the image.
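A minimal sketch of that conversion, assuming the 0–1000 range maps linearly onto the original image width and height. The helper name `relative_to_pixels` is hypothetical and this is not the official parser.

```python
def relative_to_pixels(x_rel, y_rel, width, height, scale=1000):
    """Convert a point in 0..scale relative coordinates to pixel coordinates.

    width and height are the dimensions of the original screenshot.
    """
    return round(x_rel / scale * width), round(y_rel / scale * height)


# Example: a prediction of (500, 250) on a 1920x1080 screenshot.
point = relative_to_pixels(500, 250, 1920, 1080)
print(point)
```

Because the output is relative, the v1 models do not depend on the resized input resolution the way 1.5 does. Only the original image size is needed here.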

May 13 '25 09:05 JjjFangg

Hello. With the UI-TARS-2B-SFT model and the prompt "<image> Output only the coordinate of one point in your response. What element matches the following task: User Instruction", my ScreenSpot results are close to the ones above and still fall short of the numbers reported in the paper.

May 14 '25 01:05 chuheww

Could you share your inference parameters? I recommend evaluating with greedy decoding.

May 14 '25 06:05 JjjFangg

Hello, and thank you for the reply. I am a beginner, so my questions may have been confusing. I will paste my test code directly and hope you can suggest changes.

Initialization:

    def __init__(
        self,
        model_path="./UI-TARS-2B-SFT",
        device_map="auto",
    ):
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_path,
            trust_remote_code=True,
            use_fast=True
        )

        self.processor = AutoProcessor.from_pretrained(
            model_path,
            trust_remote_code=True,
            use_fast=True
        )

        self.model = AutoModelForVision2Seq.from_pretrained(
            model_path,
            device_map=device_map,
            trust_remote_code=True
        ).eval()

Input:

    formatted_prompt = GROUNDING.format(
        # language=language,
        instruction,
        instruction=instruction
    )
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": formatted_prompt}
            ]
        }
    ]
    text = self.processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    inputs = self.build_multimodal_input(
        image=image,
        instruction=instruction,
        # language=language,
    )

    generated_ids = self.model.generate(
        **inputs,
        max_new_tokens=200,
        pad_token_id=self.tokenizer.pad_token_id,
        do_sample=False,
        num_beams=1,
        eos_token_id=self.tokenizer.eos_token_id
    )

    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]

Parsing the structured action:

    response = self.processor.batch_decode(
        generated_ids_trimmed,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )[0]
    model_type = "qwen2vl"
    mock_response_dict = parse_action_to_structure_output(response, 1000, original_image_height, original_image_width, model_type)
    parsed_pyautogui_code = parsing_response_to_pyautogui_code(mock_response_dict, original_image_height, original_image_width)

With the latest UI-TARS-2B-SFT model, the script above gives:
Mobile-Text | Mobile-Icon/Widget | Desktop-Text | Desktop-Icon/Widget | Web-Text | Web-Icon/Widget
82.78% (226/273) | 51.98% (118/227) | 77.20% (149/193) | 43.17% (60/139) | 76.86% (176/229) | 55.83% (115/206)

Thank you for your help.

May 14 '25 07:05 chuheww

I suggest initializing as follows (see the official Qwen2-VL inference tutorial for details):

    min_pixels = 256 * 28 * 28
    max_pixels = 1340 * 28 * 28
    processor = AutoProcessor.from_pretrained(
        "Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
    )
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        model_path, torch_dtype="auto", device_map="auto"
    )

May 15 '25 03:05 JjjFangg

@JjjFangg Do you have any advice on serving ByteDance-Seed/UI-TARS-1.5-7B with vLLM? I am also running into inaccurate grounding.

May 15 '25 08:05 NEOOOOOOOOOO

@NEOOOOOOOOOO Could you resize the image before calling the model? vLLM only serves the model, so you can do the preprocessing yourself at call time.
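One way to do that preprocessing is to compute the target size yourself, so that the size you send matches the size you later use in post-processing. The sketch below re-implements the rounding scheme commonly used by Qwen2-VL-style processors (sides rounded to multiples of 28, total pixels clamped to a budget). It is an approximation written for this example, not the library function, and the default pixel budget is an assumption you should replace with your own serving configuration.

```python
import math


def smart_resize(height, width, factor=28,
                 min_pixels=256 * 28 * 28, max_pixels=1340 * 28 * 28):
    """Return (new_height, new_width): multiples of `factor` within the pixel budget.

    The default budget values are assumptions for illustration only.
    """
    # Round each side to the nearest multiple of `factor`.
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    if h_bar * w_bar > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Too small: enlarge both sides by the same ratio, rounding up.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar


# A 1920x1080 screenshot (height=1080, width=1920) under the assumed budget.
new_h, new_w = smart_resize(1080, 1920)
print(new_h, new_w)
```

Resizing the screenshot to `(new_w, new_h)` before sending it means the served model should see the image unchanged, provided the server's own pixel limits are not tighter. Predicted coordinates can then be scaled back with `orig_w / new_w` and `orig_h / new_h`.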

May 19 '25 09:05 six-wood

I suggest following the official Qwen2.5-VL tutorial for inference. For coordinate post-processing, you can refer to this tutorial.

May 29 '25 06:05 JjjFangg

Hello, I have some questions about these absolute coordinates.

1. With the following code, my results come out fairly close:

    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        qwen_path,
        torch_dtype=torch.bfloat16,
        # attn_implementation="flash_attention_2",
        device_map="cuda"
    )

    processor = AutoProcessor.from_pretrained(qwen_path)

    image = Image.open(img_path)

    inputs = processor(
        text=[text],
        images=[image],
        padding=True,
        return_tensors="pt",
    ).to('cuda')

    output_ids = model.generate(**inputs, max_new_tokens=1024)

    generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
    output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    input_height = inputs['image_grid_thw'][0][1] * 14
    input_width = inputs['image_grid_thw'][0][2] * 14

This code follows the coordinate conversion specified by the official Qwen2.5-VL. It works for Qwen2.5 and also runs with UI-TARS-1.5, but 1.5 comes out slightly worse than the 2B results in the paper:

    INFO:root:[[0.9551724137931035, 0.8483412322274881], [0.9381443298969072, 0.8071428571428572], [0.9145299145299145, 0.8374384236453202]]

It is also nowhere near 94.2. Why is that?

2. I used the code from https://github.com/bytedance/UI-TARS/blob/main/README_coordinates.md. Running it directly gives almost entirely wrong results, and the rescaled coordinates are far off.

Am I doing something wrong in either of these two cases? I look forward to your reply. Thank you.

Jun 25 '25 02:06 yikangshao

Was this ever resolved? I tested the UI-TARS-1.5-7B model and my results are also low.

Aug 07 '25 05:08 little-pikachu