UI-TARS-1.5-7B does not output bounding boxes

Open mearcstapa-gqz opened this issue 8 months ago • 1 comments

I noticed that UI-TARS-1.5-7B does not output a bounding box for an element, even when the prompt explicitly asks for one.

Is this because UI-TARS-1.5-7B was trained exclusively on points rather than bounding boxes (bbox)?

I also noticed that the action format used in the prompt has changed.

click(start_box='[x1, y1, x2, y2]')
https://github.com/xlang-ai/OSWorld/commit/0bc1e084400e101848dcc48893bf24a0f9e6db2f
https://github.com/bytedance/UI-TARS-desktop/blob/fba1e6bd6de2520043ee1b07a05be2e9f23d1e9a/packages/ui-tars/sdk/src/constants.ts

click(start_box='<|box_start|>(x1,y1)<|box_end|>')
https://github.com/bytedance/UI-TARS-desktop/blob/main/apps/ui-tars/src/main/agent/prompts.ts
https://github.com/bytedance/UI-TARS/blob/main/prompts.py
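
For anyone who needs to handle both formats downstream, here is a minimal sketch of a parser that accepts either one and returns a single click point. It is not taken from any of the linked repositories: the function name parse_start_box is hypothetical, and the regular expressions are assumptions based only on the two prompt templates quoted here, so the model's real output may differ (for example in spacing, or in whether the special tokens are present). When a box is given, the sketch simply uses its center.

```python
import re
from typing import Optional, Tuple

# Assumed number syntax: optional sign, digits, optional decimal part.
NUM = r"-?\d+(?:\.\d+)?"

# Assumed patterns, derived only from the two prompt templates quoted above.
BOX_RE = re.compile(
    rf"\[\s*({NUM})\s*,\s*({NUM})\s*,\s*({NUM})\s*,\s*({NUM})\s*\]"
)
POINT_RE = re.compile(rf"\(\s*({NUM})\s*,\s*({NUM})\s*\)")
START_BOX_RE = re.compile(r"start_box='([^']*)'")


def parse_start_box(action: str) -> Optional[Tuple[float, float]]:
    """Return an (x, y) click target from either format, or None."""
    match = START_BOX_RE.search(action)
    if match is None:
        return None
    arg = match.group(1)

    # Box format: '[x1, y1, x2, y2]' -> use the center of the box.
    box = BOX_RE.search(arg)
    if box:
        x1, y1, x2, y2 = map(float, box.groups())
        return ((x1 + x2) / 2, (y1 + y2) / 2)

    # Point format: '<|box_start|>(x1,y1)<|box_end|>' -> use the point as-is.
    point = POINT_RE.search(arg)
    if point:
        x, y = map(float, point.groups())
        return (x, y)

    return None
```

Reducing a box to its center loses the element's extent, so this only helps when a click target is all that is needed.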

Apr 18 '25 19:04 mearcstapa-gqz

Yes, UI-TARS-1.5-7B was trained to output only points.

Apr 21 '25 04:04 JjjFangg