
Function of the Element Description & Dense Captioning steps (the role of enhanced-perception steps such as element description and page description)

Open shuiyigt opened this issue 8 months ago • 0 comments

The paper describes several steps for enhancing perception: element description, dense captioning, state transition captioning, SoM, QA, and grounding.

I wonder whether the FIRST THREE steps are used only as training tasks to adjust the model parameters and play no part in the subsequent process, or whether they can also be used in the Agent instructions.

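I would like to understand whether the enhanced-perception steps, such as element description, page description, and the description of differences between the before and after pages, are trained only as separate tasks to improve the model's understanding across several dimensions, or whether this information is also added to the instruction when the Agent actually performs Computer/Mobile use, so that the model has more direct background knowledge. The provided Demo does not seem to mention using them as a prompt, but wouldn't adding this information help the model make better decisions, especially during System-2 reasoning?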

shuiyigt, Apr 21 '25 13:04