UI-TARS: How can I get the model to follow touch swipe instructions? Right now the output for left and right instructions is the same

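For the same screenshot, the two instructions scroll left and scroll right both produce a scroll left. How can this be fixed?
instruction: scroll left
screenshot_path: screen1.jpeg
From the screen layout, this is a phone's home screen. According to the task, I need to swipe the screen to the left to reveal more app icons. On the home screen, the user can swipe to reach more apps and features. To perform the left swipe, I need to place a finger on the screen and swipe horizontally. This will reveal more app icons and widgets.
Action: scroll(start_box='(697,321)', end_box='(0,354)')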

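instruction: scroll right
screenshot_path: screen1.jpeg
From the screen content, this is a phone's home screen. According to the task, I need to swipe the screen to the right to see more apps or information. On a phone interface, swiping right is a common operation that can reveal more app icons or widgets.
Action: scroll(start_box='(702,436)', end_box='(159,438)')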

System prompt:
## Below is the prompt for mobile
prompt = r"""You are a mobile GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task. 

## Output Format
```\nThought: ...
Action: ...\n```

## Action Space
click(start_box='<|box_start|>(x1,y1)<|box_end|>')
long_press(start_box='<|box_start|>(x1,y1)<|box_end|>', time='')
type(content='')
scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
press_home()
press_back()
finished(content='') # Submit the task regardless of whether it succeeds or fails.

## Note
- Use Chinese in `Thought` part.

- Write a small plan and finally summarize your next action (with its target element) in one sentence in `Thought` part.

## User Instruction
"""

Apr 02 '25 03:04 TSHunterY

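The same happens with scroll up and scroll down: both outputs are scroll up.
instruction: scroll up
screenshot_path: screen3.jpeg
Swipe down from the status bar at the top of the screen to open the notification center and view quick settings and notifications.
Action: scroll(start_box='(486,907)', end_box='(453,216)')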

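instruction: scroll down
screenshot_path: screen3.jpeg
From the current screen, I can see that I am on the settings page and need to scroll down to see more settings options. According to the task, I need to perform a scroll-down. On the settings page, to see more settings options, I need to use the screen's pull-down function. I will swipe up on the screen to achieve the scroll-down effect.
Action: scroll(start_box='(504,897)', end_box='(469,321)')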
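
One way to check which gesture each output actually describes is to parse the two points and compare them. The sketch below is only an illustration and not part of UI-TARS: `parse_scroll` and `swipe_direction` are hypothetical helper names, and it assumes screen coordinates with the origin at the top-left (x grows to the right, y grows downward). It reports the direction the finger travels; as the last Thought above notes, a finger moving up scrolls the view down.

```python
import re
from typing import Tuple

Point = Tuple[int, int]

# Matches "(x,y)" pairs, with or without <|box_start|>/<|box_end|> tokens around them.
POINT_RE = re.compile(r"\((\d+)\s*,\s*(\d+)\)")


def parse_scroll(action: str) -> Tuple[Point, Point]:
    """Extract the start and end points from a scroll(...) action string."""
    if not action.strip().startswith("scroll("):
        raise ValueError(f"not a scroll action: {action!r}")
    points = [(int(x), int(y)) for x, y in POINT_RE.findall(action)]
    if len(points) != 2:
        raise ValueError(f"expected two points, found {len(points)}")
    return points[0], points[1]


def swipe_direction(start: Point, end: Point) -> str:
    """Direction of finger travel, assuming a top-left origin (y grows downward)."""
    dx = end[0] - start[0]
    dy = end[1] - start[1]
    if dx == 0 and dy == 0:
        return "none"
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "down" if dy > 0 else "up"


# The four Action lines quoted in this thread, keyed by the instruction given.
OUTPUTS = {
    "scroll left": "scroll(start_box='(697,321)', end_box='(0,354)')",
    "scroll right": "scroll(start_box='(702,436)', end_box='(159,438)')",
    "scroll up": "scroll(start_box='(486,907)', end_box='(453,216)')",
    "scroll down": "scroll(start_box='(504,897)', end_box='(469,321)')",
}

if __name__ == "__main__":
    for instruction, action in OUTPUTS.items():
        print(instruction, "->", swipe_direction(*parse_scroll(action)))
```

Under these assumptions the four outputs come out as left, left, up, up in finger-travel terms, which matches the behaviour reported here.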

Apr 02 '25 03:04 TSHunterY

How did you wire the whole pipeline together? Do you have to write your own code to execute the actions and upload the images?

Apr 15 '25 08:04 ArtificialIdoit

How did you wire the whole pipeline together? Do you have to write your own code to execute the actions and upload the images?

Yes, you need to write some code for that.
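
A rough sketch of what such glue code might look like, assuming nothing about how the screenshot is captured or how the model is served: `capture`, `ask_model`, and `execute` are hypothetical callables that you would supply yourself, and only the `Thought: ... Action: ...` reply format comes from the system prompt above.

```python
from typing import Callable, List


def extract_action(reply: str) -> str:
    """Return the text after the last 'Action:' marker in a model reply."""
    marker = "Action:"
    idx = reply.rfind(marker)
    if idx == -1:
        raise ValueError("reply contains no 'Action:' line")
    # Drop surrounding whitespace and any code-fence backticks.
    return reply[idx + len(marker):].strip().strip("`").strip()


def run_episode(
    task: str,
    capture: Callable[[], bytes],                      # hypothetical: returns a screenshot
    ask_model: Callable[[str, List[str], bytes], str],  # hypothetical: returns the raw reply
    execute: Callable[[str], None],                    # hypothetical: performs one action
    max_steps: int = 10,
) -> List[str]:
    """Screenshot -> model -> action loop; stops on finished(...) or max_steps."""
    history: List[str] = []
    for _ in range(max_steps):
        screenshot = capture()
        reply = ask_model(task, history, screenshot)
        action = extract_action(reply)
        history.append(action)
        if action.startswith("finished("):
            break
        execute(action)
    return history
```

Passing the three steps in as functions keeps the loop independent of the device backend and of the model endpoint, so each part can be swapped or tested separately.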

Apr 18 '25 02:04 TSHunterY

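Hello, I'd like to ask about a few details. To start a task automatically, do I need my own code up front to launch the app, take the screenshot, and so on, and then submit the result to the model? Also, the authors' paper mentions normalized coordinates, so to actually operate the device in the end, should that be done with adb commands, that is, by converting the structured Action returned by the model into code that drives the device?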
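
As a rough sketch of that last step, the snippet below turns a scroll action into an `adb shell input swipe` command. It assumes the model's coordinates are relative values on a 0 to 1000 scale, which should be checked against the UI-TARS documentation for the model version in use; `COORD_SCALE`, `to_pixels`, and `scroll_to_adb` are hypothetical names. The command is only built here, not run.

```python
import re
from typing import List, Tuple

# ASSUMPTION: model coordinates span 0..1000; verify this for your model version.
COORD_SCALE = 1000

# Matches "(x,y)" pairs, with or without <|box_start|>/<|box_end|> tokens around them.
POINT_RE = re.compile(r"\((\d+)\s*,\s*(\d+)\)")


def to_pixels(point: Tuple[int, int], width: int, height: int,
              scale: int = COORD_SCALE) -> Tuple[int, int]:
    """Map a relative point to device pixels, clamped to the screen bounds."""
    x, y = point
    px = min(width - 1, round(x / scale * width))
    py = min(height - 1, round(y / scale * height))
    return px, py


def scroll_to_adb(action: str, width: int, height: int,
                  duration_ms: int = 300) -> List[str]:
    """Build an `adb shell input swipe` argument list from a scroll(...) action."""
    points = [(int(x), int(y)) for x, y in POINT_RE.findall(action)]
    if not action.strip().startswith("scroll(") or len(points) != 2:
        raise ValueError(f"unsupported action: {action!r}")
    (x1, y1), (x2, y2) = (to_pixels(p, width, height) for p in points)
    return ["adb", "shell", "input", "swipe",
            str(x1), str(y1), str(x2), str(y2), str(duration_ms)]


if __name__ == "__main__":
    cmd = scroll_to_adb(
        "scroll(start_box='(504,897)', end_box='(469,321)')",
        width=1080, height=2400,  # example resolution; query yours with `adb shell wm size`
    )
    print(" ".join(cmd))
```

To execute the gesture on a connected device, the returned list can be passed to `subprocess.run`.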

Jun 04 '25 15:06 yuan0818