Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception


Junyang Wang1, Haiyang Xu2†, Jiabo Ye2, Ming Yan2†,
Weizhou Shen2, Ji Zhang2, Fei Huang2, Jitao Sang1†
{junyangwang, jtsang}@bjtu.edu.cn, {shuofeng.xhy, ym119608}@alibaba-inc.com

1Beijing Jiaotong University 2Alibaba Group
†Corresponding author

📋Introduction

  • Pure visual solution, independent of XML and system metadata.
  • Unrestricted operation scope, capable of multi-app operations.
  • Multiple visual perception tools for operation localization.
  • No need for exploration and training, plug and play.

📢News

  • [2.5] 🔥🔥We provide a free API and have deployed the entire process, so you can try Mobile-Agent even if you don't have an OpenAI API key. Check out Quick Start.
  • [2.2] 🔥We are deploying the demo based on Gradio and users will be able to upload the screenshots.
  • [1.31] 🔥Our code is available! Welcome to try Mobile-Agent.
  • [1.31] 🔥Human-operated data in Mobile-Eval is in preparation and will be open-sourced soon.
  • [1.30] Our paper is available at LINK.
  • [1.30] Our evaluation results on Mobile-Eval are available.
  • [1.30] The code and Mobile-Eval benchmark are coming soon!

📺Demo

https://github.com/X-PLUG/MobileAgent/assets/127390760/26c48fb0-67ed-4df6-97b2-aa0c18386d31

🔧Preparation

Installation

git clone https://github.com/X-PLUG/MobileAgent.git
cd MobileAgent
pip install -r requirements.txt

Preparation for Connecting Mobile Device

  1. Download the Android Debug Bridge.
  2. Turn on ADB debugging on your Android phone. The switch is in the developer options, which must be enabled first.
  3. Connect your phone to the computer with a data cable and select "Transfer files".
  4. Test your ADB environment as follows: /path/to/adb devices. If the connected device is listed, the preparation is complete.
  5. If you are using macOS or Linux, make sure adb has execute permission: sudo chmod +x /path/to/adb
  6. If you are using Windows, the path will be xx/xx/adb.exe
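
Step 4 can also be checked from Python. The following is a minimal sketch, not part of Mobile-Agent: the function names `parse_adb_devices` and `connected_devices` are hypothetical, and the parser assumes the usual `adb devices` output, in which each ready device appears as a serial number followed by the word `device`.

```python
import subprocess
from typing import List


def parse_adb_devices(output: str) -> List[str]:
    """Return the serials of devices that adb reports in the 'device' state."""
    serials = []
    for line in output.splitlines():
        parts = line.split()
        # A ready device line has exactly two fields: "<serial> device".
        # The header line and states such as "unauthorized" do not match.
        if len(parts) == 2 and parts[1] == "device":
            serials.append(parts[0])
    return serials


def connected_devices(adb_path: str) -> List[str]:
    """Run `<adb_path> devices` and return the serials of ready devices."""
    result = subprocess.run(
        [adb_path, "devices"], capture_output=True, text=True, check=True
    )
    return parse_adb_devices(result.stdout)
```

If `connected_devices("/path/to/adb")` returns an empty list, the phone is either not connected or has not yet authorized the computer for debugging.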

🔧Quick Start

Note

❗GPT-4V hallucinates severely when perceiving non-English screenshots, so we strongly recommend using Mobile-Agent on English-only systems and apps to ensure good performance.

❗Because our resources are currently limited, please contact us to get a free API key, which consists of a URL and a token.

Run

python run_api.py --adb_path /path/to/adb --url "The url you got" --token "The token you got" --instruction "your instruction"

🔧Getting Started with your own API Key

Preparation for Visual Perception Tools

  1. Download the icon detection model Grounding DINO.
  2. The text detection model will be downloaded automatically from ModelScope the first time you run Mobile-Agent.

Run

python run.py --grounding_ckpt /path/to/GroundingDION --adb_path /path/to/adb --api "your API_TOKEN" --instruction "your instruction"

API_TOKEN is an OpenAI API key with permission to access gpt-4-vision-preview.
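
If you prefer not to type the key on the command line, the same invocation can be assembled in Python. This is only a sketch under stated assumptions: `build_run_command` is a hypothetical helper, the environment variable name `MOBILE_AGENT_API_TOKEN` is made up for illustration, and the flags are exactly those shown in the command above.

```python
import sys
from typing import List


def build_run_command(grounding_ckpt: str, adb_path: str,
                      api_token: str, instruction: str) -> List[str]:
    """Build the argument list for run.py using the flags documented above."""
    return [
        sys.executable, "run.py",
        "--grounding_ckpt", grounding_ckpt,
        "--adb_path", adb_path,
        "--api", api_token,
        # Passed as a single list element, so spaces need no shell quoting.
        "--instruction", instruction,
    ]


# Example use (assumes the hypothetical MOBILE_AGENT_API_TOKEN variable is set):
#   import os, subprocess
#   cmd = build_run_command("/path/to/GroundingDION", "/path/to/adb",
#                           os.environ["MOBILE_AGENT_API_TOKEN"],
#                           "Turn on the dark mode.")
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the instruction contains spaces or quotation marks.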

📱Mobile-Eval

Mobile-Eval is a benchmark designed for evaluating the performance of mobile device agents. This benchmark includes 10 mainstream single-app scenarios and 1 multi-app scenario.

For each scenario, we have designed three instructions:

  • Instruction 1: a relatively simple, basic task
  • Instruction 2: the task of Instruction 1 with additional requirements
  • Instruction 3: a user demand with no explicit task indication

The detailed content of Mobile-Eval is as follows:

Applications and their instructions

Alibaba.com
  1. Help me find caps in Alibaba.com.
  2. Help me find caps in Alibaba.com. If the "Add to cart" is available in the item information page, please add the item to my cart.
  3. I want to buy a cap. I've heard things are cheap on Alibaba.com. Maybe you can find it for me.

Amazon Music
  1. Search singer Jay Chou in Amazon Music.
  2. Search a music about "agent" in Amazon Music and play it.
  3. I want to listen music to relax. Find an App to help me.

Chrome
  1. Search result for today's Lakers game.
  2. Search the information about Taylor Swift.
  3. I want to know the result for today's Lakers game. Find an App to help me.

Gmail
  1. Send an empty email to {address}.
  2. Send an email to {address} to tell my new work.
  3. I want to let my friend know my new work, and his address is {address}. Find an App to help me.

Google Maps
  1. Navigate to Hangzhou West Lake.
  2. Navigate to a nearby gas station.
  3. I want to go to Hangzhou West Lake, but I don't know the way. Find an App to help me.

Google Play
  1. Download WhatsApp in Play Store.
  2. Download Instagram in Play Store.
  3. I want WhatsApp on my phone. Find an App to help me.

Notes
  1. Create a new note in Notes.
  2. Create a new note in Notes and write "Hello, this is a note", then save it.
  3. I suddenly have something to record, so help me find an App and write down the following content: meeting at 3pm.

Settings
  1. Turn on the dark mode.
  2. Turn on the airplane mode.
  3. I want to see the real time internet speed at the battery level, please turn on this setting for me.

TikTok
  1. Swipe a video about pet cat in TikTok and click a "like" for this video.
  2. Swipe a video about pet cat in TikTok and comment "Ohhhh, so cute cat!".
  3. Swipe videos in TikTok. Click "like" for 3 pet video cat.

YouTube
  1. Search for videos about Stephen Curry on YouTube.
  2. Search for videos about Stephen Curry on YouTube and open "Comments" to comment "Oh, chef, your basketball spirit has always inspired me".
  3. I need you to help me show my love for Stephen Curry on YouTube.

Multi-App
  1. Open the calendar and look at today's date, then go to Notes and create a new note to write "Today is {today's date}".
  2. Check the temperature in the next 5 days, and then create a new note in Notes and write a temperature analysis.
  3. Search the result for today's Lakers game, and then create a note in Notes to write a sport news for this result.
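
One way to hold these scenarios in code is sketched below. It is an illustration, not the benchmark's actual data format: the `Scenario` class and the `fill` helper are hypothetical names, and the only assumption about the instructions is that placeholders such as `{address}` are replaced before an instruction is given to the agent.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Scenario:
    """One Mobile-Eval scenario: an app (or Multi-App) and its three instructions."""
    app: str
    instructions: Tuple[str, str, str]


def fill(instruction: str, **values: str) -> str:
    """Replace each {name} placeholder with the value supplied for that name."""
    for name, value in values.items():
        instruction = instruction.replace("{" + name + "}", value)
    return instruction


google_maps = Scenario(
    app="Google Maps",
    instructions=(
        "Navigate to Hangzhou West Lake.",
        "Navigate to a nearby gas station.",
        "I want to go to Hangzhou West Lake, but I don't know the way. "
        "Find an App to help me.",
    ),
)

# 10 single-app scenarios plus 1 multi-app scenario, three instructions each.
TOTAL_INSTRUCTIONS = (10 + 1) * len(google_maps.instructions)

gmail_3 = fill(
    "I want to let my friend know my new work, and his address is {address}. "
    "Find an App to help me.",
    address="friend@example.com",  # made-up address for illustration
)
```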

📝Evaluation results

We evaluated Mobile-Agent on Mobile-Eval. The evaluation results are available at LINK.

  • We have stored the evaluation results for the 10 apps and the multi-app scenario in folders named after each app.
  • The numbered folders within each app's folder hold the results for the corresponding instruction of that app.
  • For example, to view the results of Mobile-Agent for the second instruction in Google Maps, go to the following path: results/Google Maps/2.
  • If the last action of Mobile-Agent is not "stop", Mobile-Agent did not complete the corresponding instruction. During the evaluation, we manually terminated the cases that could not be completed.
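
The folder convention and the completion rule above can be sketched as follows. The helper names `result_dir` and `completed` are hypothetical, and because the on-disk format of the recorded actions is not described here, `completed` simply assumes the actions have already been loaded into a list of strings.

```python
from pathlib import Path
from typing import Sequence


def result_dir(app: str, instruction_no: int, root: str = "results") -> Path:
    """Return the folder for one instruction, e.g. results/Google Maps/2."""
    if instruction_no not in (1, 2, 3):
        raise ValueError("each scenario has exactly three instructions")
    return Path(root) / app / str(instruction_no)


def completed(actions: Sequence[str]) -> bool:
    """An instruction counts as completed only if the last action is 'stop'."""
    return len(actions) > 0 and actions[-1].strip() == "stop"
```

For example, `result_dir("Google Maps", 2)` points at the folder named in the bullet above, and `completed(["click", "type"])` is false because the run did not end with "stop".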

📄To-do List

  • Development of Mobile-Agent app on Android platform.
  • Adaptation to other mobile device platforms.

📑Citation

If you find Mobile-Agent useful for your research and applications, please cite using this BibTeX:

@article{wang2024mobile,
  title={Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception},
  author={Wang, Junyang and Xu, Haiyang and Ye, Jiabo and Yan, Ming and Shen, Weizhou and Zhang, Ji and Huang, Fei and Sang, Jitao},
  journal={arXiv preprint arXiv:2401.16158},
  year={2024}
}

📦Related Projects