Integrate Set-of-Mark Visual Prompting for GPT-4V
I noticed that you currently seem to apply a grid to the images to assist the vision model:
- https://github.com/OthersideAI/self-operating-computer/blob/main/operate/main.py#L462-L527
You also mention this in the README:

> **Current Challenges**
>
> Note: The GPT-4v's error rate in estimating XY mouse click locations is currently quite high. This framework aims to track the progress of multimodal models over time, aspiring to achieve human-level performance in computer operation.
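For illustration only, here's roughly how I picture that kind of grid overlay (a minimal Pillow sketch with assumed filenames and 10% spacing, not the project's actual implementation):

```python
# Illustrative sketch only (assumed filenames / spacing), not the repo's actual code:
# draw labelled grid lines on a screenshot so the model can describe a click
# target relative to the grid instead of guessing raw pixel coordinates.
from PIL import Image, ImageDraw

image = Image.open("screenshot.png").convert("RGB")
draw = ImageDraw.Draw(image)
width, height = image.size
steps = 10  # grid lines every 10% of the screen

for i in range(1, steps):
    x, y = width * i // steps, height * i // steps
    draw.line([(x, 0), (x, height)], fill="green", width=1)   # vertical line
    draw.line([(0, y), (width, y)], fill="green", width=1)    # horizontal line
    draw.text((x + 2, 2), f"{i * 10}%", fill="green")          # label along the x axis
    draw.text((2, y + 2), f"{i * 10}%", fill="green")          # label along the y axis

image.save("screenshot_grid.png")
```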
I was wondering: have you looked at using Set-of-Mark (SoM) visual prompting for GPT-4V, or similar techniques?
See Also
A bit of a link dump from one of my references; a rough sketch of the basic "segment and number the regions" step follows after the list:
- https://github.com/microsoft/SoM
  - > Set-of-Mark Prompting for LMMs
  - > Set-of-Mark Visual Prompting for GPT-4V
  - > We present Set-of-Mark (SoM) prompting, simply overlaying a number of spatial and speakable marks on the images, to unleash the visual grounding abilities in the strongest LMM -- GPT-4V. Let's using visual prompting for vision!
  - https://arxiv.org/abs/2310.11441
    - > Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
    - > We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks e.g., alphanumerics, masks, boxes. Using the marked image as input, GPT-4V can answer the questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks. For example, our experiments show that GPT-4V with SoM in zero-shot setting outperforms the state-of-the-art fully-finetuned referring expression comprehension and segmentation model on RefCOCOg. Code for SoM prompting is made public at: this https URL.
- https://github.com/facebookresearch/segment-anything
  - > Segment Anything
  - > The repository provides code for running inference with the SegmentAnything Model (SAM), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
  - > The Segment Anything Model (SAM) produces high quality object masks from input prompts such as points or boxes, and it can be used to generate masks for all objects in an image. It has been trained on a dataset of 11 million images and 1.1 billion masks, and has strong zero-shot performance on a variety of segmentation tasks.
- https://github.com/UX-Decoder/Semantic-SAM
  - > Official implementation of the paper "Semantic-SAM: Segment and Recognize Anything at Any Granularity"
  - > In this work, we introduce Semantic-SAM, a universal image segmentation model to enable segment and recognize anything at any desired granularity. We have trained on the whole SA-1B dataset and our model can reproduce SAM and beyond it.
  - > Segment everything for one image. We output controllable granularity masks from semantic, instance to part level when using different granularity prompts.
- https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once
  - > SEEM: Segment Everything Everywhere All at Once
  - > [NeurIPS 2023] Official implementation of the paper "Segment Everything Everywhere All at Once"
  - > We introduce SEEM that can Segment Everything Everywhere with Multi-modal prompts all at once. SEEM allows users to easily segment an image using prompts of different types including visual prompts (points, marks, boxes, scribbles and image segments) and language prompts (text and audio), etc. It can also work with any combination of prompts or generalize to custom prompts!
- https://github.com/IDEA-Research/GroundingDINO
  - > Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
- https://github.com/IDEA-Research/OpenSeeD
  - > [ICCV 2023] Official implementation of the paper "A Simple Framework for Open-Vocabulary Segmentation and Detection"
- https://github.com/IDEA-Research/MaskDINO
  - > [CVPR 2023] Official implementation of the paper "Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation"
- https://github.com/facebookresearch/VLPart
  - > [ICCV2023] VLPart: Going Denser with Open-Vocabulary Part Segmentation
  - > Object detection has been expanded from a limited number of categories to open vocabulary. Moving forward, a complete intelligent vision system requires understanding more fine-grained object descriptions, object parts. In this work, we propose a detector with the ability to predict both open-vocabulary objects and their part segmentation.
- https://github.com/ddupont808/GPT-4V-Act
  - > GPT-4V-Act: Chromium Copilot
  - > AI agent using GPT-4V(ision) capable of using a mouse/keyboard to interact with web UI
  - > GPT-4V-Act serves as an eloquent multimodal AI assistant that harmoniously combines GPT-4V(ision) with a web browser. It's designed to mirror the input and output of a human operator—primarily screen feedback and low-level mouse/keyboard interaction. The objective is to foster a smooth transition between human-computer operations, facilitating the creation of tools that considerably boost the accessibility of any user interface (UI), aid workflow automation, and enable automated UI testing.
  - > GPT-4V-Act leverages both GPT-4V(ision) and Set-of-Mark Prompting, together with a tailored auto-labeler. This auto-labeler assigns a unique numerical ID to each interactable UI element. By incorporating a task and a screenshot as input, GPT-4V-Act can deduce the subsequent action required to accomplish a task. For mouse/keyboard output, it can refer to the numerical labels for exact pixel coordinates.
  - https://openai.com/research/gpt-4v-system-card
    - > GPT-4V(ision)
- https://github.com/Jiayi-Pan/GPT-V-on-Web
  - > 👀🧠 GPT-4 Vision x 💪⌨️ Vimium = Autonomous Web Agent
  - > This project leverages GPT4V to create an autonomous / interactive web agent. The action space are discretized by Vimium.
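To make the core idea concrete, here is a minimal sketch of the "segment and number the regions" step using the segment-anything API (the checkpoint path, filenames, and drawing style are my own assumptions, not the SoM repo's actual pipeline):

```python
# Minimal Set-of-Mark-style sketch (assumed paths / drawing style, not the SoM
# repo's exact pipeline): segment the screenshot into regions with SAM, then
# overlay a numbered mark on each region before sending the image to GPT-4V.
import cv2
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

image = cv2.cvtColor(cv2.imread("screenshot.png"), cv2.COLOR_BGR2RGB)

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
masks = SamAutomaticMaskGenerator(sam).generate(image)

marked = image.copy()
centres = {}
for i, m in enumerate(sorted(masks, key=lambda m: -m["area"]), start=1):
    x, y, w, h = map(int, m["bbox"])            # XYWH bounding box of the region
    centres[i] = (x + w // 2, y + h // 2)       # mark number -> pixel coordinates
    cv2.rectangle(marked, (x, y), (x + w, y + h), (255, 0, 0), 2)
    cv2.putText(marked, str(i), (x + 4, y + 18),
                cv2.FONT_HERSHEY_SIMPLEX, 0.6, (255, 0, 0), 2)

cv2.imwrite("screenshot_marked.png", cv2.cvtColor(marked, cv2.COLOR_RGB2BGR))
```

GPT-4V would then be prompted to answer in terms of the mark numbers (e.g. "click 7"), and the agent maps that number back to `centres[7]` instead of asking the model to estimate raw XY coordinates.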
See also:
- https://github.com/OthersideAI/self-operating-computer/issues/31
I am trying to implement SoM, since it seems to have the best accuracy.
@Daisuke134 interested to see what you find. I'm going to go learn more about SoM.
@0xdevalias I read up more on SoM. It looks like a very promising approach, thank you for opening this issue!
https://github.com/microsoft/SoM
> I read up more on SoM. It looks like a very promising approach, thank you for opening this issue!
@joshbickett No worries :)
I have been testing out SoM and it seems pretty good. Here is the screenshot. I will try adding this today, test it, and make a PR.
I am implementing SoM now, and it seems like the best way is to add another mode, like `som-mode`, with a new prompt for that mode.
@Daisuke134 @0xdevalias Set-of-Mark prompting is now available. Swap in your best.pt from a YOLOv8 model and see how it performs!
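For anyone wiring this up, a rough sketch (assumed filenames and drawing details, not necessarily the project's actual implementation) of how a custom `best.pt` YOLOv8 checkpoint could be used to auto-label interactable UI elements for SoM-style prompting:

```python
# Rough sketch only (assumed filenames / drawing details): run a fine-tuned
# YOLOv8 UI-element detector, number each detection, and keep a map from mark
# number to the element's centre so a "click 3" answer can become a real click.
from ultralytics import YOLO  # pip install ultralytics
from PIL import Image, ImageDraw

model = YOLO("best.pt")                      # custom-trained UI-element detector
results = model("screenshot.png")            # inference on a screenshot

image = Image.open("screenshot.png").convert("RGB")
draw = ImageDraw.Draw(image)

centres = {}
for i, box in enumerate(results[0].boxes.xyxy.tolist(), start=1):
    x1, y1, x2, y2 = box
    centres[i] = ((x1 + x2) / 2, (y1 + y2) / 2)       # mark number -> click point
    draw.rectangle([x1, y1, x2, y2], outline="red", width=2)
    draw.text((x1 + 3, y1 + 3), str(i), fill="red")   # numeric Set-of-Mark label

image.save("screenshot_som.png")
print(centres)  # e.g. {1: (412.0, 96.5), 2: (980.0, 540.0), ...}
```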