Implement model cursor for visual feedback

Open abrichr opened this issue 1 year ago • 23 comments

Feature request

Update: see https://github.com/OpenAdaptAI/OpenAdapt/issues/760#issuecomment-2347337901 for the latest requirements.

We want to be able to give the model the ability to:

  1. paint a red dot on its suggested target location,
  2. look at the screenshot with the dot on it, and
  3. optionally self-correct.

Thank you @LunjunZhang for the suggestion 🙏

This involves creating a CursorReplayStrategy (based on the VanillaReplayStrategy) that implements the required prompting.
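
The paint, look, and self-correct loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not OpenAdapt's implementation: the screenshot is modeled as a plain grid of RGB tuples rather than a real image type, and `paint_dot`, `locate_with_cursor`, `propose`, and `review` are hypothetical names. `propose` and `review` stand in for model calls and are not part of VanillaReplayStrategy.

```python
from typing import Callable, List, Optional, Tuple

Point = Tuple[int, int]            # (x, y) in pixel coordinates
Pixel = Tuple[int, int, int]       # (R, G, B)
Image = List[List[Pixel]]          # rows of pixels; stand-in for a real screenshot type

RED: Pixel = (255, 0, 0)

# Hypothetical model interfaces (assumptions, not an existing API):
# propose(screenshot) returns the model's suggested target location.
# review(marked_screenshot, target) returns None to accept the target,
# or a corrected point if the dot looks misplaced.
ProposeFn = Callable[[Image], Point]
ReviewFn = Callable[[Image, Point], Optional[Point]]


def paint_dot(image: Image, center: Point, radius: int = 3) -> Image:
    """Return a copy of `image` with a filled red circle at `center`."""
    cx, cy = center
    marked = [row[:] for row in image]  # copy rows so the input is untouched
    for y in range(max(0, cy - radius), min(len(marked), cy + radius + 1)):
        for x in range(max(0, cx - radius), min(len(marked[y]), cx + radius + 1)):
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                marked[y][x] = RED
    return marked


def locate_with_cursor(
    screenshot: Image,
    propose: ProposeFn,
    review: ReviewFn,
    max_rounds: int = 3,
) -> Point:
    """Propose a target, show the model its own dot, and let it correct itself."""
    target = propose(screenshot)
    for _ in range(max_rounds):
        marked = paint_dot(screenshot, target)
        correction = review(marked, target)
        if correction is None or correction == target:
            break  # the model accepted its own suggestion
        target = correction
    return target
```

Capping the loop with `max_rounds` keeps the self-correction step optional and bounded. In a real strategy the dot would presumably be drawn with an imaging library and the two callables would wrap prompts to the model.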

Motivation

Correct errors, e.g. missed segmentations.

Possibly related: https://arxiv.org/abs/2406.09403:

Humans draw to facilitate reasoning: we draw auxiliary lines when solving geometry problems; we mark and circle when reasoning on maps; we use sketches to amplify our ideas and relieve our limited-capacity working memory. However, such actions are missing in current multimodal language models (LMs). Current chain-of-thought and tool-use paradigms only use text as intermediate reasoning steps. In this work, we introduce Sketchpad, a framework that gives multimodal LMs a visual sketchpad and tools to draw on the sketchpad. The LM conducts planning and reasoning according to the visual artifacts it has drawn. ... Sketchpad substantially improves performance on all tasks over strong base models with no sketching, yielding an average gain of 12.7% on math tasks, and 8.6% on vision tasks. GPT-4o with Sketchpad sets a new state of the art on all tasks, including V*Bench (80.3%), BLINK spatial reasoning (83.9%), and visual correspondence (80.8%). All codes and data are in this https URL.

abrichr avatar Jun 16 '24 12:06 abrichr

/bounty $1000

abrichr avatar Jun 17 '24 00:06 abrichr

💎 $1,000 bounty • OpenAdaptAI

Steps to solve:

  1. Start working: Comment /attempt #760 with your implementation plan
  2. Submit work: Create a pull request including /claim #760 in the PR body to claim the bounty
  3. Receive payment: 100% of the bounty is received 2-5 days post-reward. Make sure you are eligible for payouts

❗ Important guidelines:

  • To claim a bounty, you need to provide a short demo video of your changes in your pull request
  • If anything is unclear, ask for clarification before starting as this will help avoid potential rework
  • Low quality AI PRs will not receive review and will be closed
  • Do not ask to be assigned unless you've contributed before

Thank you for contributing to OpenAdaptAI/OpenAdapt!

Attempt Started (UTC) Solution Actions
🟢 @blocator23 Aug 01, 2025, 08:21:18 AM #956 Reward
🔴 @Ahmadkhan02 Jul 02, 2024, 08:09:09 PM WIP
🟢 @onyedikachi-david Jul 04, 2024, 10:10:18 AM #823 Reward
🟢 @varshith257 Jul 04, 2024, 08:27:40 PM WIP
🟢 @stdthoth Sep 12, 2024, 08:37:31 PM WIP
🟢 @Amanullah1002 Jun 17, 2024, 03:18:43 AM WIP
🔴 @Subh231004 Jun 17, 2024, 06:29:42 AM WIP
🔴 @ Jun 17, 2024, 06:31:46 AM WIP
🟢 @hoklims Nov 19, 2024, 04:01:52 PM #923 Reward
🟢 @MAVRICK-1 Aug 19, 2025, 06:34:52 PM WIP
🟢 @TanCodeX May 29, 2025, 09:25:19 AM #952 Reward

algora-pbc[bot] avatar Jun 17 '24 00:06 algora-pbc[bot]

/attempt #760

Subh231004 avatar Jun 17 '24 06:06 Subh231004

/attempt #760

Implementation Plan for Model Cursor Feedback (Issue #760)

  1. Create CursorReplayStrategy: I'll develop a new CursorReplayStrategy class extending VanillaReplayStrategy.
  2. Paint Red Dot: I'll implement a method to paint a red dot on the target location within a given image.
  3. Screenshot Capture: I'll implement a method to capture a screenshot and overlay the red dot on it.
  4. Self-Correction: I'll add an optional self-correction mechanism based on the screenshot with the dot.
  5. Testing: I'll write and execute unit tests to ensure the functionality works as intended.
  6. Documentation: I'll update the project documentation to include usage instructions for the new strategy.
  7. Pull Request: I'll submit a PR for review, incorporating any feedback provided.

This plan will systematically address the issue by creating a targeted strategy, ensuring it functions correctly, and updating the documentation for users.

Anshgrover23 avatar Jun 17 '24 06:06 Anshgrover23

@Subh231004 please keep the discussion related to your pull request on your pull request and not here. I have replied to your comment there.

abrichr avatar Jun 20 '24 13:06 abrichr

/attempt #760

Algora profile: @onyedikachi-david
Completed bounties: 2 bounties from 1 project
Tech: JavaScript, Shell
Active attempts: #764

onyedikachi-david avatar Jun 25 '24 15:06 onyedikachi-david

/attempt #760

Algora profile: @Ahmadkhan02
Completed bounties: 1 bounty from 1 project
Tech: TypeScript, Jupyter Notebook

Ahmadkhan02 avatar Jul 02 '24 20:07 Ahmadkhan02

💡 @onyedikachi-david submitted a pull request that claims the bounty. You can visit your bounty board to reward.

algora-pbc[bot] avatar Jul 04 '24 10:07 algora-pbc[bot]

/attempt #760

Algora profile: @varshith257
Completed bounties: 4 bounties from 2 projects
Tech: Python, Rust, TypeScript, Go

varshith257 avatar Jul 04 '24 20:07 varshith257

Hi @abrichr, is this still available?

stdthoth avatar Sep 12 '24 18:09 stdthoth

Hi @stdthoth, thanks for your interest.

We attempted a few different approaches at https://github.com/OpenAdaptAI/OpenAdapt/pull/867. It is available if you can implement a different approach that improves on the performance of any of these!

abrichr avatar Sep 12 '24 19:09 abrichr

/attempt #760

stdthoth avatar Sep 12 '24 20:09 stdthoth

@abrichr I am working on it now. Could you possibly assign this to me for a week?

stdthoth avatar Sep 12 '24 20:09 stdthoth

Hi @stdthoth, thank you! Can you please clarify your request?

I just updated the description to include more details about the current approaches, recreated here:

I believe the next step here is to systematically evaluate the performance of these in a repeatable way (e.g. programmatically). Only then will we be able to implement the requirement to:

implement a different approach that improves on the performance of any of these

Please let me know if you have any questions!

Edit: if you prefer, you can also implement a novel approach, without evaluating these ones. But we will be unable to award the bounty until we can confirm that your approach outperforms all of these.
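
One way such a repeatable, programmatic comparison could look is sketched below. It is an illustration under stated assumptions, not an existing OpenAdapt harness: `evaluate`, the case format, and both metrics (hit rate inside a labeled bounding box, mean distance to the box center) are hypothetical choices, and a strategy is modeled as a callable that maps a case identifier to a predicted (x, y) point.

```python
import math
from typing import Callable, Dict, Sequence, Tuple

Point = Tuple[float, float]               # predicted (x, y)
Box = Tuple[float, float, float, float]   # labeled target: (left, top, right, bottom)
Case = Tuple[str, Box]                    # (case identifier, ground-truth box)


def evaluate(strategy: Callable[[str], Point], cases: Sequence[Case]) -> Dict[str, float]:
    """Score one strategy on labeled cases.

    Returns the fraction of predictions that land inside the labeled box
    and the mean Euclidean distance from each prediction to the box center.
    """
    if not cases:
        raise ValueError("at least one labeled case is required")
    hits = 0
    total_distance = 0.0
    for case_id, (left, top, right, bottom) in cases:
        x, y = strategy(case_id)
        if left <= x <= right and top <= y <= bottom:
            hits += 1
        center_x, center_y = (left + right) / 2, (top + bottom) / 2
        total_distance += math.hypot(x - center_x, y - center_y)
    count = len(cases)
    return {
        "hit_rate": hits / count,
        "mean_center_distance": total_distance / count,
    }
```

Running every candidate approach through the same function on the same labeled cases would give directly comparable numbers, which is what a claim of outperforming the existing approaches needs.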

Edit: https://visualsketchpad.github.io/ may perform very well.

abrichr avatar Sep 12 '24 22:09 abrichr

/attempt https://github.com/OpenAdaptAI/OpenAdapt/issues/760

hoklims avatar Nov 19 '24 14:11 hoklims

💡 @hoklims submitted a pull request that claims the bounty. You can visit your bounty board to reward.

algora-pbc[bot] avatar Nov 19 '24 16:11 algora-pbc[bot]

Hi, is this issue still open?

Girma35 avatar Jan 13 '25 11:01 Girma35

Hey @abrichr, is this issue still available? If it is, I would like to give it a try according to the updated requirements in https://github.com/OpenAdaptAI/OpenAdapt/issues/760#issuecomment-2347337901. Should I go directly to a PR with my approach, or should we discuss the approach first?

neoandmatrix avatar May 17 '25 05:05 neoandmatrix

@neoandmatrix thank you for your interest!

Please propose an approach first. Note that it should be distinct from those already implemented.

abrichr avatar May 23 '25 20:05 abrichr

/attempt #760

TanCodeX avatar May 29 '25 09:05 TanCodeX

/attempt #760

blocator23 avatar Aug 01 '25 08:08 blocator23

I'll develop source code based on a classification model and a Python library in order to:

  • Classify the type of image files and visual expression.
  • Measure color, size, brightness, and edits.
  • Use machine learning to support analysis and predictions.

blocator23 avatar Aug 03 '25 10:08 blocator23

/attempt #760

MAVRICK-1 avatar Aug 19 '25 18:08 MAVRICK-1