self-operating-computer
Object detection
Maybe a YOLO object detection model trained on basic UI elements to get coordinates? Or something like SAM?
I mean, as soon as there is a small model, GPT-4 can check the dataset and add more correct examples to the training set.
@admineral This is an interesting idea!
Sounds awesome. We could look at using this: https://github.com/luca-medeiros/lang-segment-anything
@admineral Yeah sorry, not sure why I closed this. I must've been tired. This is a good idea!
@KBB99 Looking at this now and it looks really interesting. I hadn't considered a segmentation model before
Sounds good. I'll take a stab at implementing Lang Segment Anything and make a pull request if it seems promising.
Have a long way to go, but I was able to get it integrated. I need to make sure the calculations I'm performing are correct, as well as modify the ratio conversion. Right now it's incredibly slow as I'm running it locally, but I'll try to host an endpoint backed by a GPU to speed it up. Later I'd like to explore an RL mechanism similar to what @admineral mentioned to improve pixel-coordinate estimation.
If anyone wants to continue working on it, feel free to clone the fork! Just a note: you need to use Python 3.9, and the Lang model takes over 4GB of space.
@KBB99 this sounds promising, interested to see more as you make progress!
@Daisuke134 I would be interested to get your input as I saw you are working on something similar with Set-of-Marks. I have used LangSam to more accurately mask the objects, then I ask GPT-4-V a follow-up question combining the masks with marks and asking it to specify which object to click on, then parse GPT-4-V output, and finally click on the center of the object. Here is the code: https://github.com/OthersideAI/self-operating-computer/compare/main...KBB99:self-operating-computer:main .
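For readers following along, here is a rough sketch of that flow, assuming the lang_sam package from the repository linked above and pyautogui for the click; the function name, the scaling step, and the choice of the first returned box are illustrative rather than the fork's actual code.

from PIL import Image
import pyautogui
from lang_sam import LangSAM

def click_described_object(screenshot_path, text_prompt):
    # Segment whatever the text prompt describes in the screenshot.
    image = Image.open(screenshot_path).convert("RGB")
    model = LangSAM()
    masks, boxes, phrases, logits = model.predict(image, text_prompt)
    if len(boxes) == 0:
        return None  # nothing matched, so there is nothing to click
    # Use the first returned box and click its center (a real implementation might pick by logit).
    x0, y0, x1, y1 = boxes[0].tolist()
    center_x, center_y = (x0 + x1) / 2, (y0 + y1) / 2
    # The screenshot can be larger than the logical screen (e.g. Retina displays),
    # so convert image pixels to screen coordinates before clicking.
    screen_w, screen_h = pyautogui.size()
    pyautogui.click(center_x * screen_w / image.width, center_y * screen_h / image.height)
    return phrases[0]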
@joshbickett how do you suggest building this into the project? operate --click_mode=lang-som ?
@KBB99 Thank you! I was trying to implement SoM, but the accuracy of the labels for screenshots was quite low, so I was looking into other ways of solving the problem.
I was trying to implement SoM with a "som-mode", adding the prompt into VISION_PROMPT, and running "operate -som" to activate SoM mode.
I will check and try your code. Thank you so much! It would be great if you could make a PR!!
@KBB99 Looks great! I was looking into your code, and here are some thoughts I had.
・I think you are using summary_prompt to make GPT-4 respond with the label for the next action, but this is probably not related to SUMMARY_PROMPT, right? I found it a bit confusing. Maybe changing the names for clarity would be better (e.g., sam_prompt).
・Question: Are you using summary_prompt after VISION_PROMPT?
・What do you think about combining summary_prompt and VISION_PROMPT? First, we could give GPT-4 a screenshot with SAM masks (instead of a grid) and ask it to provide the specific label (CLICK {{ "label": "C", "description"... ). This way, we wouldn't need to request GPT-4 twice, reducing operation time.
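As a sketch of what that single combined request could look like (using the OpenAI vision message format; the prompt text, model name, and function name here are illustrative and not taken from the project):

import base64
from openai import OpenAI

client = OpenAI()

def ask_for_label(som_screenshot_path, objective):
    # One request: send the SoM screenshot plus a prompt that asks directly for the label to click.
    with open(som_screenshot_path, "rb") as f:
        img_base64 = base64.b64encode(f.read()).decode("utf-8")
    prompt = (
        f"Objective: {objective}\n"
        "The screenshot is annotated with numbered labels. "
        'Respond with CLICK {{ "label": ..., "description": ..., "reason": ... }}'
    )
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_base64}"}},
                ],
            }
        ],
        max_tokens=300,
    )
    return response.choices[0].message.content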
Here's how I wrote it by creating a new mode called som_mode:
- Use the segmented image if using som_mode, else use the grid image as before.
def get_next_action_from_openai(messages, objective, accurate_mode, som_mode):
    if som_mode:
        try:
            # Generate a SoM (masked) screenshot instead of the grid overlay.
            som_screenshot_filename = os.path.join(screenshots_dir, "screenshot_som.png")
            generate_sam_masks(screenshot_filename, som_screenshot_filename)
            img_file_to_use = som_screenshot_filename
        except Exception as e:
            if DEBUG:
                print(f"Error in SoM processing: {e}")
    else:
        # Fall back to the existing grid overlay.
        grid_screenshot_filename = os.path.join(screenshots_dir, "screenshot_with_grid.png")
        add_grid_to_image(screenshot_filename, grid_screenshot_filename, 500)
        img_file_to_use = grid_screenshot_filename
    time.sleep(1)
    with open(img_file_to_use, "rb") as img_file:
        img_base64 = base64.b64encode(img_file.read()).decode("utf-8")
- By using the code below, I can activate som_mode with "operate -som".
def main_entry():
    # Add SoM to image
    parser.add_argument(
        "-som",
        help="Activate SOM Mode",
        action="store_true",
        required=False,
    )
    try:
        args = parser.parse_args()
        main(
            args.model,
            accurate_mode=args.accurate,
            terminal_prompt=args.prompt,
            som_mode=args.som,
            voice_mode=args.voice,
        )
- Add labels to VISION_PROMPT.
1. CLICK
Response:
- For a screenshot with a grid:
CLICK {{ "x": "percent", "y": "percent", "description": "~description here~", "reason": "~reason here~" }}
Note: The percents work where the top left corner is "x": "0%" and "y": "0%" and the bottom right corner is "x": "100%" and "y": "100%"
- For a screenshot with numbered labels:
CLICK {{ "label": "number", "description": "~description here~", "reason": "~reason here~" }}
Note: Use the number that is labelled on the desired element. If the targeted area is not labelled, revert to the grid format.
Here are examples of how to respond.
Objective: Log in to the account
CLICK {{ "label": "C", "description": "Click on the button labelled 'C', which is the 'Login' button", "reason": "C is identified as the 'Login' button, and clicking it will initiate the login process." }}
※I think that adding a new sam-mode is better, since we can check the accuracy difference between the conventional method and the SAM method. However, I am not sure if we should make a new prompt like SAM_PROMPT, or integrate it into the existing VISION_PROMPT.
Thank you so much🙇♀️ Let me know what you think.
@KBB99 excited to look closer. Let me check out the code and provide some input in the next few days
@KBB99 I did git checkout on your fork. I am encountering the following issue:
TypeError: mouse_click() missing 2 required positional arguments: 'client' and 'messages'
It looks like mouse_click was expecting additional arguments:
if action_type == "SEARCH":
function_response = search(action_detail)
elif action_type == "TYPE":
function_response = keyboard_type(action_detail)
elif action_type == "CLICK":
function_response = mouse_click(action_detail)
else:
...
def mouse_click(click_detail, client, messages):
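A minimal adjustment for that mismatch, assuming client and messages are in scope at the dispatch site, would be to pass them through:

elif action_type == "CLICK":
    function_response = mouse_click(action_detail, client, messages)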
After adjusting that in the code so that mouse_click doesn't encounter that error, I got the following result, but didn't see the mouse move or click:
[Self-Operating Computer] [Act] CLICK {'visual_description': "First link in the list of articles, titled 'Non-interactive SSH password authentication'", 'reason': 'To open the top article as requested'}
I am interested in this lang-som approach, but it looks like it may need more work; let me know if there's something I was doing wrong. I'd love to try out a working version.
Hey @joshbickett. Yes, I made some changes to clean up the code before pushing and must have broken something accidentally; I will take a look and fix any glitches. Thanks for checking it out!
@KBB99 Hi. I was trying to test out your code, but had some issues installing lang_sam. I am using Python 3.12 and tried to install torch and torchvision but was not able to.
We need to run pip install torch torchvision and then pip install -U git+https://github.com/luca-medeiros/lang-segment-anything.git to run the project, right?
@Daisuke134 I was able to install with Python 3.9 and the commands you mentioned.
@joshbickett I updated the incorrect arguments you mentioned, so that is fixed. It is moving and clicking for me, so I suspect that for the generated visual description, "First link in the list of articles, titled 'Non-interactive SSH password authentication'", lang-sam did not identify the object and hence couldn't click on anything. For me it sometimes segments and masks the objects correctly, but other times not. I need to experiment more and improve the prompts as well as the marks.
@Daisuke134 the commands you mentioned are correct, assuming you start with a Python 3.9 venv (to do so, set up the venv by running python3.9 -m venv venv). Also, don't forget to pull the latest changes that fix the arguments bug Josh mentioned.
I will do some more tests, improvements, and then make a pull request tomorrow integrating the suggestions @Daisuke134 mentioned.
@joshbickett @KBB99 Thank you. I could run the code using 3.9.18. However, when I ran operate with the objective "Go to youtube.com and play some holiday music", there was an error saying:
Error parsing JSON: 'ascii' codec can't encode character '\u2018' in position 7: ordinal not in range(128) [Self-Operating Computer][Error] something went wrong :( [Self-Operating Computer][Error] AI response Failed take action after looking at the screenshot
This is probably occurring because a curly quote character (') is included when the JSON is parsed. Any idea how to solve the problem?
You only made changes in main.py, right?
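One possible workaround for that error, sketched here rather than taken from the fork, is to normalize curly quotes to plain ASCII quotes before parsing the model's response; the function name is illustrative.

import json

def parse_model_json(raw_response):
    # Replace common curly quotes (U+2018/U+2019/U+201C/U+201D) with ASCII quotes
    # so that ASCII encoding and json.loads don't choke on them.
    cleaned = (
        raw_response.replace("\u2018", "'")
        .replace("\u2019", "'")
        .replace("\u201c", '"')
        .replace("\u201d", '"')
    )
    return json.loads(cleaned)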
Any idea how to solve the problem?
Sounds like you were able to resolve this issue, but I wanted to mention that I had to run pip install torchvision separately, before lang-segment-anything.
So I've managed to fix the clicking errors as well as add a configuration that saves the screenshots, masked screenshots, prompt, and predicted coordinates. I'll run more tests and keep expanding the dataset to test the segmentation model individually, and then make further changes. Sometimes lang-sam works very well and segments the targeted object perfectly, making clicks extremely accurate; other times it does not.
@joshbickett are you aware of any dataset of screenshots with prompt and coordinates we could use to evaluate different approaches?
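On the evaluation question, one simple way to accumulate such a dataset, sketched here with illustrative field names, is to append one JSON record per action:

import json
import os
import time

def log_click_sample(log_path, screenshot_path, masked_path, prompt, predicted_xy):
    # Append one record per action so the segmentation step can be evaluated offline later.
    record = {
        "timestamp": time.time(),
        "screenshot": screenshot_path,
        "masked_screenshot": masked_path,
        "prompt": prompt,
        "predicted_x": predicted_xy[0],
        "predicted_y": predicted_xy[1],
    }
    log_dir = os.path.dirname(log_path)
    if log_dir:
        os.makedirs(log_dir, exist_ok=True)
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")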
run pip install torchvision separately before lang-segment-anything
Meaning doing "pip install torchvision" before "pip install -U git+https://github.com/luca-medeiros/lang-segment-anything.git"?
Also, I am still having the JSON parsing error. I will check the code again.
@Daisuke134 I added a section to the README.md; try following the one that uses conda for version management.
@admineral @KBB99 Set-of-Mark prompting is now available. You can swap in any best.pt from a YOLOv8 model and see how it performs. I'd love it if the community iterated on what I built. My best.pt could be improved, but the structure is now in place to improve upon.
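For anyone who wants to try swapping in a different best.pt, the generic Ultralytics YOLOv8 inference pattern looks roughly like this (a sketch of the library usage, not the exact code path inside self-operating-computer; the file paths are placeholders):

from ultralytics import YOLO

# Load a custom-trained detector; replace the path with your own best.pt.
model = YOLO("path/to/best.pt")

# Run detection on a screenshot and print the labeled boxes that would be marked on the image.
results = model("screenshot.png")
for box in results[0].boxes:
    x0, y0, x1, y1 = box.xyxy[0].tolist()
    class_name = model.names[int(box.cls[0])]
    confidence = float(box.conf[0])
    print(class_name, confidence, (x0, y0, x1, y1))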
@joshbickett I've played around with the SoM you implemented, but for some reason GPT-4V seems to pick the wrong marks and still doesn't click on the correct object. I think some additional context, like the text, could help GPT-4V pick the correct object better. I've been experimenting with Amazon Textract Layout and it seems pretty solid, capturing text and the layout. Take a look at the screenshot. I'll test it with self-operating-computer and let you know how it goes.
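For context, pulling text and layout out of a screenshot with Amazon Textract looks roughly like this, assuming boto3 credentials are already configured; the block filtering below is illustrative:

import boto3

def extract_layout(screenshot_path):
    # Ask Textract for text plus layout information for the screenshot.
    textract = boto3.client("textract")
    with open(screenshot_path, "rb") as f:
        response = textract.analyze_document(
            Document={"Bytes": f.read()},
            FeatureTypes=["LAYOUT"],
        )
    # Keep the detected lines of text with their bounding boxes; these could be passed
    # to the vision model as extra context alongside the marks.
    lines = []
    for block in response["Blocks"]:
        if block["BlockType"] == "LINE":
            lines.append((block["Text"], block["Geometry"]["BoundingBox"]))
    return lines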
@KBB99 sounds good. If it improves performance it'd be great to get a PR
@KBB99 I'll close this PR now that we're using the OCR method as default. If anyone comes up with an improved YOLO model or SoM technique, feel free to open a new ticket or a PR!
Please check out YOLO-World: https://blog.roboflow.com/what-is-yolo-world/
https://huggingface.co/spaces/stevengrove/YOLO-World?ref=blog.roboflow.com
https://github.com/AILAB-CVC/YOLO-World?ref=blog.roboflow.com
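For reference, Ultralytics ships an open-vocabulary YOLO-World wrapper; a minimal sketch (the weight file name and class list below are illustrative) would look like:

from ultralytics import YOLOWorld

# Open-vocabulary detection: describe the UI elements to look for in plain text.
model = YOLOWorld("yolov8s-world.pt")
model.set_classes(["button", "text field", "link", "icon"])

results = model.predict("screenshot.png")
for box in results[0].boxes:
    print(results[0].names[int(box.cls[0])], box.xyxy[0].tolist())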