[Agent] Support browser control via screenshots

Open xingyaoww opened this issue 1 year ago • 14 comments

What problem or use case are you trying to solve?

Implement a tool similar to the computer tool and allow it to control the browser directly.

Describe the UX of the solution you'd like

Do you have thoughts on the technical implementation?

Describe alternatives you've considered

Additional context

xingyaoww avatar Oct 25 '24 18:10 xingyaoww

Here is what you need: #4581

x66ccff avatar Oct 27 '24 01:10 x66ccff

@ryanhoangt Can you also self-assign this one?

xingyaoww avatar Nov 22 '24 19:11 xingyaoww

Why not just use computer use directly?

ryx2 avatar Nov 22 '24 22:11 ryx2

I think we may want to first try computer use as another way to implement the browsing capability; the current implementation relies on text-based observations only. If it gives a performance boost, we can make it the default (or the default for Claude) and fall back to the old browsing implementation for other models. Not sure if the team has other ideas to share on this.
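A minimal sketch of that default-plus-fallback idea, just to make it concrete. The function name, the mode strings, and the substring check on the model name are all hypothetical and are not part of the OpenHands codebase:

```python
# Hypothetical sketch: choose a browsing strategy from the model name.
# "computer_use" and "text" are illustrative labels, not real OpenHands settings.

def pick_browsing_mode(model_name: str) -> str:
    """Return the screenshot-based mode for Claude models, text-based otherwise."""
    if "claude" in model_name.lower():
        # Assumption: only Claude models are routed to screenshot-based control.
        return "computer_use"
    # Every other model falls back to the existing text-based browsing.
    return "text"


if __name__ == "__main__":
    for name in ["anthropic/claude-3.5-sonnet", "gpt-4o"]:
        print(name, "->", pick_browsing_mode(name))
```

Matching on the model name is only the simplest possible rule; an explicit config option would probably be more robust.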

ryanhoangt avatar Nov 23 '24 03:11 ryanhoangt

Hey @ryx2 - I think that's a good idea, and I've been discussing with @ryanhoangt making computer control the next low-hanging fruit we could pursue to improve the browsing experience (at least when using Claude).

xingyaoww avatar Nov 23 '24 05:11 xingyaoww

Computer use can become extremely expensive when screenshots (images) are used, compared to a text-only approach. I just took a look at OpenRouter, for example (edit: prices are in $ per 1K images):

- Sonnet-3.5: $4.8
- Gpt-4o: $3.613
- Gpt-4o-mini: $7.225
- Gemini Flash 1.5: $0.04
- Gemini Pro 1.5: $0.675

It might be helpful if the preferred vision model could be defined somehow then.
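To put the per-1K prices in perspective, here is a rough sketch of the arithmetic. The 50-screenshot session size is an assumption made up for illustration, and the calculation ignores text tokens and any other charges:

```python
# Prices in USD per 1,000 images, as quoted above from OpenRouter.
PRICE_PER_1K_IMAGES = {
    "Sonnet-3.5": 4.8,
    "Gpt-4o": 3.613,
    "Gpt-4o-mini": 7.225,
    "Gemini Flash 1.5": 0.04,
    "Gemini Pro 1.5": 0.675,
}


def screenshot_cost(model: str, num_screenshots: int) -> float:
    """Estimated image-only cost in USD for a number of screenshots."""
    return PRICE_PER_1K_IMAGES[model] * num_screenshots / 1000


if __name__ == "__main__":
    session_size = 50  # hypothetical number of screenshots in one browsing task
    for model in PRICE_PER_1K_IMAGES:
        print(f"{model}: ${screenshot_cost(model, session_size):.4f}")
```

Under these assumptions, a 50-screenshot session on Sonnet-3.5 comes to roughly $0.24 in image costs.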

tobitege avatar Nov 23 '24 06:11 tobitege

I looked into OpenRouter pricing, and it seems like it's $4.8 / 1K images for Sonnet-3.5 🤔

ryanhoangt avatar Nov 23 '24 06:11 ryanhoangt

Hmm, do you have a link to where it says per "1K"?

tobitege avatar Nov 23 '24 06:11 tobitege

This one: https://openrouter.ai/anthropic/claude-3.5-sonnet

ryanhoangt avatar Nov 23 '24 06:11 ryanhoangt

Ohh, you're right, I missed that notation.

tobitege avatar Nov 23 '24 06:11 tobitege

Qwen2.5-VL is good if you guys are concerned about the price. It beats the older version of GPT-4o and Sonnet 3.5.

See https://qwenlm.github.io/blog/qwen2-vl/

It can be self-hosted.

x66ccff avatar Nov 23 '24 12:11 x66ccff

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Dec 26 '24 01:12 github-actions[bot]

Reference implementation: https://github.com/invariantlabs-ai/playwright-computer-use

ryanhoangt avatar Feb 03 '25 22:02 ryanhoangt

I believe this is already somewhat completed in https://github.com/All-Hands-AI/OpenHands/pull/6464

xingyaoww avatar Apr 07 '25 01:04 xingyaoww