[Agent] Support browser control via screenshots
What problem or use case are you trying to solve?
Implement a tool similar to the computer tool & allow it to control the browser directly.
Describe the UX of the solution you'd like
Do you have thoughts on the technical implementation?
Describe alternatives you've considered
Additional context
Here is what you need: #4581
@ryanhoangt Can you also self-assign this one?
Why not just use computer use directly?
I think we may want to first try computer use as another way to implement browsing capability -- currently it's based on text-based observations only. If it gives a performance boost, we can make it the default (or the default for Claude) and fall back to the old browsing implementation for other models. Not sure if the team has other ideas to share on this.
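For illustration, the fallback idea could be sketched roughly as below. This is only a minimal sketch: the backend names and the `pick_browsing_backend` helper are hypothetical, not existing OpenHands identifiers, and matching on the substring "claude" is an assumption about how models might be identified.

```python
# Hypothetical backend labels, used only for this sketch.
SCREENSHOT_BACKEND = "computer_use"
TEXT_BACKEND = "text_browsing"


def pick_browsing_backend(model_name: str, prefer_screenshots: bool = True) -> str:
    """Choose a browsing backend for the given model name.

    Assumption: Claude models are detected by a case-insensitive
    substring match on the model name. Everything else falls back
    to the existing text-based browsing.
    """
    is_claude = "claude" in model_name.lower()
    if prefer_screenshots and is_claude:
        return SCREENSHOT_BACKEND
    return TEXT_BACKEND


if __name__ == "__main__":
    for name in ("anthropic/claude-3.5-sonnet", "openai/gpt-4o"):
        print(name, "->", pick_browsing_backend(name))
```

Keeping the choice in one small function would make it easy to switch the default later if screenshot-based browsing turns out to help for other models too.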
Hey @ryx2 - I think that's a good idea, and I've been discussing with @ryanhoangt making computer control the next low-hanging fruit we could pursue to improve the browsing experience (at least when using Claude).
Computer use can become extremely expensive if screenshots (images) are used, compared to a text-only approach. I just took a look at e.g. OpenRouter (edit: $/K images):
- Sonnet-3.5: $4.8
- Gpt-4o: $3.613
- Gpt-4o-mini: $7.225
- Gemini Flash 1.5: $0.04
- Gemini Pro 1.5: $0.675

It might be helpful if the preferred vision model could be defined somehow.
I looked into OpenRouter pricing and it seems like it's $4.8 / 1k images for Sonnet-3.5 🤔
Hmm, do you have a link to where it says per "1K"?
This one: https://openrouter.ai/anthropic/claude-3.5-sonnet
Ohh, you're right, I missed that notation.
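To put the per-1K-image prices quoted above in perspective, here is a rough back-of-the-envelope sketch. The prices are the ones listed in this thread. The session size of 50 screenshots is a made-up figure for illustration, and the estimate covers image input only, not text tokens.

```python
# Prices in USD per 1,000 images, as quoted earlier in this thread.
PRICE_PER_1K_IMAGES_USD = {
    "Sonnet-3.5": 4.8,
    "Gpt-4o": 3.613,
    "Gpt-4o-mini": 7.225,
    "Gemini Flash 1.5": 0.04,
    "Gemini Pro 1.5": 0.675,
}


def screenshot_cost_usd(model: str, num_screenshots: int) -> float:
    """Estimate image-input cost for a number of screenshots.

    Only the per-image price is counted; text tokens are ignored.
    """
    price_per_image = PRICE_PER_1K_IMAGES_USD[model] / 1000
    return price_per_image * num_screenshots


if __name__ == "__main__":
    # Hypothetical session: 50 screenshots per task.
    for model in PRICE_PER_1K_IMAGES_USD:
        print(f"{model}: ${screenshot_cost_usd(model, 50):.4f}")
```

At $4.8 per 1K images, a single screenshot costs roughly half a cent on Sonnet-3.5, so a 50-screenshot session comes to about $0.24 in image input.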
Qwen2.5-VL is good if you are concerned about the price. It beats the old versions of 4o and Sonnet 3.5.
See https://qwenlm.github.io/blog/qwen2-vl/
It can be self-hosted.
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
Reference implementation: https://github.com/invariantlabs-ai/playwright-computer-use
I believe this is already somewhat completed in https://github.com/All-Hands-AI/OpenHands/pull/6464