add support for gemini 2.5 cua

Open gswangg opened this issue 2 months ago • 1 comments

🤖 This PR adds support for Google's Gemini 2.5 Computer Use model as a new engine option (gemini-cua) in Skyvern, enabling users to leverage Google's latest computer vision and automation capabilities alongside existing OpenAI and Anthropic CUA engines. The implementation includes comprehensive integration across the entire stack from API definitions to frontend UI components.

🔍 Detailed Analysis

Key Changes

New Engine Integration: Added gemini-cua as a new RunEngine and RunType throughout the codebase, including API schemas, database models, and client types
Gemini Client Setup: Integrated Google's google-genai library (v1.43.0) with proper client initialization and configuration using GEMINI_CUA_MODEL setting
Action Parsing System: Implemented comprehensive Gemini-specific action parsing in parse_gemini_cua_actions() that handles computer use function calls and converts them to Skyvern actions
Computer Use State Management: Created GeminiComputerUseState class to maintain conversation history and function call context across agent steps
Frontend Support: Added Gemini CUA option to the engine selector UI component for user selection
New Action Types: Extended action system with NAVIGATE, GO_BACK, GO_FORWARD actions and corresponding handlers for browser navigation

Technical Implementation

sequenceDiagram
    participant User
    participant API
    participant Agent
    participant Gemini
    participant Browser
    
    User->>API: Create task with gemini-cua engine
    API->>Agent: Initialize with GeminiComputerUseState
    Agent->>Gemini: Send screenshot + conversation history
    Gemini->>Agent: Return function calls (click_at, type_text_at, etc.)
    Agent->>Agent: Parse function calls to Skyvern actions
    Agent->>Browser: Execute actions (click, type, navigate)
    Browser->>Agent: Return results + new screenshot
    Agent->>Gemini: Continue conversation with results

Impact

Enhanced Model Options: Users can now choose from OpenAI, Anthropic, or Google's computer use models based on their specific needs and preferences
Improved Browser Navigation: New navigation actions (navigate, go_back, go_forward) provide better browser control capabilities
Robust Action Mapping: Comprehensive mapping from Gemini's computer use functions to Skyvern's action system ensures reliable automation
Scalable Architecture: The implementation follows existing patterns, making it easy to add future computer use models
Dependency Updates: Upgraded websockets library to support newer versions and added Google GenAI dependency

Created with Palmier

Oct 13 '25 02:10 gswangg

[!IMPORTANT]

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

✨ Finishing touches

🧪 Generate unit tests (beta)

[ ] Create PR with unit tests
[ ] Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Oct 13 '25 02:10 coderabbitai[bot]