add support for gemini 2.5 cua
🤖 This PR adds support for Google's Gemini 2.5 Computer Use model as a new engine option (gemini-cua) in Skyvern, enabling users to leverage Google's latest computer vision and automation capabilities alongside existing OpenAI and Anthropic CUA engines. The implementation includes comprehensive integration across the entire stack from API definitions to frontend UI components.
🔍 Detailed Analysis
Key Changes
- New Engine Integration: Added
gemini-cuaas a newRunEngineandRunTypethroughout the codebase, including API schemas, database models, and client types - Gemini Client Setup: Integrated Google's
google-genailibrary (v1.43.0) with proper client initialization and configuration usingGEMINI_CUA_MODELsetting - Action Parsing System: Implemented comprehensive Gemini-specific action parsing in
parse_gemini_cua_actions()that handles computer use function calls and converts them to Skyvern actions - Computer Use State Management: Created
GeminiComputerUseStateclass to maintain conversation history and function call context across agent steps - Frontend Support: Added Gemini CUA option to the engine selector UI component for user selection
- New Action Types: Extended action system with
NAVIGATE,GO_BACK,GO_FORWARDactions and corresponding handlers for browser navigation
Technical Implementation
sequenceDiagram
participant User
participant API
participant Agent
participant Gemini
participant Browser
User->>API: Create task with gemini-cua engine
API->>Agent: Initialize with GeminiComputerUseState
Agent->>Gemini: Send screenshot + conversation history
Gemini->>Agent: Return function calls (click_at, type_text_at, etc.)
Agent->>Agent: Parse function calls to Skyvern actions
Agent->>Browser: Execute actions (click, type, navigate)
Browser->>Agent: Return results + new screenshot
Agent->>Gemini: Continue conversation with results
Impact
- Enhanced Model Options: Users can now choose from OpenAI, Anthropic, or Google's computer use models based on their specific needs and preferences
- Improved Browser Navigation: New navigation actions (navigate, go_back, go_forward) provide better browser control capabilities
- Robust Action Mapping: Comprehensive mapping from Gemini's computer use functions to Skyvern's action system ensures reliable automation
- Scalable Architecture: The implementation follows existing patterns, making it easy to add future computer use models
- Dependency Updates: Upgraded websockets library to support newer versions and added Google GenAI dependency
Created with Palmier
[!IMPORTANT]
Review skipped
Draft detected.
Please check the settings in the CodeRabbit UI or the
.coderabbit.yamlfile in this repository. To trigger a single review, invoke the@coderabbitai reviewcommand.You can disable this status message by setting the
reviews.review_statustofalsein the CodeRabbit configuration file.
✨ Finishing touches
🧪 Generate unit tests (beta)
- [ ] Create PR with unit tests
- [ ] Post copyable unit tests in a comment
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Comment @coderabbitai help to get the list of available commands and usage tips.