Add Computer.tracing for Recording Sessions
Problem
Currently, session recording is only available through ComputerAgent(save_trajectory=...) or the Computer demonstration UI, which has several limitations:
- Limited to ComputerAgent: Users who implement custom agents or call Computer directly cannot record sessions
- Inflexible for advanced use cases: Training, replay, and debugging scenarios need more customizable recording options (e.g., storing accessibility trees)
- Format inconsistency: ComputerAgent and the Computer demonstration Gradio UI use different recording formats
- No human-in-the-loop support: Manual interactions and hybrid workflows can't be properly recorded
Proposed Solution
Add a Computer.tracing API inspired by Playwright's tracing functionality:
# Start tracing with configurable options
await computer.tracing.start({
    'video': True,
    'screenshots': True,
    'api_calls': True,
    'accessibility_tree': True,  # For training/debugging
    'metadata': True  # Custom metadata support
})
# Perform agent operations
agent = ComputerAgent(computer=computer, ...)
async for _ in agent.run("open trycua/cua"):
    pass
# Or direct computer operations
await computer.interface.click(x, y)
await computer.interface.type("hello world")
# Stop tracing and save
await computer.tracing.stop({'path': 'trace.zip'})
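To make the start/stop lifecycle above more concrete, here is a minimal sketch of how such a recorder might behave. Everything in it is an assumption for illustration: `TraceRecorder`, `record_call`, and the `trace.json` layout are hypothetical names, not part of the existing cua codebase, and video, screenshot, and accessibility-tree capture are left out.

```python
import asyncio
import json
import os
import tempfile
import time
import zipfile
from typing import Any, Optional


class TraceRecorder:
    """Hypothetical stand-in for Computer.tracing; not the real cua API."""

    def __init__(self) -> None:
        self._options: dict = {}
        self._events: list = []
        self._active = False

    async def start(self, options: Optional[dict] = None) -> None:
        # Reject a second start() so that two traces never interleave.
        if self._active:
            raise RuntimeError("tracing already started")
        self._options = dict(options or {})
        self._events = []
        self._active = True

    def record_call(self, name: str, **kwargs: Any) -> None:
        # Record only while tracing is active and 'api_calls' is enabled.
        if self._active and self._options.get("api_calls", False):
            self._events.append(
                {"type": "api_call", "name": name, "args": kwargs, "timestamp": time.time()}
            )

    async def stop(self, options: dict) -> str:
        # Write the options and recorded events into a zip archive at options['path'].
        if not self._active:
            raise RuntimeError("tracing not started")
        self._active = False
        path = options["path"]
        payload = {"options": self._options, "events": self._events}
        with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as archive:
            archive.writestr("trace.json", json.dumps(payload, indent=2))
        return path


async def demo() -> list:
    """Record two fake calls, save the trace, and read the call names back."""
    path = os.path.join(tempfile.mkdtemp(), "trace.zip")
    tracer = TraceRecorder()
    await tracer.start({"api_calls": True})
    tracer.record_call("click", x=10, y=20)
    tracer.record_call("type", text="hello world")
    await tracer.stop({"path": path})
    with zipfile.ZipFile(path) as archive:
        data = json.loads(archive.read("trace.json"))
    return [event["name"] for event in data["events"]]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In a real implementation the `Computer` interface would presumably call something like `record_call` itself, so that both ComputerAgent runs and direct `computer.interface` calls land in the same trace.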
Use Cases
- Custom agent development: Record sessions during agent development and testing
- Training data collection: Capture rich interaction data for model training
- RPA debugging: Record robotic process automation workflows to diagnose failures and optimize performance
- UI unit testing: Capture automated UI test sessions for test result analysis and flaky test debugging
- Human-in-the-loop: Record mixed human/agent sessions for workflow analysis
- Compliance/audit: Keep records of automated actions for regulatory purposes
- Performance monitoring: Record sessions to analyze agent performance and identify bottlenecks
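As an illustration of the performance-monitoring use case, the sketch below ranks recorded calls by duration. It assumes each trace event carries `name`, `start`, and `end` fields; that event shape and the `slowest_calls` helper are hypothetical, and the sample timestamps are made up for the example rather than measured.

```python
from typing import Any


def slowest_calls(events: list, top_n: int = 3) -> list:
    """Return (name, duration) pairs for recorded API calls, longest first."""
    durations = [
        (event["name"], event["end"] - event["start"])
        for event in events
        if event.get("type") == "api_call"
    ]
    return sorted(durations, key=lambda item: item[1], reverse=True)[:top_n]


# Illustrative events only; the timestamps are invented seconds, not measurements.
sample_events: list = [
    {"type": "api_call", "name": "screenshot", "start": 0.0, "end": 0.5},
    {"type": "api_call", "name": "click", "start": 0.5, "end": 0.75},
    {"type": "api_call", "name": "type", "start": 0.75, "end": 2.75},
]

if __name__ == "__main__":
    for name, seconds in slowest_calls(sample_events, top_n=2):
        print(f"{name}: {seconds:.2f}s")
```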
I'd like to work on implementing Computer.tracing for session recording. I plan to create a modular async API that supports video, screenshots, API calls, accessibility trees, and metadata, and that works with both ComputerAgent and direct Computer operations. This will enable richer session recording for training, debugging, and human-in-the-loop workflows. @f-trycua could you assign this to me?