feat(build-arena): AI-powered build performance benchmark system
Overview
Build Arena is an AI-powered benchmark system that races Elide against traditional Java build tools (Maven/Gradle) using autonomous Claude Code agents. This PR introduces the complete system including frontend, backend, Docker infrastructure, and observability tools.
Demo
https://github.com/user-attachments/assets/your-demo-video-here
What's Included
🏗️ Core Infrastructure
Backend (tools/build-arena/backend/)
- Race API - Start/monitor build races between Elide and standard tools
- WebSocket Servers - Real-time terminal streaming and race status updates
- Job Management - Queue system with SQLite persistence
- Race Minder - Autonomous agent that auto-approves Claude Code prompts
- Container Management - Docker integration for isolated build environments
Frontend (tools/build-arena/frontend/)
- Race Dashboard - Real-time race visualization with live terminal output
- Build Metrics - Charts comparing build times (resource usage charts are planned)
- Repository Form - Submit any Java GitHub repo for benchmarking
- WebSocket Integration - Live updates from both containers
Docker Images (tools/build-arena/docker/)
- elide-builder - Claude Code + Elide + Java 17
- standard-builder - Claude Code + Maven + Gradle + Java 17
- Multi-platform support (linux/amd64, linux/arm64)
- Pre-configured for headless autonomous operation
🤖 Autonomous AI Agents
Race Minder (backend/src/services/race-minder.ts)
Monitors terminal WebSocket and automatically:
- ✅ Approves API key confirmation
- ✅ Approves workspace trust prompts (multiple Claude Code 2.0.30 variations)
- ✅ Auto-approves git clone commands
- ✅ Auto-approves build tool commands (elide, mvn, gradle)
- ✅ Detects completion signals (bell emoji, "BUILD COMPLETE", etc.)
- ✅ Handles API errors with retry logic
Detection Patterns:
// Workspace trust (Claude Code 2.0.30)
"Ready to code here?" // Standard
"Is this a project you created or one you trust" // Elide
"Quick safety check" // Fallback
// Completion signals
/🔔/, /BUILD COMPLETE/i, /Build succeeded/i, /Total time:/i
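A minimal sketch of how these patterns could be applied to a chunk of terminal output. The prompt strings and regexes are the ones listed above; the function name `classifyOutput`, the `TerminalEvent` type, and the order of the checks are assumptions for illustration and may not match race-minder.ts.

```typescript
// Hypothetical classifier: the names and the return shape are illustrative only.
type TerminalEvent = "workspace-trust" | "complete" | "none";

// Workspace trust prompts quoted from the PR description.
const TRUST_PROMPTS: string[] = [
  "Ready to code here?",
  "Is this a project you created or one you trust",
  "Quick safety check",
];

// Completion signals quoted from the PR description.
const COMPLETION_PATTERNS: RegExp[] = [
  /🔔/,
  /BUILD COMPLETE/i,
  /Build succeeded/i,
  /Total time:/i,
];

function classifyOutput(chunk: string): TerminalEvent {
  // Trust prompts are checked first so an approval is never skipped (an assumption).
  if (TRUST_PROMPTS.some((prompt) => chunk.includes(prompt))) {
    return "workspace-trust";
  }
  if (COMPLETION_PATTERNS.some((re) => re.test(chunk))) {
    return "complete";
  }
  return "none";
}
```

Real terminal streams may split a prompt across several WebSocket messages or interleave ANSI escape codes, so a production matcher would probably need to buffer and strip the output before matching.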
Container Instructions (docker/CLAUDE.md)
Detailed step-by-step instructions for Claude Code:
- Clone repository
- Analyze project structure (Maven/Gradle detection)
- Execute timed build with appropriate tool
- Run tests to verify build
- Ring bell (🔔) to signal completion
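In Build Arena the Maven/Gradle detection step is carried out by Claude Code following CLAUDE.md, not by backend code. As an illustration of the kind of check involved, here is a sketch based on the conventional marker files (pom.xml for Maven, build.gradle or settings.gradle and their .kts variants for Gradle). The function name, the input shape, and the choice to prefer Maven when both are present are assumptions.

```typescript
type BuildTool = "maven" | "gradle" | "unknown";

// fileNames: names of the entries in the repository root (hypothetical input).
function detectBuildTool(fileNames: string[]): BuildTool {
  const names = new Set(fileNames);

  // pom.xml is the standard Maven project descriptor.
  if (names.has("pom.xml")) return "maven";

  // Gradle projects carry a build or settings script, in Groovy or Kotlin DSL.
  const gradleMarkers = [
    "build.gradle",
    "build.gradle.kts",
    "settings.gradle",
    "settings.gradle.kts",
  ];
  if (gradleMarkers.some((marker) => names.has(marker))) return "gradle";

  return "unknown";
}
```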
Observability & Debugging
Terminal Output Dumper (scripts/dump-terminal-output.ts)
cd backend
pnpm exec tsx ../scripts/dump-terminal-output.ts <containerId>
- Connects to container WebSocket in read-only mode
- Captures 10-second snapshot of terminal output
- Shows message counts, output length, current state
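The summary the dumper prints could be computed along these lines. This is a sketch only: the message shape (`type` plus optional `data`) and the `"output"` type name are assumptions, since the PR description does not spell out the WebSocket message format.

```typescript
// Hypothetical message shape; the real protocol may differ.
interface TerminalMessage {
  type: string;
  data?: string;
}

interface SnapshotSummary {
  messageCounts: Record<string, number>; // number of messages per type
  outputLength: number;                  // total characters of terminal output
}

function summarizeSnapshot(messages: TerminalMessage[]): SnapshotSummary {
  const messageCounts: Record<string, number> = {};
  let outputLength = 0;

  for (const msg of messages) {
    messageCounts[msg.type] = (messageCounts[msg.type] ?? 0) + 1;
    // Only "output" messages are assumed to carry terminal text.
    if (msg.type === "output" && msg.data !== undefined) {
      outputLength += msg.data.length;
    }
  }
  return { messageCounts, outputLength };
}
```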
Comprehensive Documentation (docs/OBSERVABILITY.md)
- Quick start guide for monitoring races
- Debugging workflows for common issues
- Testing procedures for components
- File reference with line numbers
- Advanced debugging techniques
Backend Log Filtering
Use regex patterns to monitor specific events:
# Monitor all minder activity
Minder:
# Monitor approvals only
Auto-approving:|Bell rung|approved
# Monitor errors
Error|error|API Error
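The same patterns can be applied programmatically, for example when post-processing a saved log. A small sketch follows; the filter names and the function are hypothetical, and only the regexes come from the list above.

```typescript
// Regexes copied from the PR description; the keys are illustrative names.
const LOG_FILTERS = {
  minder: /Minder:/,
  approvals: /Auto-approving:|Bell rung|approved/,
  errors: /Error|error|API Error/,
} as const;

type FilterName = keyof typeof LOG_FILTERS;

// Returns only the log lines that match the chosen filter.
function filterLogLines(lines: string[], filter: FilterName): string[] {
  const re = LOG_FILTERS[filter];
  return lines.filter((line) => re.test(line));
}
```

Note that the error pattern is broad: it matches any line containing "error" or "Error", so some noise is to be expected.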
🎨 UI Features
Race View
- Side-by-side terminals - Watch both builds in real-time
- Live status updates - Connection state, approval counts
- Build timer - Duration tracking for each container
- Countdown to auto-start - Visual countdown before race begins
- Completion detection - Automatic finish line detection
Build Metrics
- Performance comparison - Bar charts of build times
- Resource usage - Memory, CPU tracking (planned)
- Success rate - Win/loss statistics per tool
- Historical data - SQLite database persistence
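As an illustration of the win/loss statistic, the sketch below tallies wins from a list of finished races. The `RaceResult` shape and the field names are assumptions; the actual SQLite schema is not shown in this description, and failed builds are not modelled here.

```typescript
type Tool = "elide" | "standard";

// Hypothetical result shape: wall-clock build time per container, in milliseconds.
interface RaceResult {
  elideMs: number;
  standardMs: number;
}

// Counts how many races each tool won; equal times count as a tie.
function tallyWins(results: RaceResult[]): Record<Tool | "tie", number> {
  const tally = { elide: 0, standard: 0, tie: 0 };
  for (const r of results) {
    if (r.elideMs < r.standardMs) tally.elide += 1;
    else if (r.standardMs < r.elideMs) tally.standard += 1;
    else tally.tie += 1;
  }
  return tally;
}
```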
Architecture
┌─────────────┐
│  Frontend   │ (React + Vite)
│    :3000    │
└──────┬──────┘
       │ HTTP + WebSocket
       ▼
┌─────────────┐
│   Backend   │ (Node.js + Express)
│    :3001    │
└──────┬──────┘
       │ Docker API
       ▼
┌─────────────────────────────────┐
│       Docker Containers         │
│ ┌─────────────┐ ┌─────────────┐ │
│ │elide-builder│ │standard-    │ │
│ │             │ │builder      │ │
│ │ Claude Code │ │Claude Code  │ │
│ │ + Elide     │ │+ Maven      │ │
│ │ + Java 17   │ │+ Gradle     │ │
│ └─────────────┘ └─────────────┘ │
└─────────────────────────────────┘
       ▲
       │ WebSocket (terminal I/O)
       │
┌──────┴──────┐
│ Race Minder │ (Autonomous approval agent)
└─────────────┘
Key Workflows
Starting a Race
# 1. Start services
cd tools/build-arena  # from the repository root
pnpm dev
# 2. Submit a repository (API or UI)
curl -X POST http://localhost:3001/api/races/start \
-H 'Content-Type: application/json' \
-d '{"repositoryUrl": "https://github.com/google/gson"}'
# 3. Watch in browser
open http://localhost:3000
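The curl call in step 2 can also be prepared from TypeScript. The sketch below only builds the request and does not send it. The endpoint, header, and body field are taken from the curl example above; the helper name, the `StartRaceRequest` shape, and the restriction to https://github.com URLs are assumptions for illustration.

```typescript
// Hypothetical request description that can be handed to fetch().
interface StartRaceRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildStartRaceRequest(
  repositoryUrl: string,
  baseUrl = "http://localhost:3001",
): StartRaceRequest {
  // new URL() throws a TypeError on malformed input.
  const parsed = new URL(repositoryUrl);
  // Assumption: only public GitHub repositories over HTTPS are accepted.
  if (parsed.protocol !== "https:" || parsed.hostname !== "github.com") {
    throw new Error(`Expected an https://github.com URL, got: ${repositoryUrl}`);
  }
  return {
    url: `${baseUrl}/api/races/start`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ repositoryUrl }),
  };
}
```

With Node.js 20+ the result could be sent with the built-in `fetch`, passing `req.url` as the first argument and `{ method: req.method, headers: req.headers, body: req.body }` as the second.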
Monitoring with Observability Tools
# Get race status
curl http://localhost:3001/api/races/status/<jobId>
# Monitor terminal output
cd backend
pnpm exec tsx ../scripts/dump-terminal-output.ts <containerId>
# Check container health
docker ps --filter "name=race-"
Testing
Manual Testing
# Start race via API
curl -X POST http://localhost:3001/api/races/start \
-H 'Content-Type: application/json' \
-d '{"repositoryUrl": "https://github.com/google/gson"}'
# Monitor backend logs
tail -f backend/logs/app.log | grep "Minder:"
# Use terminal dumper
cd backend && pnpm exec tsx ../scripts/dump-terminal-output.ts <containerId>
Playwright Tests
cd tools/build-arena  # from the repository root
pnpm test tests/terminal-test.spec.ts
pnpm test tests/claude-autonomous-test.spec.ts
Project Structure
tools/build-arena/
├── frontend/                 # React frontend (Vite)
│   ├── src/
│   │   ├── components/       # Terminal, Metrics, RepositoryForm
│   │   ├── pages/            # HomePage, TerminalTest, RaceView
│   │   └── hooks/            # useWebSocket, useRaceStatus
│   └── package.json
├── backend/                  # Node.js backend
│   ├── src/
│   │   ├── routes/           # API endpoints
│   │   ├── services/         # JobManager, RaceMinder, ContainerManager
│   │   ├── websocket/        # TerminalServer, RaceServer
│   │   └── db/               # SQLite schema
│   └── package.json
├── docker/                   # Docker images
│   ├── elide-builder.Dockerfile
│   ├── standard-builder.Dockerfile
│   ├── CLAUDE.md             # Instructions for autonomous builds
│   └── build-images.sh
├── scripts/                  # Utility scripts
│   └── dump-terminal-output.ts
├── docs/                     # Documentation
│   └── OBSERVABILITY.md
└── tests/                    # Playwright tests
    ├── terminal-test.spec.ts
    └── claude-autonomous-test.spec.ts
Environment Setup
Prerequisites
- Node.js 20+
- pnpm
- Docker Desktop
- Anthropic API key
Installation
cd tools/build-arena  # from the repository root
# Install dependencies
pnpm install
# Set up environment
echo "ANTHROPIC_API_KEY=your-key-here" > backend/.env
# Build Docker images
cd docker && ./build-images.sh
# Initialize database
pnpm --filter @elide/build-arena-backend db:push
# Start services
pnpm dev
Technology Stack
- Frontend: React 18, Vite, xterm.js, Recharts
- Backend: Node.js, Express, SQLite (Drizzle ORM), WebSocket (ws)
- Docker: Multi-platform images, Bash PTY sessions
- AI: Claude Code CLI 2.0.30, Anthropic API
- Testing: Playwright
Known Issues / Roadmap
Known Issues
- Claude Code premature exit - Sometimes exits after thinking without requesting commands. Investigating API timeout/error handling.
- Resource cleanup - Orphaned containers if backend crashes during race.
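One possible mitigation for the orphaned-container issue is a startup sweep that compares running containers against the races the backend still tracks. The sketch below shows only the selection logic. The `ContainerInfo` shape, the function name, and the use of the `race-` name prefix (borrowed from the `docker ps --filter "name=race-"` example) are assumptions, not the implemented behaviour.

```typescript
// Hypothetical, simplified view of a running container.
interface ContainerInfo {
  id: string;
  name: string;
}

// Returns race containers that no active race refers to.
function findOrphanedContainers(
  running: ContainerInfo[],
  activeIds: Set<string>,
  prefix = "race-",
): ContainerInfo[] {
  return running.filter(
    (c) => c.name.startsWith(prefix) && !activeIds.has(c.id),
  );
}
```

The returned containers could then be stopped and removed through the Docker API.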
Roadmap
- [ ] Minder status API endpoint for real-time state inspection
- [ ] WebSocket recorder replay API for complete message history
- [ ] Auto-restart Claude Code if it exits prematurely
- [ ] Resource usage metrics (CPU, memory, disk I/O)
- [ ] Multi-repository batch benchmarking
- [ ] Leaderboard for popular repositories
- [ ] GitHub Actions integration for CI benchmarking
Security Considerations
- Docker containers are isolated with read-only filesystems where appropriate
- API key stored in environment variables, not committed to repo
- WebSocket connections validated with container ID checks
- Build instructions limit Claude Code to repo cloning and building only
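The description does not say how the container ID check works. As a sketch of one plausible first step, the function below rejects anything that is not shaped like a Docker container ID (12 to 64 lowercase hexadecimal characters) before the value is used in a lookup. The function name and the accepted length range are assumptions.

```typescript
// Full Docker container IDs are 64 lowercase hex characters; the short form is 12.
const CONTAINER_ID_RE = /^[0-9a-f]{12,64}$/;

// Hypothetical guard for IDs received from WebSocket clients.
function isValidContainerId(id: string): boolean {
  return CONTAINER_ID_RE.test(id);
}
```

A format check alone does not prove the container belongs to a race, so it would normally be combined with a lookup against the set of containers the backend created.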
Performance
- Concurrent races: Supports multiple simultaneous races
- Resource limits: Docker containers have memory/CPU limits
- Database: SQLite for lightweight persistence
- WebSocket: Efficient binary protocol for terminal streaming
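For reference, memory and CPU limits are expressed in the Docker Engine API through the `HostConfig` fields `Memory` (bytes) and `NanoCpus` (units of 1e-9 CPUs), and a read-only root filesystem through `ReadonlyRootfs`. The helper below is a sketch of how such a fragment could be built. The function name and any concrete limit values are hypothetical, as the limits used by Build Arena are not stated here.

```typescript
// Subset of the Docker Engine API HostConfig relevant to resource limits.
interface HostConfigLimits {
  Memory: number;          // memory limit in bytes
  NanoCpus: number;        // CPU quota in units of 1e-9 CPUs
  ReadonlyRootfs: boolean; // mount the container root filesystem read-only
}

// Hypothetical helper: converts human-friendly units into HostConfig fields.
function makeHostLimits(
  memoryMiB: number,
  cpus: number,
  readOnly = false,
): HostConfigLimits {
  if (memoryMiB <= 0 || cpus <= 0) {
    throw new RangeError("memoryMiB and cpus must be positive");
  }
  return {
    Memory: memoryMiB * 1024 * 1024,
    NanoCpus: Math.round(cpus * 1e9),
    ReadonlyRootfs: readOnly,
  };
}
```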
Related Issues
Addresses #1106 (Nomad integration) by providing infrastructure for autonomous build testing and performance benchmarking.
Draft Status
This PR is marked as draft for initial team review. Specifically looking for feedback on:
- Architecture - Is the container/minder/WebSocket design sound?
- Observability - Are the debugging tools sufficient?
- AI Agent Behavior - Race minder approval patterns and error handling
- UI/UX - Dashboard layout and real-time updates
- Documentation - Clarity and completeness
Ready for initial review of the complete Build Arena system. The core functionality works end-to-end. Primary focus areas:
- Race minder detection patterns (workspace trust working great!)
- Observability tools for debugging
- Docker image configuration
- Frontend real-time updates