
feat(build-arena): AI-powered build performance benchmark system

Open rjwalters opened this issue 4 months ago • 2 comments


Overview

Build Arena is an AI-powered benchmark system that races Elide against traditional Java build tools (Maven/Gradle) using autonomous Claude Code agents. This PR introduces the complete system including frontend, backend, Docker infrastructure, and observability tools.

Demo

https://github.com/user-attachments/assets/your-demo-video-here

What's Included

🏗️ Core Infrastructure

Backend (tools/build-arena/backend/)

  • Race API - Start/monitor build races between Elide and standard tools
  • WebSocket Servers - Real-time terminal streaming and race status updates
  • Job Management - Queue system with SQLite persistence
  • Race Minder - Autonomous agent that auto-approves Claude Code prompts
  • Container Management - Docker integration for isolated build environments

Frontend (tools/build-arena/frontend/)

  • Race Dashboard - Real-time race visualization with live terminal output
  • Build Metrics - Charts comparing build times, resource usage
  • Repository Form - Submit any Java GitHub repo for benchmarking
  • WebSocket Integration - Live updates from both containers

Docker Images (tools/build-arena/docker/)

  • elide-builder - Claude Code + Elide + Java 17
  • standard-builder - Claude Code + Maven + Gradle + Java 17
  • Multi-platform support (linux/amd64, linux/arm64)
  • Pre-configured for headless autonomous operation

🤖 Autonomous AI Agents

Race Minder (backend/src/services/race-minder.ts)

Monitors terminal WebSocket and automatically:

  • ✅ Approves API key confirmation
  • ✅ Approves workspace trust prompts (multiple Claude Code 2.0.30 variations)
  • ✅ Auto-approves git clone commands
  • ✅ Auto-approves build tool commands (elide, mvn, gradle)
  • ✅ Detects completion signals (bell emoji, "BUILD COMPLETE", etc.)
  • ✅ Handles API errors with retry logic

Detection Patterns:

// Workspace trust (Claude Code 2.0.30)
"Ready to code here?"                              // Standard
"Is this a project you created or one you trust"  // Elide
"Quick safety check"                               // Fallback

// Completion signals
/🔔/, /BUILD COMPLETE/i, /Build succeeded/i, /Total time:/i
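
As a rough illustration of how these patterns could drive the minder's decisions, here is a minimal sketch. The `MinderEvent` type and `classifyOutput` function are hypothetical names, not taken from `race-minder.ts`; only the prompt strings and regexes come from the list above.

```typescript
// Hypothetical event type; the real race-minder.ts may model this differently.
type MinderEvent = "workspace-trust" | "completion" | "none";

// Prompt strings copied from the workspace trust patterns above.
const TRUST_PROMPTS: readonly string[] = [
  "Ready to code here?",
  "Is this a project you created or one you trust",
  "Quick safety check",
];

// Completion regexes copied from the completion signals above.
// None of them use the global flag, so test() carries no state between calls.
const COMPLETION_SIGNALS: readonly RegExp[] = [
  /🔔/,
  /BUILD COMPLETE/i,
  /Build succeeded/i,
  /Total time:/i,
];

// Classify one chunk of terminal output. Trust prompts are checked first
// because they block the session until they are answered (assumed ordering).
function classifyOutput(chunk: string): MinderEvent {
  if (TRUST_PROMPTS.some((prompt) => chunk.includes(prompt))) {
    return "workspace-trust";
  }
  if (COMPLETION_SIGNALS.some((signal) => signal.test(chunk))) {
    return "completion";
  }
  return "none";
}
```

Real terminal output usually carries ANSI escape sequences, so a production matcher would likely strip those before testing the patterns.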

Container Instructions (docker/CLAUDE.md)

Detailed step-by-step instructions for Claude Code:

  1. Clone repository
  2. Analyze project structure (Maven/Gradle detection)
  3. Execute timed build with appropriate tool
  4. Run tests to verify build
  5. Ring bell (🔔) to signal completion
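
In the arena itself, step 2 is carried out by Claude Code following `CLAUDE.md`, not by backend code. The sketch below only illustrates the standard marker files that distinguish Maven from Gradle projects; `detectBuildTool` and the Maven-first tie-break are assumptions made for the example.

```typescript
type BuildTool = "maven" | "gradle" | "unknown";

// Standard marker files: pom.xml for Maven; build.gradle or build.gradle.kts
// (and the matching settings files) for Gradle.
const GRADLE_MARKERS: readonly string[] = [
  "build.gradle",
  "build.gradle.kts",
  "settings.gradle",
  "settings.gradle.kts",
];

// Detect the build tool from the file names in the repository root.
function detectBuildTool(rootFiles: readonly string[]): BuildTool {
  const names = new Set(rootFiles);
  // Assumed tie-break: a repository with both marker sets is treated as Maven.
  if (names.has("pom.xml")) return "maven";
  if (GRADLE_MARKERS.some((marker) => names.has(marker))) return "gradle";
  return "unknown";
}
```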

📊 Observability & Debugging

Terminal Output Dumper (scripts/dump-terminal-output.ts)

cd backend
pnpm exec tsx ../scripts/dump-terminal-output.ts <containerId>
  • Connects to container WebSocket in read-only mode
  • Captures 10-second snapshot of terminal output
  • Shows message counts, output length, current state

Comprehensive Documentation (docs/OBSERVABILITY.md)

  • Quick start guide for monitoring races
  • Debugging workflows for common issues
  • Testing procedures for components
  • File reference with line numbers
  • Advanced debugging techniques

Backend Log Filtering

Use regex patterns to monitor specific events:

# Monitor all minder activity
Minder:

# Monitor approvals only
Auto-approving:|Bell rung|approved

# Monitor errors
Error|error|API Error
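
The same patterns can be applied programmatically, for example when post-processing a saved log. This is a minimal sketch; `LOG_FILTERS` and `filterLog` are illustrative names, and the sample log lines in the usage note are invented, not real backend output.

```typescript
// Regexes copied from the filter patterns above.
const LOG_FILTERS = {
  minder: /Minder:/,
  approvals: /Auto-approving:|Bell rung|approved/,
  errors: /Error|error|API Error/,
} as const;

type FilterName = keyof typeof LOG_FILTERS;

// Return only the log lines that match the named filter.
function filterLog(lines: readonly string[], filter: FilterName): string[] {
  const pattern = LOG_FILTERS[filter];
  return lines.filter((line) => pattern.test(line));
}
```

For instance, `filterLog(["Minder: Auto-approving: git clone", "Server listening on :3001"], "approvals")` keeps only the first line.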

🎨 UI Features

Race View

  • Side-by-side terminals - Watch both builds in real-time
  • Live status updates - Connection state, approval counts
  • Build timer - Duration tracking for each container
  • Countdown to auto-start - Visual countdown before race begins
  • Completion detection - Automatic finish line detection

Build Metrics

  • Performance comparison - Bar charts of build times
  • Resource usage - Memory, CPU tracking (planned)
  • Success rate - Win/loss statistics per tool
  • Historical data - SQLite database persistence

Architecture

┌─────────────┐
│  Frontend   │ (React + Vite)
│   :3000     │
└──────┬──────┘
       │ HTTP + WebSocket
       ▼
┌─────────────┐
│  Backend    │ (Node.js + Express)
│   :3001     │
└──────┬──────┘
       │ Docker API
       ▼
┌─────────────────────────────────┐
│     Docker Containers           │
│  ┌─────────────┐ ┌────────────┐ │
│  │elide-builder│ │standard-   │ │
│  │             │ │builder     │ │
│  │ Claude Code │ │Claude Code │ │
│  │ + Elide     │ │+ Maven     │ │
│  │ + Java 17   │ │+ Gradle    │ │
│  └─────────────┘ └────────────┘ │
└─────────────────────────────────┘
       ▲
       │ WebSocket (terminal I/O)
       │
┌──────┴──────┐
│ Race Minder │ (Autonomous approval agent)
└─────────────┘

Key Workflows

Starting a Race

# 1. Start services
cd tools/build-arena  # from the repository root
pnpm dev

# 2. Submit a repository (API or UI)
curl -X POST http://localhost:3001/api/races/start \
  -H 'Content-Type: application/json' \
  -d '{"repositoryUrl": "https://github.com/google/gson"}'

# 3. Watch in browser
open http://localhost:3000
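
For scripted use, the `curl` call in step 2 can be mirrored in TypeScript. The sketch below only builds the request; the endpoint and the `repositoryUrl` field come from the command above, while `buildStartRaceRequest`, the GitHub-only validation rule, and the `StartRaceRequest` type are assumptions for illustration. The response format is not documented here, so it is not modeled.

```typescript
// Plain description of an HTTP request, so the sketch needs no network access.
interface StartRaceRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Assumed validation rule: only https GitHub repository URLs are accepted.
const GITHUB_REPO = /^https:\/\/github\.com\/[\w.-]+\/[\w.-]+$/;

// Build the request that starts a race for the given repository.
function buildStartRaceRequest(
  repositoryUrl: string,
  baseUrl: string = "http://localhost:3001",
): StartRaceRequest {
  if (!GITHUB_REPO.test(repositoryUrl)) {
    throw new Error(`Not a GitHub repository URL: ${repositoryUrl}`);
  }
  return {
    url: `${baseUrl}/api/races/start`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ repositoryUrl }),
  };
}
```

On Node.js 20+, the result can be sent with the built-in `fetch`, for example `const request = buildStartRaceRequest("https://github.com/google/gson"); await fetch(request.url, request);`.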

Monitoring with Observability Tools

# Get race status
curl http://localhost:3001/api/races/status/<jobId>

# Monitor terminal output
cd backend
pnpm exec tsx ../scripts/dump-terminal-output.ts <containerId>

# Check container health
docker ps --filter "name=race-"

Testing

Manual Testing

# Start race via API
curl -X POST http://localhost:3001/api/races/start \
  -H 'Content-Type: application/json' \
  -d '{"repositoryUrl": "https://github.com/google/gson"}'

# Monitor backend logs
tail -f backend/logs/app.log | grep "Minder:"

# Use terminal dumper
cd backend && pnpm exec tsx ../scripts/dump-terminal-output.ts <containerId>

Playwright Tests

cd tools/build-arena  # from the repository root
pnpm test tests/terminal-test.spec.ts
pnpm test tests/claude-autonomous-test.spec.ts

Project Structure

tools/build-arena/
├── frontend/              # React frontend (Vite)
│   ├── src/
│   │   ├── components/   # Terminal, Metrics, RepositoryForm
│   │   ├── pages/        # HomePage, TerminalTest, RaceView
│   │   └── hooks/        # useWebSocket, useRaceStatus
│   └── package.json
├── backend/              # Node.js backend
│   ├── src/
│   │   ├── routes/       # API endpoints
│   │   ├── services/     # JobManager, RaceMinder, ContainerManager
│   │   ├── websocket/    # TerminalServer, RaceServer
│   │   └── db/           # SQLite schema
│   └── package.json
├── docker/               # Docker images
│   ├── elide-builder.Dockerfile
│   ├── standard-builder.Dockerfile
│   ├── CLAUDE.md         # Instructions for autonomous builds
│   └── build-images.sh
├── scripts/              # Utility scripts
│   └── dump-terminal-output.ts
├── docs/                 # Documentation
│   └── OBSERVABILITY.md
└── tests/                # Playwright tests
    ├── terminal-test.spec.ts
    └── claude-autonomous-test.spec.ts

Environment Setup

Prerequisites

  • Node.js 20+
  • pnpm
  • Docker Desktop
  • Anthropic API key

Installation

cd tools/build-arena  # from the repository root

# Install dependencies
pnpm install

# Set up environment
echo "ANTHROPIC_API_KEY=your-key-here" > backend/.env

# Build Docker images
(cd docker && ./build-images.sh)  # subshell, so later commands still run from build-arena/

# Initialize database
pnpm --filter @elide/build-arena-backend db:push

# Start services
pnpm dev

Technology Stack

  • Frontend: React 18, Vite, xterm.js, Recharts
  • Backend: Node.js, Express, SQLite (Drizzle ORM), WebSocket (ws)
  • Docker: Multi-platform images, Bash PTY sessions
  • AI: Claude Code CLI 2.0.30, Anthropic API
  • Testing: Playwright

Known Issues / Roadmap

Known Issues

  1. Claude Code premature exit - Sometimes exits after thinking without requesting commands. Investigating API timeout/error handling.
  2. Resource cleanup - Orphaned containers if backend crashes during race.
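
One possible mitigation for the second issue is a startup sweep that compares running containers against the races the backend still tracks. The sketch below assumes race containers share the `race-` name prefix used in the `docker ps --filter "name=race-"` command in this description; the function name and the example container names are hypothetical.

```typescript
// Name prefix taken from the docker ps filter shown in this description.
const RACE_PREFIX = "race-";

// Return race containers that are running but no longer tracked by the backend.
function findOrphanedContainers(
  runningNames: readonly string[],
  activeNames: readonly string[],
): string[] {
  const active = new Set(activeNames);
  return runningNames.filter(
    (name) => name.startsWith(RACE_PREFIX) && !active.has(name),
  );
}
```

The returned names could then be passed to `docker rm -f` by a cleanup script.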

Roadmap

  • [ ] Minder status API endpoint for real-time state inspection
  • [ ] WebSocket recorder replay API for complete message history
  • [ ] Auto-restart Claude Code if it exits prematurely
  • [ ] Resource usage metrics (CPU, memory, disk I/O)
  • [ ] Multi-repository batch benchmarking
  • [ ] Leaderboard for popular repositories
  • [ ] GitHub Actions integration for CI benchmarking

Security Considerations

  • Docker containers are isolated with read-only filesystems where appropriate
  • API key stored in environment variables, not committed to repo
  • WebSocket connections validated with container ID checks
  • Build instructions limit Claude Code to repo cloning and building only
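
As an illustration of what a container ID check on an incoming WebSocket connection might look like, consider the sketch below. It is not the backend's actual validation logic; `isAuthorizedContainer` is a hypothetical helper. It relies on the fact that Docker container IDs are 64 hexadecimal characters and are commonly abbreviated to a 12-character prefix.

```typescript
// Docker container IDs are 64 hex characters; the CLI commonly shows a
// 12-character prefix, so both lengths (and anything between) are accepted.
const CONTAINER_ID = /^[0-9a-f]{12,64}$/;

// Accept a connection only if the ID is well formed and matches a container
// that belongs to a currently tracked race.
function isAuthorizedContainer(
  containerId: string,
  activeIds: readonly string[],
): boolean {
  if (!CONTAINER_ID.test(containerId)) return false;
  return activeIds.some((id) => id.startsWith(containerId));
}
```

Rejecting malformed IDs before any lookup also keeps untrusted input out of later Docker API calls.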

Performance

  • Concurrent races: Supports multiple simultaneous races
  • Resource limits: Docker containers have memory/CPU limits
  • Database: SQLite for lightweight persistence
  • WebSocket: Efficient binary protocol for terminal streaming

Related Issues

Addresses #1106 (Nomad integration) by providing infrastructure for autonomous build testing and performance benchmarking.

Draft Status

This PR is marked as draft for initial team review. Specifically looking for feedback on:

  1. Architecture - Is the container/minder/WebSocket design sound?
  2. Observability - Are the debugging tools sufficient?
  3. AI Agent Behavior - Race minder approval patterns and error handling
  4. UI/UX - Dashboard layout and real-time updates
  5. Documentation - Clarity and completeness

Ready for initial review of the complete Build Arena system. The core functionality works end-to-end. Primary focus areas:

  • Race minder detection patterns (workspace trust working great!)
  • Observability tools for debugging
  • Docker image configuration
  • Frontend real-time updates

rjwalters, Nov 15 '25 00:11

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 42.94%. Comparing base (c8c853d) to head (8ec8275).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1755   +/-   ##
=======================================
  Coverage   42.94%   42.94%           
=======================================
  Files         895      895           
  Lines       42415    42415           
  Branches     5959     5959           
=======================================
  Hits        18216    18216           
  Misses      21997    21997           
  Partials     2202     2202           
Flag Coverage Δ
jvm 42.94% <ø> (ø)
lib 42.94% <ø> (ø)


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update c8c853d...8ec8275.


codecov[bot], Nov 15 '25 00:11