elide feat(build-arena): AI-powered build performance benchmark system

Overview

Build Arena is an AI-powered benchmark system that races Elide against traditional Java build tools (Maven/Gradle) using autonomous Claude Code agents. This PR introduces the complete system including frontend, backend, Docker infrastructure, and observability tools.

Demo

https://github.com/user-attachments/assets/your-demo-video-here

What's Included

🏗️ Core Infrastructure

Backend (`tools/build-arena/backend/`)

Race API - Start/monitor build races between Elide and standard tools
WebSocket Servers - Real-time terminal streaming and race status updates
Job Management - Queue system with SQLite persistence
Race Minder - Autonomous agent that auto-approves Claude Code prompts
Container Management - Docker integration for isolated build environments

Frontend (`tools/build-arena/frontend/`)

Race Dashboard - Real-time race visualization with live terminal output
Build Metrics - Charts comparing build times, resource usage
Repository Form - Submit any Java GitHub repo for benchmarking
WebSocket Integration - Live updates from both containers

Docker Images (`tools/build-arena/docker/`)

elide-builder - Claude Code + Elide + Java 17
standard-builder - Claude Code + Maven + Gradle + Java 17
Multi-platform support (linux/amd64, linux/arm64)
Pre-configured for headless autonomous operation

🤖 Autonomous AI Agents

Race Minder (`backend/src/services/race-minder.ts`)

Monitors terminal WebSocket and automatically:

✅ Approves API key confirmation
✅ Approves workspace trust prompts (multiple Claude Code 2.0.30 variations)
✅ Auto-approves git clone commands
✅ Auto-approves build tool commands (elide, mvn, gradle)
✅ Detects completion signals (bell emoji, "BUILD COMPLETE", etc.)
✅ Handles API errors with retry logic

Detection Patterns:

// Workspace trust (Claude Code 2.0.30)
"Ready to code here?"                              // Standard
"Is this a project you created or one you trust"  // Elide
"Quick safety check"                               // Fallback

// Completion signals
/🔔/, /BUILD COMPLETE/i, /Build succeeded/i, /Total time:/i
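
The patterns above could be wired into a matcher along these lines. This is a minimal sketch, not the code in `race-minder.ts`: the names `classifyOutput`, `stripAnsi`, and `MinderEvent` are hypothetical, and it assumes terminal chunks may carry ANSI escape sequences that should be removed before matching.

```typescript
// Hypothetical sketch: names and structure are illustrative, not taken from race-minder.ts.
type MinderEvent = "workspace-trust" | "completion" | "none";

// Workspace trust prompts listed above (Claude Code 2.0.30).
const TRUST_PROMPTS: readonly string[] = [
  "Ready to code here?",
  "Is this a project you created or one you trust",
  "Quick safety check",
];

// Completion signals listed above.
const COMPLETION_PATTERNS: readonly RegExp[] = [
  /🔔/,
  /BUILD COMPLETE/i,
  /Build succeeded/i,
  /Total time:/i,
];

// Remove CSI escape sequences (colors, cursor movement) so they cannot split a prompt.
function stripAnsi(text: string): string {
  return text.replace(/\x1b\[[0-?]*[ -\/]*[@-~]/g, "");
}

// Trust prompts are checked first (an assumption), then completion signals.
function classifyOutput(chunk: string): MinderEvent {
  const clean = stripAnsi(chunk);
  if (TRUST_PROMPTS.some((prompt) => clean.includes(prompt))) return "workspace-trust";
  if (COMPLETION_PATTERNS.some((re) => re.test(clean))) return "completion";
  return "none";
}
```

None of the regular expressions use the `g` flag, so repeated `test` calls carry no `lastIndex` state between chunks.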

Container Instructions (`docker/CLAUDE.md`)

Detailed step-by-step instructions for Claude Code:

Clone repository
Analyze project structure (Maven/Gradle detection)
Execute timed build with appropriate tool
Run tests to verify build
Ring bell (🔔) to signal completion
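
For reviewers unfamiliar with the Maven/Gradle detection step: in this PR the analysis is carried out by Claude Code following `docker/CLAUDE.md`, but the underlying check amounts to looking for well-known marker files in the repository root. A rough sketch, where `detectBuildTool` is a hypothetical helper and the preference for Maven when both kinds of file exist is an assumption:

```typescript
// Hypothetical helper: illustrates marker-file detection, not code from this PR.
type BuildTool = "maven" | "gradle" | "unknown";

const GRADLE_MARKERS: readonly string[] = [
  "build.gradle",
  "build.gradle.kts",
  "settings.gradle",
  "settings.gradle.kts",
];

function detectBuildTool(rootFiles: readonly string[]): BuildTool {
  const files = new Set(rootFiles);
  // Assumption: prefer Maven when both pom.xml and Gradle files are present.
  if (files.has("pom.xml")) return "maven";
  if (GRADLE_MARKERS.some((marker) => files.has(marker))) return "gradle";
  return "unknown";
}
```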

📊 Observability & Debugging

Terminal Output Dumper (`scripts/dump-terminal-output.ts`)

cd backend
pnpm exec tsx ../scripts/dump-terminal-output.ts <containerId>

Connects to container WebSocket in read-only mode
Captures 10-second snapshot of terminal output
Shows message counts, output length, current state
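
The summary step of the dumper can be pictured as a small reduction over the captured messages. The message shape below (`type` plus optional `data`) is an assumption made for illustration; the actual terminal WebSocket protocol in this PR may differ, and `summarizeSnapshot` is a hypothetical name.

```typescript
// Hypothetical message shape: the real terminal WebSocket protocol may differ.
interface TerminalMessage {
  type: string;   // e.g. "output" or "status" (assumed names)
  data?: string;  // terminal text, present on output messages
}

interface SnapshotSummary {
  messageCounts: Record<string, number>;
  outputLength: number;
}

// Count messages per type and total the length of captured terminal output.
function summarizeSnapshot(messages: readonly TerminalMessage[]): SnapshotSummary {
  const messageCounts: Record<string, number> = {};
  let outputLength = 0;
  for (const msg of messages) {
    messageCounts[msg.type] = (messageCounts[msg.type] ?? 0) + 1;
    if (msg.type === "output" && msg.data !== undefined) {
      outputLength += msg.data.length;
    }
  }
  return { messageCounts, outputLength };
}
```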

Comprehensive Documentation (`docs/OBSERVABILITY.md`)

Quick start guide for monitoring races
Debugging workflows for common issues
Testing procedures for components
File reference with line numbers
Advanced debugging techniques

Backend Log Filtering

Use regex patterns to monitor specific events:

# Monitor all minder activity
Minder:

# Monitor approvals only
Auto-approving:|Bell rung|approved

# Monitor errors
Error|error|API Error
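
The same filters can be applied programmatically, for example in a test or a small log viewer. A minimal sketch using the three patterns above; `LOG_FILTERS` and `filterLogLines` are hypothetical names, and the sample log lines in the usage note are invented for illustration.

```typescript
// The three regex filters listed above, keyed by purpose.
const LOG_FILTERS = {
  minder: /Minder:/,
  approvals: /Auto-approving:|Bell rung|approved/,
  errors: /Error|error|API Error/,
} as const;

type FilterName = keyof typeof LOG_FILTERS;

// Return only the lines matching the chosen filter.
function filterLogLines(lines: readonly string[], filter: FilterName): string[] {
  const re = LOG_FILTERS[filter];
  return lines.filter((line) => re.test(line));
}
```

Given the made-up lines `"Minder: Auto-approving: git clone"`, `"Minder: Bell rung"`, `"API Error: 529"`, and `"Server listening"`, the `approvals` filter keeps the first two and the `errors` filter keeps only the third.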

🎨 UI Features

Race View

Side-by-side terminals - Watch both builds in real-time
Live status updates - Connection state, approval counts
Build timer - Duration tracking for each container
Countdown to auto-start - Visual countdown before race begins
Completion detection - Automatic finish line detection

Build Metrics

Performance comparison - Bar charts of build times
Resource usage - Memory, CPU tracking (planned)
Success rate - Win/loss statistics per tool
Historical data - SQLite database persistence
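
One way the win/loss statistics could be derived from stored race results is sketched below. The `RaceOutcome` shape and the tie rules (both builds failed, or identical durations) are assumptions for illustration, not the schema or logic in this PR.

```typescript
// Hypothetical result shape: not the SQLite schema from this PR.
interface BuildResult {
  succeeded: boolean;
  durationMs: number;
}

interface RaceOutcome {
  elide: BuildResult;
  standard: BuildResult;
}

type Winner = "elide" | "standard" | "tie";

// A successful build beats a failed one; otherwise the shorter duration wins.
function pickWinner(outcome: RaceOutcome): Winner {
  const { elide, standard } = outcome;
  if (elide.succeeded !== standard.succeeded) {
    return elide.succeeded ? "elide" : "standard";
  }
  if (!elide.succeeded) return "tie"; // both builds failed
  if (elide.durationMs === standard.durationMs) return "tie";
  return elide.durationMs < standard.durationMs ? "elide" : "standard";
}

// Aggregate winners across many races.
function tallyWins(outcomes: readonly RaceOutcome[]): Record<Winner, number> {
  const tally: Record<Winner, number> = { elide: 0, standard: 0, tie: 0 };
  for (const outcome of outcomes) {
    tally[pickWinner(outcome)] += 1;
  }
  return tally;
}
```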

Architecture

┌─────────────┐
│  Frontend   │ (React + Vite)
│   :3000     │
└──────┬──────┘
       │ HTTP + WebSocket
       ▼
┌─────────────┐
│  Backend    │ (Node.js + Express)
│   :3001     │
└──────┬──────┘
       │ Docker API
       ▼
┌─────────────────────────────────┐
│     Docker Containers           │
│  ┌─────────────┐ ┌────────────┐│
│  │elide-builder│ │standard-   ││
│  │             │ │builder     ││
│  │ Claude Code │ │Claude Code ││
│  │ + Elide     │ │+ Maven     ││
│  │ + Java 17   │ │+ Gradle    ││
│  └─────────────┘ └────────────┘│
└─────────────────────────────────┘
       ▲
       │ WebSocket (terminal I/O)
       │
┌──────┴──────┐
│ Race Minder │ (Autonomous approval agent)
└─────────────┘

Key Workflows

Starting a Race

# 1. Start services
cd tools/build-arena  # from the repository root
pnpm dev

# 2. Submit a repository (API or UI)
curl -X POST http://localhost:3001/api/races/start \
  -H 'Content-Type: application/json' \
  -d '{"repositoryUrl": "https://github.com/google/gson"}'

# 3. Watch in browser
open http://localhost:3000
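
The same request can be issued from TypeScript. The sketch below only builds the request so it stays self-contained; the endpoint and JSON body come from the `curl` call above, while `buildStartRaceRequest` and the GitHub URL check are hypothetical additions. The response format is not described in this PR, so it is not modeled here.

```typescript
// Endpoint taken from the curl example above.
const START_RACE_URL = "http://localhost:3001/api/races/start";

// Assumption: only https://github.com/<owner>/<repo> URLs are accepted.
const GITHUB_REPO_PATTERN = /^https:\/\/github\.com\/[\w.-]+\/[\w.-]+\/?$/;

interface StartRaceRequest {
  url: string;
  init: {
    method: "POST";
    headers: Record<string, string>;
    body: string;
  };
}

// Build the POST request for starting a race; throws on non-GitHub URLs.
function buildStartRaceRequest(repositoryUrl: string): StartRaceRequest {
  if (!GITHUB_REPO_PATTERN.test(repositoryUrl)) {
    throw new Error(`Not a GitHub repository URL: ${repositoryUrl}`);
  }
  return {
    url: START_RACE_URL,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ repositoryUrl }),
    },
  };
}
```

On Node.js 20+, the result can be passed straight to the built-in `fetch`: `const { url, init } = buildStartRaceRequest("https://github.com/google/gson"); await fetch(url, init);`.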

Monitoring with Observability Tools

# Get race status
curl http://localhost:3001/api/races/status/<jobId>

# Monitor terminal output
cd backend
pnpm exec tsx ../scripts/dump-terminal-output.ts <containerId>

# Check container health
docker ps --filter "name=race-"

Testing

Manual Testing

# Start race via API
curl -X POST http://localhost:3001/api/races/start \
  -H 'Content-Type: application/json' \
  -d '{"repositoryUrl": "https://github.com/google/gson"}'

# Monitor backend logs
tail -f backend/logs/app.log | grep "Minder:"

# Use terminal dumper
cd backend && pnpm exec tsx ../scripts/dump-terminal-output.ts <containerId>

Playwright Tests

cd tools/build-arena  # from the repository root
pnpm test tests/terminal-test.spec.ts
pnpm test tests/claude-autonomous-test.spec.ts

Project Structure

tools/build-arena/
├── frontend/              # React frontend (Vite)
│   ├── src/
│   │   ├── components/   # Terminal, Metrics, RepositoryForm
│   │   ├── pages/        # HomePage, TerminalTest, RaceView
│   │   └── hooks/        # useWebSocket, useRaceStatus
│   └── package.json
├── backend/              # Node.js backend
│   ├── src/
│   │   ├── routes/       # API endpoints
│   │   ├── services/     # JobManager, RaceMinder, ContainerManager
│   │   ├── websocket/    # TerminalServer, RaceServer
│   │   └── db/           # SQLite schema
│   └── package.json
├── docker/               # Docker images
│   ├── elide-builder.Dockerfile
│   ├── standard-builder.Dockerfile
│   ├── CLAUDE.md         # Instructions for autonomous builds
│   └── build-images.sh
├── scripts/              # Utility scripts
│   └── dump-terminal-output.ts
├── docs/                 # Documentation
│   └── OBSERVABILITY.md
└── tests/                # Playwright tests
    ├── terminal-test.spec.ts
    └── claude-autonomous-test.spec.ts

Environment Setup

Prerequisites

Node.js 20+
pnpm
Docker Desktop
Anthropic API key

Installation

cd tools/build-arena  # from the repository root

# Install dependencies
pnpm install

# Set up environment
echo "ANTHROPIC_API_KEY=your-key-here" > backend/.env

# Build Docker images
(cd docker && ./build-images.sh)  # subshell, so later commands still run from tools/build-arena

# Initialize database
pnpm --filter @elide/build-arena-backend db:push

# Start services
pnpm dev

Technology Stack

Frontend: React 18, Vite, xterm.js, Recharts
Backend: Node.js, Express, SQLite (Drizzle ORM), WebSocket (ws)
Docker: Multi-platform images, Bash PTY sessions
AI: Claude Code CLI 2.0.30, Anthropic API
Testing: Playwright

Known Issues / Roadmap

Known Issues

Claude Code premature exit - Sometimes exits after thinking without requesting commands. Investigating API timeout/error handling.
Resource cleanup - Containers are left orphaned if the backend crashes during a race.
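
A possible mitigation for the orphaned-container issue is a startup sweep that compares running `race-` containers against the jobs the backend still tracks. The sketch below shows only the selection logic; `findOrphanedContainers` is a hypothetical helper, and the `race-` name prefix is inferred from the `docker ps --filter "name=race-"` command in this description. The Docker API reports container names with a leading slash, which the sketch strips.

```typescript
// Minimal subset of what a container listing returns (id plus names).
interface ContainerSummary {
  id: string;
  names: readonly string[]; // Docker API names look like "/race-abc123"
}

const RACE_PREFIX = "race-"; // inferred from the docker ps filter in this PR

// Select race containers that no active job refers to.
function findOrphanedContainers(
  containers: readonly ContainerSummary[],
  activeContainerIds: ReadonlySet<string>,
): ContainerSummary[] {
  return containers.filter((container) => {
    const isRaceContainer = container.names.some((name) =>
      name.replace(/^\//, "").startsWith(RACE_PREFIX),
    );
    return isRaceContainer && !activeContainerIds.has(container.id);
  });
}
```

The returned containers could then be stopped and removed; that step is left out because it depends on how the backend wraps the Docker client.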

Roadmap

[ ] Minder status API endpoint for real-time state inspection
[ ] WebSocket recorder replay API for complete message history
[ ] Auto-restart Claude Code if it exits prematurely
[ ] Resource usage metrics (CPU, memory, disk I/O)
[ ] Multi-repository batch benchmarking
[ ] Leaderboard for popular repositories
[ ] GitHub Actions integration for CI benchmarking

Security Considerations

Docker containers are isolated with read-only filesystems where appropriate
API key stored in environment variables, not committed to repo
WebSocket connections validated with container ID checks
Build instructions limit Claude Code to repo cloning and building only
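
As an illustration of the container ID check on WebSocket connections, a validator might confirm both the shape of the ID and that it belongs to a container the backend started. This is a sketch under assumptions: `isAllowedContainerId` is a hypothetical name and the actual check in this PR may differ.

```typescript
// Docker container IDs are 64 lowercase hex characters; the CLI commonly shows the first 12.
const CONTAINER_ID_PATTERN = /^[a-f0-9]{12,64}$/;

// Accept an ID only if it is well-formed and belongs to a known race container.
function isAllowedContainerId(
  id: string,
  knownContainerIds: ReadonlySet<string>,
): boolean {
  return CONTAINER_ID_PATTERN.test(id) && knownContainerIds.has(id);
}
```

Checking the format first keeps malformed input (path fragments, shell metacharacters) away from any later lookup or Docker call.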

Performance

Concurrent races: Supports multiple simultaneous races
Resource limits: Docker containers have memory/CPU limits
Database: SQLite for lightweight persistence
WebSocket: Efficient binary protocol for terminal streaming

Related Issues

Addresses #1106 (Nomad integration) by providing infrastructure for autonomous build testing and performance benchmarking.

Draft Status

This PR is marked as draft for initial team review. Specifically looking for feedback on:

Architecture - Is the container/minder/WebSocket design sound?
Observability - Are the debugging tools sufficient?
AI Agent Behavior - Race minder approval patterns and error handling
UI/UX - Dashboard layout and real-time updates
Documentation - Clarity and completeness

Ready for initial review of the complete Build Arena system. The core functionality works end-to-end. Primary focus areas:

Race minder detection patterns (workspace trust working great!)
Observability tools for debugging
Docker image configuration
Frontend real-time updates

Nov 15 '25 00:11 rjwalters

Review the following changes in direct dependencies.

Package
npm/react-router-dom@7.9.5
npm/@types/uuid@9.0.8
npm/@types/express@4.17.25
npm/@xterm/addon-fit@0.10.0
npm/@types/ws@8.18.1
npm/cors@2.8.5
npm/@types/react-dom@18.3.7
npm/@types/dockerode@3.3.45
npm/@types/react@18.3.26
npm/@types/node@20.19.24
npm/@types/cors@2.8.19
npm/tsx@4.20.6
npm/vite@5.4.21
npm/uuid@9.0.1
npm/node-fetch@3.3.2
npm/autoprefixer@10.4.21
npm/ws@8.18.3
npm/dockerode@4.0.9
npm/tailwindcss@3.4.18
npm/drizzle-orm@0.44.7
npm/@libsql/client@0.15.15
npm/swr@2.3.6
npm/@biomejs/biome@1.9.4
npm/@vitejs/plugin-react@4.7.0
npm/drizzle-kit@0.31.6
npm/@xterm/xterm@5.5.0
npm/zod@3.25.76
npm/@playwright/test@1.56.1


Nov 15 '25 00:11 socket-security[bot]

Codecov Report

✅ All modified and coverable lines are covered by tests. ✅ Project coverage is 42.94%. Comparing base (c8c853d) to head (8ec8275).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1755   +/-   ##
=======================================
  Coverage   42.94%   42.94%           
=======================================
  Files         895      895           
  Lines       42415    42415           
  Branches     5959     5959           
=======================================
  Hits        18216    18216           
  Misses      21997    21997           
  Partials     2202     2202

Flag	Coverage Δ
jvm	`42.94% <ø> (ø)`
lib	`42.94% <ø> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update c8c853d...8ec8275.

Nov 15 '25 00:11 codecov[bot]
