
EvalView: pytest-style test harness for AI agents - YAML scenarios, tool-call checks, cost/latency & safety evals, CI-friendly reports

EvalView — Catch Agent Regressions Before You Ship

Your agent worked yesterday. Today it's broken. What changed?

EvalView catches agent regressions — tool changes, output changes, cost spikes, and latency spikes — before they hit production.

evalview run --diff  # Compare against golden baseline, block on regression


EvalView Demo

Like what you see? ⭐ Star the repo — helps others discover it.


New: Interactive Chat Mode

Don't remember commands? Just ask.

evalview chat

EvalView Chat Demo

Ask in plain English. Get answers. Run commands. Analyze results.

  • "How do I test my Goose agent?"
  • "Show me what adapters are available"
  • "Run the regression demo"

Free & local — powered by Ollama. No API key needed.

# Install Ollama, then:
evalview chat                     # Auto-detects Ollama
evalview chat --provider openai   # Or use cloud models
evalview chat --demo              # Watch a scripted demo

The Problem

You changed a prompt. Or swapped models. Or updated a tool.

Now your agent:

  • ❌ Calls different tools than before
  • ❌ Returns different outputs for the same input
  • ❌ Costs 3x more than yesterday
  • ❌ Takes 5 seconds instead of 500ms

You don't find out until users complain.

The Solution

EvalView detects these regressions in CI — before you deploy.

# Save a working run as your baseline
evalview golden save .evalview/results/xxx.json

# Every future run compares against it
evalview run --diff  # Fails on REGRESSION

Who is EvalView for?

Builders shipping tool-using agents whose behavior keeps breaking when they change prompts, models, or tools.

  • You're iterating fast on prompts and models
  • You've broken your agent more than once after "just a small change"
  • You want CI to catch regressions, not your users

Already using LangSmith, Langfuse, or other tracing? Use them to see what happened. Use EvalView to block bad behavior before it ships.

Your Claude Code skills might be broken. Claude silently ignores skills that exceed its 15k char budget. Check yours →


What EvalView Catches

Regression Type What It Means Status
REGRESSION Score dropped — agent got worse 🔴 Fix before deploy
TOOLS_CHANGED Agent uses different tools now 🟡 Review before deploy
OUTPUT_CHANGED Same tools, different response 🟡 Review before deploy
PASSED Matches baseline 🟢 Ship it

EvalView runs in CI. When it detects a regression, your deploy fails. You fix it before users see it.


What is EvalView?

EvalView is a regression testing framework for AI agents.

It lets you:

  • Save golden baselines — snapshot a working agent run
  • Detect regressions automatically — tool changes, output changes, cost spikes, latency spikes
  • Block bad deploys in CI — fail the build when behavior regresses
  • Plug into LangGraph, CrewAI, OpenAI Assistants, Anthropic Claude, MCP servers, and more

Think: "Regression testing for agents. Like screenshot testing, but for behavior."

Note: LLM-as-judge evaluations are probabilistic. Results may vary between runs. Use Statistical Mode for reliable pass/fail decisions.


Core Workflow

# 1. Run tests and capture a baseline
evalview run
evalview golden save .evalview/results/latest.json

# 2. Make changes to your agent (prompt, model, tools)

# 3. Run with diff to catch regressions
evalview run --diff

# 4. CI integration with configurable strictness
evalview run --diff --fail-on REGRESSION                    # Default: only fail on score drops
evalview run --diff --fail-on REGRESSION,TOOLS_CHANGED      # Stricter: also fail on tool changes
evalview run --diff --strict                                # Strictest: fail on any change

Exit codes:

Scenario Exit Code
All tests pass, all PASSED 0
All tests pass, only warn-on statuses 0 (with warnings)
Any test fails OR any fail-on status 1
Execution errors (network, timeout) 2
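
In a deploy script, you can gate directly on these exit codes. A minimal sketch (the ./deploy.sh step is a placeholder for your own deploy command):

# Block the deploy unless EvalView exits 0
if evalview run --diff --fail-on REGRESSION,TOOLS_CHANGED; then
  ./deploy.sh   # placeholder: your deploy step
else
  echo "EvalView failed or found a regression; deploy blocked" >&2
  exit 1
fi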

EvalView vs Manual Testing

Capability Manual Testing EvalView
Catches hallucinations No Yes
Tracks token cost No Automatic
Runs in CI/CD Hard Built-in
Detects regressions No Golden traces + --diff
Tests tool calls Manual inspection Automated
Flexible tool matching Exact names only Categories (intent-based)
Latency tracking No Per-test thresholds
Handles flaky LLMs No Statistical mode

4 Copy-Paste Recipes

Budget regression test — fail if cost exceeds threshold:

name: "Cost check"
input:
  query: "Summarize this document"
thresholds:
  min_score: 70
  max_cost: 0.05

Tool-call required test — fail if agent doesn't use the tool:

name: "Must use search"
input:
  query: "What's the weather in NYC?"
expected:
  tools:
    - web_search
thresholds:
  min_score: 80

Hallucination check — fail if agent makes things up:

name: "No hallucinations"
input:
  query: "What's our refund policy?"
expected:
  tools:
    - retriever
thresholds:
  min_score: 80
checks:
  hallucination: true

Regression detection — fail if behavior changes from baseline:

# Save a good run as baseline
evalview golden save .evalview/results/xxx.json

# Future runs compare against it
evalview run --diff  # Fails on REGRESSION or TOOLS_CHANGED

Try it in 2 minutes (no DB required)

You don't need a database, Docker, or any extra infra to start.

# Install
pip install evalview

# Set your OpenAI API key (for LLM-as-judge evaluation)
export OPENAI_API_KEY='your-key-here'

# Run the quickstart – creates a demo agent, a test case, and runs everything
evalview quickstart

You'll see a full run with:

  • A demo agent spinning up
  • A test case created for you
  • A config file wired up
  • A scored test: tools used, output quality, cost, latency

Run examples directly (no config needed)

Test cases that define an adapter and endpoint work without any further setup:

# Run any example directly
evalview run examples/langgraph/test-case.yaml
evalview run examples/ollama/langgraph-ollama-test.yaml

# Your own test case with adapter/endpoint works the same way
evalview run my-test.yaml
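
For reference, the my-test.yaml in the last command might look roughly like this. Treat it as a sketch: the adapter value and endpoint URL below are assumptions, so copy the real values from the matching example under examples/ for your framework.

# my-test.yaml (illustrative sketch)
name: "Weather lookup"
adapter: langgraph                # assumed value; match your framework's example
endpoint: http://localhost:2024   # assumed URL; wherever your agent is served
input:
  query: "What's the weather in NYC?"
expected:
  tools:
    - web_search
thresholds:
  min_score: 80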

Free local evaluation with Ollama

Don't want to pay for API calls? Use Ollama for free local LLM-as-judge:

# Install Ollama and pull a model
curl -fsSL https://ollama.ai/install.sh | sh
ollama pull llama3.2

# Run tests with free local evaluation
evalview run --judge-provider ollama --judge-model llama3.2

No API key needed. Runs entirely on your machine.

📺 Example quickstart output
━━━ EvalView Quickstart ━━━

Step 1/4: Creating demo agent...
✅ Demo agent created

Step 2/4: Creating test case...
✅ Test case created

Step 3/4: Creating config...
✅ Config created

Step 4/4: Starting demo agent and running test...
✅ Demo agent running

Running test...

Test Case: Quickstart Test
Score: 95.0/100
Status: ✅ PASSED

Tool Accuracy: 100%
  Expected tools:  calculator
  Used tools:      calculator

Output Quality: 90/100

Performance:
  Cost:    $0.0010
  Latency: 27ms

🎉 Quickstart complete!

Useful? ⭐ Star the repo — takes 1 second, helps us a lot.


Add to CI in 60 seconds

# .github/workflows/evalview.yml
name: Agent Tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hidai25/[email protected]
        with:
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}

That's it. Tests run on every PR, block merges on failure.


Looking for Design Partners

Using EvalView on a real agent? I'm looking for 3-5 early adopters.

I'll personally help you set up YAML tests + CI integration in exchange for feedback on what's missing.

No pitch, just want to learn what's broken and make it work for real use cases.


Do I need a database?

No.

By default, EvalView runs in a basic, no-DB mode:

  • No external database
  • Tests run in memory
  • Results are printed in a rich terminal UI

You can still use it locally and in CI (exit codes + JSON reports).

That's enough to:

  • Write and debug tests for your agents
  • Add a "fail the build if this test breaks" check to CI/CD

If you later want history, dashboards, or analytics, you can plug in a database and turn on the advanced features:

  • Store all runs over time
  • Compare behavior across branches / releases
  • Track cost / latency trends
  • Generate HTML reports for your team

Database config is optional – EvalView only uses it if you enable it in config.


Why EvalView?

  • Fully Open Source – Apache 2.0 licensed, runs entirely on your infra, no SaaS lock-in
  • Framework-agnostic – Works with LangGraph, CrewAI, OpenAI, Anthropic, or any HTTP API
  • Production-ready – Parallel execution, CI/CD integration, configurable thresholds
  • Extensible – Custom adapters, evaluators, and reporters for your stack

Behavior Coverage (not line coverage)

Line coverage doesn't work for LLMs. Instead, EvalView focuses on behavior coverage:

Dimension What it measures
Tasks covered Which real-world scenarios have tests?
Tools exercised Are all your agent's tools being tested?
Paths hit Are multi-step workflows tested end-to-end?
Eval dimensions Are you checking correctness, safety, cost, latency?

The loop: weird prod session → turn it into a regression test → it shows up in your coverage.
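
In commands, that loop looks roughly like this (both commands are covered in the CLI reference below):

# Capture a real session as a test case, then see it reflected in coverage
evalview record --interactive   # turns a live interaction into test YAML
evalview run --coverage         # the new scenario now counts toward coverage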

# Compact summary with deltas vs last run + regression detection
evalview run --summary
━━━ EvalView Summary ━━━
Suite: analytics_agent
Tests: 7 passed, 2 failed

Failures:
  ✗ cohort: large result set     cost +240%
  ✗ doc QA: long context         missing tool: chunking

Deltas vs last run:
  Tokens:  +188%  ↑
  Latency: +95ms  ↑
  Cost:    +$0.12 ↑

⚠️  Regressions detected

# Behavior coverage report
evalview run --coverage
━━━ Behavior Coverage ━━━
Suite: analytics_agent

Tasks:      9/9 scenarios (100%)
Tools:      6/8 exercised (75%)
            missing: chunking, summarize
Paths:      3/3 multi-step workflows (100%)
Dimensions: correctness ✓, output ✓, cost ✗, latency ✓, safety ✓

Overall:    92% behavior coverage

Golden Traces (Regression Detection)

Problem: Your agent worked yesterday. Today it doesn't. What changed?

Solution: Save "golden" baselines, detect regressions automatically.

How It Works

# 1. Run your tests
evalview run

# 2. Save a passing run as your golden baseline
evalview golden save .evalview/results/20241201_143022.json

# 3. On future runs, compare against golden
evalview run --diff

When you run with --diff, EvalView compares every test against its golden baseline and flags:

Status What It Means Action
PASSED Matches baseline 🟢 Ship it
TOOLS_CHANGED Agent uses different tools 🟡 Review before deploy
OUTPUT_CHANGED Same tools, different response 🟡 Review before deploy
REGRESSION Score dropped significantly 🔴 Fix before deploy

Example Output

━━━ Golden Diff Report ━━━

✓ PASSED           test-stock-analysis
⚠ TOOLS_CHANGED    test-customer-support    added: web_search
~ OUTPUT_CHANGED   test-summarizer          similarity: 78%
✗ REGRESSION       test-code-review         score dropped 15 points

1 REGRESSION - fix before deploy
1 TOOLS_CHANGED - review before deploy

Golden Commands

# Save a result as golden baseline
evalview golden save .evalview/results/xxx.json

# Save with notes
evalview golden save result.json --notes "Baseline after v2.0 refactor"

# Save only specific test from a multi-test result
evalview golden save result.json --test "stock-analysis"

# List all golden traces
evalview golden list

# Show details of a golden trace
evalview golden show test-stock-analysis

# Delete a golden trace
evalview golden delete test-stock-analysis

Use case: Add evalview run --diff to CI. Block deploys when behavior regresses.
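
If you wire CI by hand instead of using the EvalView GitHub Action, a minimal sketch of such a job looks like this. It assumes your golden baselines are committed to the repository so --diff has something to compare against.

# .github/workflows/agent-regressions.yml (illustrative sketch)
name: Agent Regression Check
on: [pull_request]

jobs:
  diff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install evalview
      # A non-zero exit code on REGRESSION or TOOLS_CHANGED fails the job
      - run: evalview run --diff --fail-on REGRESSION,TOOLS_CHANGED
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}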


Tool Categories (Flexible Matching)

Problem: Your test expects read_file. Agent uses bash cat. Test fails. Both are correct.

Solution: Test by intent, not exact tool name.

Before (Brittle)

expected:
  tools:
    - read_file      # Fails if agent uses bash, text_editor, etc.

After (Flexible)

expected:
  categories:
    - file_read      # Passes for read_file, bash cat, text_editor, etc.

Built-in Categories

Category Matches
file_read read_file, bash, text_editor, cat, view, str_replace_editor
file_write write_file, bash, text_editor, edit_file, create_file
file_list list_directory, bash, ls, find, directory_tree
search grep, ripgrep, bash, search_files, code_search
shell bash, shell, terminal, execute, run_command
web web_search, browse, fetch_url, http_request, curl
git git, bash, git_commit, git_push, github
python python, bash, python_repl, execute_python, jupyter

Custom Categories

Add project-specific categories in config.yaml:

# .evalview/config.yaml
tool_categories:
  database:
    - postgres_query
    - mysql_execute
    - sql_run
  my_custom_api:
    - internal_api_call
    - legacy_endpoint

Why this matters: Different agents use different tools for the same task. Categories let you test behavior, not implementation.


What it does (in practice)

  • Write test cases in YAML – Define inputs, required tools, and scoring thresholds
  • Automated evaluation – Tool accuracy, output quality (LLM-as-judge), hallucination checks, cost, latency
  • Run in CI/CD – JSON/HTML reports + proper exit codes for blocking deploys

# tests/test-cases/stock-analysis.yaml
name: "Stock Analysis Test"
input:
  query: "Analyze Apple stock performance"

expected:
  tools:
    - fetch_stock_data
    - analyze_metrics
  output:
    contains:
      - "revenue"
      - "earnings"

thresholds:
  min_score: 80
  max_cost: 0.50
  max_latency: 5000

$ evalview run

✅ Stock Analysis Test - PASSED (score: 92.5)
   Cost: $0.0234 | Latency: 3.4s

Generate 1000 Tests from 1

Problem: Writing tests manually is slow. You need volume to catch regressions.

Solution: Auto-generate test variations.

Option 1: Expand from existing tests

# Take 1 test, generate 100 variations
evalview expand tests/stock-test.yaml --count 100

# Focus on specific scenarios
evalview expand tests/stock-test.yaml --count 50 \
  --focus "different tickers, edge cases, error scenarios"

Generates variations like:

  • Different inputs (AAPL → MSFT, GOOGL, TSLA...)
  • Edge cases (invalid tickers, empty input, malformed requests)
  • Boundary conditions (very long queries, special characters)

Option 2: Record from live interactions

# Use your agent normally, auto-generate tests
evalview record --interactive

EvalView captures each interaction (query → tools called → output), then:

  • Auto-generates test YAML
  • Adds reasonable thresholds

Result: Go from 5 manual tests → 500 comprehensive tests in minutes.


Connect to your agent

Already have an agent running? Use evalview connect to auto-detect it:

# Start your agent (LangGraph, CrewAI, whatever)
langgraph dev

# Auto-detect and connect
evalview connect  # Scans ports, detects framework, configures everything

# Run tests
evalview run

Supports 7+ frameworks with automatic detection: LangGraph • CrewAI • OpenAI Assistants • Anthropic Claude • AutoGen • Dify • Custom APIs


EvalView Cloud (Coming Soon)

We're building a hosted version:

  • Dashboard - Visual test history, trends, and pass/fail rates
  • Teams - Share results and collaborate on fixes
  • Alerts - Slack/Discord notifications on failures
  • Regression detection - Automatic alerts when performance degrades
  • Parallel runs - Run hundreds of tests in seconds

Join the waitlist - be first to get access


Features

  • Golden traces - Save baselines, detect regressions with --diff (docs)
  • Tool categories - Flexible matching by intent, not exact tool names (docs)
  • Test Expansion - Generate 100+ test variations from a single seed test
  • Test Recording - Auto-generate tests from live agent interactions
  • YAML-based test cases - Write readable, maintainable test definitions
  • Parallel execution - Run tests concurrently (8 workers by default)
  • Multiple evaluation metrics - Tool accuracy, sequence correctness, output quality, cost, and latency
  • LLM-as-judge - Automated output quality assessment
  • Cost tracking - Automatic cost calculation based on token usage
  • Universal adapters - Works with any HTTP or streaming API
  • Rich console output - Beautiful, informative test results
  • JSON & HTML reports - Interactive HTML reports with Plotly charts
  • Retry logic - Automatic retries with exponential backoff for flaky tests
  • Watch mode - Re-run tests automatically on file changes
  • Configurable weights - Customize scoring weights globally or per-test
  • Statistical mode - Run tests N times, get variance metrics and flakiness scores
  • Skills testing - Validate and test Claude Code / OpenAI Codex skills against official Anthropic spec

Installation

# Install (includes skills testing)
pip install evalview

# With HTML reports (Plotly charts)
pip install evalview[reports]

# With watch mode
pip install evalview[watch]

# All optional features
pip install evalview[all]

CLI Reference

evalview quickstart

The fastest way to try EvalView. Creates a demo agent, test case, and runs everything.

evalview run

Run test cases.

evalview run [OPTIONS]

Options:
  --pattern TEXT         Test case file pattern (default: *.yaml)
  -t, --test TEXT        Run specific test(s) by name
  --diff                 Compare against golden traces, detect regressions
  --verbose              Enable verbose logging
  --sequential           Run tests one at a time (default: parallel)
  --max-workers N        Max parallel executions (default: 8)
  --max-retries N        Retry flaky tests N times (default: 0)
  --watch                Re-run tests on file changes
  --html-report PATH     Generate interactive HTML report
  --summary              Compact output with deltas vs last run + regression detection
  --coverage             Show behavior coverage: tasks, tools, paths, eval dimensions
  --judge-model TEXT     Model for LLM-as-judge (e.g., gpt-5, sonnet, llama-70b)
  --judge-provider TEXT  Provider for LLM-as-judge (openai, anthropic, huggingface, gemini, grok, ollama)

Model shortcuts - Use simple names; they auto-resolve:

Shortcut Full Model
gpt-5 gpt-5
sonnet claude-sonnet-4-5-20250929
opus claude-opus-4-5-20251101
llama-70b meta-llama/Llama-3.1-70B-Instruct
gemini gemini-3.0
llama3.2 llama3.2 (Ollama)

# Examples
evalview run --judge-model gpt-5 --judge-provider openai
evalview run --judge-model sonnet --judge-provider anthropic
evalview run --judge-model llama-70b --judge-provider huggingface  # Free!
evalview run --judge-model llama3.2 --judge-provider ollama  # Free & Local!

evalview expand

Generate test variations from a seed test case.

evalview expand TEST_FILE --count 100 --focus "edge cases"

evalview record

Record agent interactions and auto-generate test cases.

evalview record --interactive

evalview report

Generate report from results.

evalview report .evalview/results/20241118_004830.json --detailed --html report.html

evalview golden

Manage golden traces for regression detection.

# Save a test result as the golden baseline
evalview golden save .evalview/results/xxx.json
evalview golden save result.json --notes "Post-refactor baseline"
evalview golden save result.json --test "specific-test-name"

# List all golden traces
evalview golden list

# Show details of a golden trace
evalview golden show test-name

# Delete a golden trace
evalview golden delete test-name
evalview golden delete test-name --force

Statistical Mode (Variance Testing)

LLMs are non-deterministic. A test that passes once might fail the next run. Statistical mode addresses this by running tests multiple times and using statistical thresholds for pass/fail decisions.

Enable Statistical Mode

Add variance config to your test case:

# tests/test-cases/my-test.yaml
name: "My Agent Test"
input:
  query: "Analyze the market trends"

expected:
  tools:
    - fetch_data
    - analyze

thresholds:
  min_score: 70

  # Statistical mode config
  variance:
    runs: 10           # Run test 10 times
    pass_rate: 0.8     # 80% of runs must pass
    min_mean_score: 70 # Average score must be >= 70
    max_std_dev: 15    # Score std dev must be <= 15

What You Get

  • Pass rate - Percentage of runs that passed
  • Score statistics - Mean, std dev, min/max, percentiles, confidence intervals
  • Flakiness score - 0 (stable) to 1 (flaky) with category labels
  • Contributing factors - Why the test is flaky (score variance, tool inconsistency, etc.)

Example Output

Statistical Evaluation: My Agent Test
PASSED

┌─ Run Summary ─────────────────────────┐
│  Total Runs:     10                   │
│  Passed:         8                    │
│  Failed:         2                    │
│  Pass Rate:      80% (required: 80%)  │
└───────────────────────────────────────┘

Score Statistics:
  Mean:      79.86    95% CI: [78.02, 81.70]
  Std Dev:   2.97     ▂▂▁▁▁ Low variance
  Min:       75.5
  Max:       84.5

┌─ Flakiness Assessment ────────────────┐
│  Flakiness Score: 0.12 ██░░░░░░░░     │
│  Category:        low_variance        │
│  Pass Rate:       80%                 │
└───────────────────────────────────────┘

See examples/statistical-mode-example.yaml for a complete example.


Evaluation Metrics

Metric Weight Description
Tool Accuracy 30% Checks if expected tools were called
Output Quality 50% LLM-as-judge evaluation
Sequence Correctness 20% Validates exact tool call order
Cost Threshold Pass/Fail Must stay under max_cost
Latency Threshold Pass/Fail Must complete under max_latency

Weights are configurable globally or per-test.
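
As a purely hypothetical sketch of a per-test override: the weights key and its field names below are assumptions for illustration, not the confirmed schema, so check the configuration docs for the real syntax.

# Illustrative only: key names are assumed, not taken from the docs above
name: "Weighted test"
input:
  query: "Analyze Apple stock performance"
weights:                     # assumed key for overriding the 30/50/20 defaults
  tool_accuracy: 0.4
  output_quality: 0.4
  sequence_correctness: 0.2
thresholds:
  min_score: 80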


CI/CD Integration

EvalView is CLI-first. You can run it locally or add it to CI.

GitHub Action (Recommended)

Use the official EvalView GitHub Action for the simplest setup:

name: EvalView Agent Tests

on: [push, pull_request]

jobs:
  test-agents:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run EvalView
        uses: hidai25/[email protected]
        with:
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
          max-workers: '4'
          fail-on-error: 'true'

Action Inputs

Input Description Default
openai-api-key OpenAI API key for LLM-as-judge -
anthropic-api-key Anthropic API key (optional) -
config-path Path to config file .evalview/config.yaml
filter Filter tests by name pattern -
max-workers Parallel workers 4
max-retries Retry failed tests 2
fail-on-error Fail workflow on test failure true
generate-report Generate HTML report true
python-version Python version 3.11

Action Outputs

Output Description
results-file Path to JSON results
report-file Path to HTML report
total-tests Total tests run
passed-tests Passed count
failed-tests Failed count
pass-rate Pass rate percentage

Full Example with PR Comments

name: EvalView Agent Tests

on:
  pull_request:
    branches: [main]

jobs:
  test-agents:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run EvalView
        id: evalview
        uses: hidai25/[email protected]
        with:
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}

      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: evalview-results
          path: |
            .evalview/results/*.json
            evalview-report.html

      - name: Comment on PR
        uses: actions/github-script@v7
        with:
          script: |
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `## EvalView Results\n\n✅ ${{ steps.evalview.outputs.passed-tests }}/${{ steps.evalview.outputs.total-tests }} tests passed (${{ steps.evalview.outputs.pass-rate }}%)`
            });

Manual Setup (Alternative)

If you prefer manual setup:

name: EvalView Agent Tests

on: [push, pull_request]

jobs:
  evalview:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install evalview
      - run: evalview run --pattern "tests/test-cases/*.yaml"
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Architecture

evalview/
├── adapters/           # Agent communication (HTTP, OpenAI, Anthropic, etc.)
├── evaluators/         # Evaluation logic (tools, output, cost, latency)
├── reporters/          # Output formatting (console, JSON, HTML)
├── core/               # Types, config, parallel execution
└── cli.py              # Click CLI

Guides

Guide Description
Testing LangGraph Agents in CI Set up automated testing for LangGraph agents with GitHub Actions
Detecting LLM Hallucinations Catch hallucinations and made-up facts before they reach users

Further Reading

Topic Description
Getting Started 5-minute quickstart guide
Framework Support Supported frameworks and compatibility
Cost Tracking Token usage and cost calculation
Debugging Guide Troubleshooting common issues
Adapters Building custom adapters

Examples

  • LangGraph Integration - Test LangGraph agents
  • CrewAI Integration - Test CrewAI agents
  • Anthropic Claude - Test Claude API and Claude Agent SDK
  • Dify Workflows - Test Dify AI workflows
  • Ollama (Local LLMs) - Test with local Llama models + free local evaluation

Using Node.js / Next.js? See @evalview/node for drop-in middleware.


Skills Testing (Claude Code & OpenAI Codex)

Your Skills Are Probably Broken. Claude Is Ignoring Them.

Common symptoms:

  • Skills installed but never trigger
  • Claude says "I don't have that skill"
  • Works locally, breaks in production
  • No errors, just... silence

Why it happens: Claude Code has a 15k character budget for skill descriptions. Exceed it and skills aren't loaded. No warning. No error.

EvalView catches this before you waste hours debugging:

30 Seconds: Validate Your Skill

pip install evalview
evalview skill validate ./SKILL.md

That's it. Catches naming errors, missing fields, reserved words, and spec violations.

Try it now with the included example:

evalview skill validate examples/skills/test-skill/SKILL.md

Why Is Claude Ignoring My Skills?

Run the doctor to find out:

evalview skill doctor ~/.claude/skills/
⚠️  Character Budget: 127% OVER - Claude is ignoring 4 of your 24 skills

ISSUE: Character budget exceeded
  Claude Code won't see all your skills.
  Fix: Set SLASH_COMMAND_TOOL_CHAR_BUDGET=30000 or reduce descriptions

ISSUE: Duplicate skill names
  code-reviewer defined in:
    - ~/.claude/skills/old/SKILL.md
    - ~/.claude/skills/new/SKILL.md

✗ 4 skills are INVISIBLE to Claude - fix now

This is why your skills "don't work." Claude literally can't see them.


2 Minutes: Add Behavior Tests + CI

1. Create a test file next to your SKILL.md:

# tests.yaml
name: my-skill-tests
skill: ./SKILL.md

tests:
  - name: basic-test
    input: "Your test prompt"
    expected:
      output_contains: ["expected", "words"]

2. Run locally

echo "ANTHROPIC_API_KEY=your-key" > .env.local
evalview skill test tests.yaml

3. Add to CI — copy examples/skills/test-skill/.github/workflows/skill-tests.yml to your repo

Starter template: See examples/skills/test-skill/ for a complete copy-paste example with GitHub Actions.
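
For orientation, that workflow boils down to something like the minimal sketch below; the bundled skill-tests.yml in the starter template is the canonical version.

# .github/workflows/skill-tests.yml (illustrative sketch)
name: Skill Tests
on: [push, pull_request]

jobs:
  skills:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install evalview
      # Structure check: no API key required
      - run: evalview skill validate ./SKILL.md
      # Behavior tests: calls the Anthropic API
      - run: evalview skill test tests.yaml
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}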


Validate Skill Structure

Catch errors before Claude ever sees your skill:

# Validate a single skill
evalview skill validate ./my-skill/SKILL.md

# Validate all skills in a directory
evalview skill validate ~/.claude/skills/ -r

# CI-friendly JSON output
evalview skill validate ./skills/ -r --json

Validates against the official Anthropic spec:

  • name: max 64 chars, lowercase/numbers/hyphens only, no reserved words ("anthropic", "claude")
  • description: max 1024 chars, non-empty, no XML tags
  • Token size (warns if >5k tokens)
  • Policy compliance (no prompt injection patterns)
  • Best practices (examples, guidelines sections)

━━━ Skill Validation Results ━━━

✓ skills/code-reviewer/SKILL.md
   Name: code-reviewer
   Tokens: ~2,400

✓ skills/doc-writer/SKILL.md
   Name: doc-writer
   Tokens: ~1,800

✗ skills/broken/SKILL.md
   ERROR [MISSING_DESCRIPTION] Skill description is required

Summary: 2 valid, 1 invalid

Test Skill Behavior

Validation catches syntax errors. Behavior tests catch logic errors.

Define what your skill should do, then verify it actually does it:

# tests/code-reviewer.yaml
name: test-code-reviewer
skill: ./skills/code-reviewer/SKILL.md

tests:
  - name: detects-sql-injection
    input: |
      Review this code:
      query = f"SELECT * FROM users WHERE id = {user_id}"
    expected:
      output_contains: ["SQL injection", "parameterized"]
      output_not_contains: ["looks good", "no issues"]

  - name: approves-safe-code
    input: |
      Review this code:
      query = db.execute("SELECT * FROM users WHERE id = ?", [user_id])
    expected:
      output_contains: ["secure", "parameterized"]
      output_not_contains: ["vulnerability", "injection"]

Run it:

# Option 1: Environment variable
export ANTHROPIC_API_KEY=your-key

# Option 2: Create .env.local file (auto-loaded)
echo "ANTHROPIC_API_KEY=your-key" > .env.local

# Run the tests
evalview skill test tests/code-reviewer.yaml
━━━ Running Skill Tests ━━━

Suite:  test-code-reviewer
Skill:  ./skills/code-reviewer/SKILL.md
Model:  claude-sonnet-4-20250514
Tests:  2

Results:

  PASS detects-sql-injection
  PASS approves-safe-code

Summary: ✓
  Pass rate: 100% (2/2)
  Avg latency: 1,240ms
  Total tokens: 3,847

Why Test Skills?

You can test skills manually in Claude Code. So why use EvalView?

Manual testing works for development. EvalView is for automation:

Manual Testing EvalView
Test while you write Test on every commit
You remember to test CI blocks bad merges
Test a few cases Test 50+ scenarios
"It works for me" Reproducible results
Catch bugs after publish Catch bugs before publish

Who needs automated skill testing?

  • Skill authors publishing to marketplaces
  • Enterprise teams rolling out skills to thousands of employees
  • Open source maintainers accepting contributions from the community
  • Anyone who wants CI/CD for their skills

Skills are code. Code needs tests. EvalView brings the rigor of software testing to the AI skills ecosystem.

Compatible With

Platform Status
Claude Code Supported
Claude.ai Skills Supported
OpenAI Codex CLI Same SKILL.md format
Custom Skills Any SKILL.md file

Like what you see?

If EvalView caught a regression, saved you debugging time, or kept your agent costs in check — give it a ⭐ star to help others discover it.


Roadmap

Shipped:

  • [x] Golden traces & regression detection (evalview run --diff)
  • [x] Tool categories for flexible matching
  • [x] Multi-run flakiness detection
  • [x] Skills testing (Claude Code, OpenAI Codex)
  • [x] MCP server testing (adapter: mcp)
  • [x] HTML diff reports (--diff-report)

Coming Soon:

  • [ ] Multi-turn conversation testing
  • [ ] Grounded hallucination checking
  • [ ] LLM-as-judge for skill guideline compliance
  • [ ] Error compounding metrics

Want these? Vote in GitHub Discussions


FAQ

Does EvalView work with LangChain / LangGraph? Yes. Use the langgraph adapter. See examples/langgraph/.

Does EvalView work with CrewAI? Yes. Use the crewai adapter. See examples/crewai/.

Does EvalView work with OpenAI Assistants? Yes. Use the openai-assistants adapter.

Does EvalView work with Anthropic Claude? Yes. Use the anthropic adapter. See examples/anthropic/.

How much does it cost? EvalView is free and open source. You pay only for LLM API calls (for LLM-as-judge evaluation). Use Ollama for free local evaluation.

Can I use it without an API key? Yes. Use Ollama for free local LLM-as-judge: evalview run --judge-provider ollama --judge-model llama3.2

Can I run EvalView in CI/CD? Yes. EvalView has a GitHub Action and proper exit codes. See CI/CD Integration.

Does EvalView require a database? No. EvalView runs without any database by default. Results print to console and save as JSON.

How is EvalView different from LangSmith? LangSmith is for tracing/observability. EvalView is for testing. Use both: LangSmith to see what happened, EvalView to block bad behavior before prod.

Can I test for hallucinations? Yes. EvalView has built-in hallucination detection that compares agent output against tool results.

Can I test Claude Code skills? Yes. Use evalview skill validate for structure checks and evalview skill test for behavior tests. See Skills Testing.

Does EvalView work with OpenAI Codex CLI skills? Yes. Codex CLI uses the same SKILL.md format as Claude Code. Your tests work for both.

Do I need an API key for skill validation? No. evalview skill validate runs locally without any API calls. Only evalview skill test requires an Anthropic API key.


Contributing

Contributions are welcome! Please open an issue or submit a pull request.

See CONTRIBUTING.md for guidelines.

License

EvalView is open source software licensed under the Apache License 2.0.

Support

  • Issues: https://github.com/hidai25/eval-view/issues
  • Discussions: https://github.com/hidai25/eval-view/discussions

Affiliations

EvalView is an independent open-source project and is not affiliated with, endorsed by, or sponsored by LangGraph, CrewAI, OpenAI, Anthropic, or any other third party mentioned. All product names, logos, and brands are property of their respective owners.

Ship AI agents with confidence.