eval-view
EvalView: pytest-style test harness for AI agents - YAML scenarios, tool-call checks, cost/latency & safety evals, CI-friendly reports
EvalView — Catch Agent Regressions Before You Ship
Your agent worked yesterday. Today it's broken. What changed?
EvalView catches agent regressions — tool changes, output changes, cost spikes, and latency spikes — before they hit production.
evalview run --diff # Compare against golden baseline, block on regression
Like what you see? ⭐ Star the repo — helps others discover it.
New: Interactive Chat Mode
Don't remember commands? Just ask.
evalview chat
Ask in plain English. Get answers. Run commands. Analyze results.
- "How do I test my Goose agent?"
- "Show me what adapters are available"
- "Run the regression demo"
Free & local — powered by Ollama. No API key needed.
# Install Ollama, then:
evalview chat # Auto-detects Ollama
evalview chat --provider openai # Or use cloud models
evalview chat --demo # Watch a scripted demo
The Problem
You changed a prompt. Or swapped models. Or updated a tool.
Now your agent:
- ❌ Calls different tools than before
- ❌ Returns different outputs for the same input
- ❌ Costs 3x more than yesterday
- ❌ Takes 5 seconds instead of 500ms
You don't find out until users complain.
The Solution
EvalView detects these regressions in CI — before you deploy.
# Save a working run as your baseline
evalview golden save .evalview/results/xxx.json
# Every future run compares against it
evalview run --diff # Fails on REGRESSION
Who is EvalView for?
Builders shipping tool-using agents whose behavior keeps breaking when prompts, models, or tools change.
- You're iterating fast on prompts and models
- You've broken your agent more than once after "just a small change"
- You want CI to catch regressions, not your users
Already using LangSmith, Langfuse, or other tracing? Use them to see what happened. Use EvalView to block bad behavior before it ships.
Your Claude Code skills might be broken. Claude silently ignores skills that exceed its 15k char budget. Check yours →
What EvalView Catches
| Regression Type | What It Means | Status |
|---|---|---|
| REGRESSION | Score dropped — agent got worse | 🔴 Fix before deploy |
| TOOLS_CHANGED | Agent uses different tools now | 🟡 Review before deploy |
| OUTPUT_CHANGED | Same tools, different response | 🟡 Review before deploy |
| PASSED | Matches baseline | 🟢 Ship it |
EvalView runs in CI. When it detects a regression, your deploy fails. You fix it before users see it.
What is EvalView?
EvalView is a regression testing framework for AI agents.
It lets you:
- Save golden baselines — snapshot a working agent run
- Detect regressions automatically — tool changes, output changes, cost spikes, latency spikes
- Block bad deploys in CI — fail the build when behavior regresses
- Plug into LangGraph, CrewAI, OpenAI Assistants, Anthropic Claude, MCP servers, and more
Think: "Regression testing for agents. Like screenshot testing, but for behavior."
Note: LLM-as-judge evaluations are probabilistic. Results may vary between runs. Use Statistical Mode for reliable pass/fail decisions.
Core Workflow
# 1. Run tests and capture a baseline
evalview run
evalview golden save .evalview/results/latest.json
# 2. Make changes to your agent (prompt, model, tools)
# 3. Run with diff to catch regressions
evalview run --diff
# 4. CI integration with configurable strictness
evalview run --diff --fail-on REGRESSION # Default: only fail on score drops
evalview run --diff --fail-on REGRESSION,TOOLS_CHANGED # Stricter: also fail on tool changes
evalview run --diff --strict # Strictest: fail on any change
Exit codes:
| Scenario | Exit Code |
|---|---|
| All tests pass, all PASSED | 0 |
| All tests pass, only warn-on statuses | 0 (with warnings) |
| Any test fails OR any fail-on status | 1 |
| Execution errors (network, timeout) | 2 |
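If you drive EvalView from your own script instead of the GitHub Action, these exit codes are all you need to gate a deploy. A minimal sketch in Python (only the command and exit code meanings come from the table above; the messages and control flow are placeholders for your pipeline):

```python
import subprocess
import sys

# Compare the current run against the golden baseline.
# Exit code semantics follow the table above.
result = subprocess.run(["evalview", "run", "--diff"])

if result.returncode == 0:
    print("All tests passed against the baseline - safe to deploy")
elif result.returncode == 1:
    print("Test failure or fail-on status - blocking deploy")
    sys.exit(1)
else:  # 2 = execution errors (network, timeout)
    print("EvalView could not complete - investigate before deploying")
    sys.exit(result.returncode)
```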
EvalView vs Manual Testing
| | Manual Testing | EvalView |
|---|---|---|
| Catches hallucinations | No | Yes |
| Tracks token cost | No | Automatic |
| Runs in CI/CD | Hard | Built-in |
| Detects regressions | No | Golden traces + --diff |
| Tests tool calls | Manual inspection | Automated |
| Flexible tool matching | Exact names only | Categories (intent-based) |
| Latency tracking | No | Per-test thresholds |
| Handles flaky LLMs | No | Statistical mode |
4 Copy-Paste Recipes
Budget regression test — fail if cost exceeds threshold:
name: "Cost check"
input:
query: "Summarize this document"
thresholds:
min_score: 70
max_cost: 0.05
Tool-call required test — fail if agent doesn't use the tool:
name: "Must use search"
input:
query: "What's the weather in NYC?"
expected:
tools:
- web_search
thresholds:
min_score: 80
Hallucination check — fail if agent makes things up:
name: "No hallucinations"
input:
query: "What's our refund policy?"
expected:
tools:
- retriever
thresholds:
min_score: 80
checks:
hallucination: true
Regression detection — fail if behavior changes from baseline:
# Save a good run as baseline
evalview golden save .evalview/results/xxx.json
# Future runs compare against it
evalview run --diff # Fails on REGRESSION or TOOLS_CHANGED
Try it in 2 minutes (no DB required)
You don't need a database, Docker, or any extra infra to start.
# Install
pip install evalview
# Set your OpenAI API key (for LLM-as-judge evaluation)
export OPENAI_API_KEY='your-key-here'
# Run the quickstart – creates a demo agent, a test case, and runs everything
evalview quickstart
You'll see a full run with:
- A demo agent spinning up
- A test case created for you
- A config file wired up
- A scored test: tools used, output quality, cost, latency
Run examples directly (no config needed)
Test cases with adapter and endpoint defined work without any setup:
# Run any example directly
evalview run examples/langgraph/test-case.yaml
evalview run examples/ollama/langgraph-ollama-test.yaml
# Your own test case with adapter/endpoint works the same way
evalview run my-test.yaml
Free local evaluation with Ollama
Don't want to pay for API calls? Use Ollama for free local LLM-as-judge:
# Install Ollama and pull a model
curl -fsSL https://ollama.ai/install.sh | sh
ollama pull llama3.2
# Run tests with free local evaluation
evalview run --judge-provider ollama --judge-model llama3.2
No API key needed. Runs entirely on your machine.
📺 Example quickstart output
━━━ EvalView Quickstart ━━━
Step 1/4: Creating demo agent...
✅ Demo agent created
Step 2/4: Creating test case...
✅ Test case created
Step 3/4: Creating config...
✅ Config created
Step 4/4: Starting demo agent and running test...
✅ Demo agent running
Running test...
Test Case: Quickstart Test
Score: 95.0/100
Status: ✅ PASSED
Tool Accuracy: 100%
Expected tools: calculator
Used tools: calculator
Output Quality: 90/100
Performance:
Cost: $0.0010
Latency: 27ms
🎉 Quickstart complete!
Useful? ⭐ Star the repo — takes 1 second, helps us a lot.
Add to CI in 60 seconds
# .github/workflows/evalview.yml
name: Agent Tests
on: [push, pull_request]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: hidai25/[email protected]
with:
openai-api-key: ${{ secrets.OPENAI_API_KEY }}
That's it. Tests run on every PR and block merges on failure.
Looking for Design Partners
Using EvalView on a real agent? I'm looking for 3-5 early adopters.
I'll personally help you set up YAML tests + CI integration in exchange for feedback on what's missing.
No pitch: I just want to learn what's broken and make it work for real use cases.
Do I need a database?
No.
By default, EvalView runs in a basic, no-DB mode:
- No external database
- Tests run in memory
- Results are printed in a rich terminal UI
You can still use it locally and in CI (exit codes + JSON reports).
That's enough to:
- Write and debug tests for your agents
- Add a "fail the build if this test breaks" check to CI/CD
If you later want history, dashboards, or analytics, you can plug in a database and turn on the advanced features:
- Store all runs over time
- Compare behavior across branches / releases
- Track cost / latency trends
- Generate HTML reports for your team
Database config is optional – EvalView only uses it if you enable it in config.
Why EvalView?
- Fully Open Source – Apache 2.0 licensed, runs entirely on your infra, no SaaS lock-in
- Framework-agnostic – Works with LangGraph, CrewAI, OpenAI, Anthropic, or any HTTP API
- Production-ready – Parallel execution, CI/CD integration, configurable thresholds
- Extensible – Custom adapters, evaluators, and reporters for your stack
Behavior Coverage (not line coverage)
Line coverage doesn't work for LLMs. Instead, EvalView focuses on behavior coverage:
| Dimension | What it measures |
|---|---|
| Tasks covered | Which real-world scenarios have tests? |
| Tools exercised | Are all your agent's tools being tested? |
| Paths hit | Are multi-step workflows tested end-to-end? |
| Eval dimensions | Are you checking correctness, safety, cost, latency? |
The loop: weird prod session → turn it into a regression test → it shows up in your coverage.
# Compact summary with deltas vs last run + regression detection
evalview run --summary
━━━ EvalView Summary ━━━
Suite: analytics_agent
Tests: 7 passed, 2 failed
Failures:
✗ cohort: large result set cost +240%
✗ doc QA: long context missing tool: chunking
Deltas vs last run:
Tokens: +188% ↑
Latency: +95ms ↑
Cost: +$0.12 ↑
⚠️ Regressions detected
# Behavior coverage report
evalview run --coverage
━━━ Behavior Coverage ━━━
Suite: analytics_agent
Tasks: 9/9 scenarios (100%)
Tools: 6/8 exercised (75%)
missing: chunking, summarize
Paths: 3/3 multi-step workflows (100%)
Dimensions: correctness ✓, output ✓, cost ✗, latency ✓, safety ✓
Overall: 92% behavior coverage
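The tools dimension is the easiest one to reason about: which of your agent's tools show up in at least one test's expectations. A rough sketch of that bookkeeping (placeholder tool names; not EvalView's internal code):

```python
# Tools your agent exposes (placeholder names)
agent_tools = {"fetch_data", "analyze", "chunking", "summarize",
               "web_search", "retriever", "calculator", "sql_run"}

# Tools referenced across your test cases' `expected.tools`
tested_tools = {"fetch_data", "analyze", "web_search",
                "retriever", "calculator", "sql_run"}

exercised = agent_tools & tested_tools
missing = agent_tools - tested_tools

print(f"Tools: {len(exercised)}/{len(agent_tools)} exercised "
      f"({len(exercised) / len(agent_tools):.0%})")
if missing:
    print("  missing:", ", ".join(sorted(missing)))
```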
Golden Traces (Regression Detection)
Problem: Your agent worked yesterday. Today it doesn't. What changed?
Solution: Save "golden" baselines, detect regressions automatically.
How It Works
# 1. Run your tests
evalview run
# 2. Save a passing run as your golden baseline
evalview golden save .evalview/results/20241201_143022.json
# 3. On future runs, compare against golden
evalview run --diff
When you run with --diff, EvalView compares every test against its golden baseline and flags:
| Status | What It Means | Action |
|---|---|---|
| PASSED | Matches baseline | 🟢 Ship it |
| TOOLS_CHANGED | Agent uses different tools | 🟡 Review before deploy |
| OUTPUT_CHANGED | Same tools, different response | 🟡 Review before deploy |
| REGRESSION | Score dropped significantly | 🔴 Fix before deploy |
Example Output
━━━ Golden Diff Report ━━━
✓ PASSED test-stock-analysis
⚠ TOOLS_CHANGED test-customer-support added: web_search
~ OUTPUT_CHANGED test-summarizer similarity: 78%
✗ REGRESSION test-code-review score dropped 15 points
1 REGRESSION - fix before deploy
1 TOOLS_CHANGED - review before deploy
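Conceptually, the diff compares a few fields of the current run against the saved baseline and reports the first status that applies. A simplified sketch of that classification (the thresholds and the similarity function here are illustrative, not EvalView's actual values):

```python
def classify(current, golden, score_drop=10, min_similarity=0.9):
    """Classify a test result against its golden baseline.

    `current` and `golden` are dicts with 'tools' (list), 'output' (str),
    and 'score' (float) - a simplified stand-in for EvalView's result format.
    """
    if current["score"] < golden["score"] - score_drop:
        return "REGRESSION"
    if set(current["tools"]) != set(golden["tools"]):
        return "TOOLS_CHANGED"
    if similarity(current["output"], golden["output"]) < min_similarity:
        return "OUTPUT_CHANGED"
    return "PASSED"


def similarity(a, b):
    # Placeholder: crude token-overlap similarity (EvalView's metric may differ).
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)
```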
Golden Commands
# Save a result as golden baseline
evalview golden save .evalview/results/xxx.json
# Save with notes
evalview golden save result.json --notes "Baseline after v2.0 refactor"
# Save only specific test from a multi-test result
evalview golden save result.json --test "stock-analysis"
# List all golden traces
evalview golden list
# Show details of a golden trace
evalview golden show test-stock-analysis
# Delete a golden trace
evalview golden delete test-stock-analysis
Use case: Add evalview run --diff to CI. Block deploys when behavior regresses.
Tool Categories (Flexible Matching)
Problem: Your test expects read_file. Agent uses bash cat. Test fails. Both are correct.
Solution: Test by intent, not exact tool name.
Before (Brittle)
expected:
tools:
- read_file # Fails if agent uses bash, text_editor, etc.
After (Flexible)
expected:
categories:
- file_read # Passes for read_file, bash cat, text_editor, etc.
Built-in Categories
| Category | Matches |
|---|---|
| `file_read` | `read_file`, `bash`, `text_editor`, `cat`, `view`, `str_replace_editor` |
| `file_write` | `write_file`, `bash`, `text_editor`, `edit_file`, `create_file` |
| `file_list` | `list_directory`, `bash`, `ls`, `find`, `directory_tree` |
| `search` | `grep`, `ripgrep`, `bash`, `search_files`, `code_search` |
| `shell` | `bash`, `shell`, `terminal`, `execute`, `run_command` |
| `web` | `web_search`, `browse`, `fetch_url`, `http_request`, `curl` |
| `git` | `git`, `bash`, `git_commit`, `git_push`, `github` |
| `python` | `python`, `bash`, `python_repl`, `execute_python`, `jupyter` |
Custom Categories
Add project-specific categories in config.yaml:
# .evalview/config.yaml
tool_categories:
database:
- postgres_query
- mysql_execute
- sql_run
my_custom_api:
- internal_api_call
- legacy_endpoint
Why this matters: Different agents use different tools for the same task. Categories let you test behavior, not implementation.
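Category matching boils down to checking that each expected category is satisfied by at least one tool the agent actually called. A small sketch of the idea (category sets taken from the table above; not EvalView's implementation):

```python
TOOL_CATEGORIES = {
    "file_read": {"read_file", "bash", "text_editor", "cat", "view", "str_replace_editor"},
    "web": {"web_search", "browse", "fetch_url", "http_request", "curl"},
    # ...plus any custom categories from .evalview/config.yaml
}

def categories_satisfied(expected_categories, used_tools):
    """True if every expected category matches at least one tool the agent used."""
    used = set(used_tools)
    return all(TOOL_CATEGORIES.get(cat, set()) & used for cat in expected_categories)

# The test passes whether the agent read the file via `read_file` or `bash`:
assert categories_satisfied(["file_read"], ["bash"])
assert categories_satisfied(["file_read"], ["read_file"])
```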
What it does (in practice)
- Write test cases in YAML – Define inputs, required tools, and scoring thresholds
- Automated evaluation – Tool accuracy, output quality (LLM-as-judge), hallucination checks, cost, latency
- Run in CI/CD – JSON/HTML reports + proper exit codes for blocking deploys
# tests/test-cases/stock-analysis.yaml
name: "Stock Analysis Test"
input:
query: "Analyze Apple stock performance"
expected:
tools:
- fetch_stock_data
- analyze_metrics
output:
contains:
- "revenue"
- "earnings"
thresholds:
min_score: 80
max_cost: 0.50
max_latency: 5000
$ evalview run
✅ Stock Analysis Test - PASSED (score: 92.5)
Cost: $0.0234 | Latency: 3.4s
Generate 1000 Tests from 1
Problem: Writing tests manually is slow. You need volume to catch regressions.
Solution: Auto-generate test variations.
Option 1: Expand from existing tests
# Take 1 test, generate 100 variations
evalview expand tests/stock-test.yaml --count 100
# Focus on specific scenarios
evalview expand tests/stock-test.yaml --count 50 \
--focus "different tickers, edge cases, error scenarios"
Generates variations like:
- Different inputs (AAPL → MSFT, GOOGL, TSLA...)
- Edge cases (invalid tickers, empty input, malformed requests)
- Boundary conditions (very long queries, special characters)
Option 2: Record from live interactions
# Use your agent normally, auto-generate tests
evalview record --interactive
EvalView captures:
- Query → Tools called → Output
- Auto-generates test YAML
- Adds reasonable thresholds
Result: Go from 5 manual tests → 500 comprehensive tests in minutes.
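Conceptually, expansion takes one seed YAML and stamps out variants with different inputs. A rough sketch of that shape using PyYAML (illustrative only; `evalview expand` generates the variations for you and covers far more than ticker swaps, and the output paths here are hypothetical):

```python
import copy
from pathlib import Path

import yaml  # pip install pyyaml

seed = yaml.safe_load(Path("tests/stock-test.yaml").read_text())

# Hand-picked variations for illustration; the real command proposes these itself.
tickers = ["MSFT", "GOOGL", "TSLA"]

for ticker in tickers:
    variant = copy.deepcopy(seed)
    variant["name"] = f"{seed['name']} ({ticker})"
    variant["input"]["query"] = f"Analyze {ticker} stock performance"
    out = Path(f"tests/generated/stock-{ticker.lower()}.yaml")
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(yaml.safe_dump(variant, sort_keys=False))
```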
Connect to your agent
Already have an agent running? Use evalview connect to auto-detect it:
# Start your agent (LangGraph, CrewAI, whatever)
langgraph dev
# Auto-detect and connect
evalview connect # Scans ports, detects framework, configures everything
# Run tests
evalview run
Supports 7+ frameworks with automatic detection: LangGraph • CrewAI • OpenAI Assistants • Anthropic Claude • AutoGen • Dify • Custom APIs
EvalView Cloud (Coming Soon)
We're building a hosted version:
- Dashboard - Visual test history, trends, and pass/fail rates
- Teams - Share results and collaborate on fixes
- Alerts - Slack/Discord notifications on failures
- Regression detection - Automatic alerts when performance degrades
- Parallel runs - Run hundreds of tests in seconds
Join the waitlist - be first to get access
Features
- Golden traces - Save baselines, detect regressions with `--diff` (docs)
- Tool categories - Flexible matching by intent, not exact tool names (docs)
- Test Expansion - Generate 100+ test variations from a single seed test
- Test Recording - Auto-generate tests from live agent interactions
- YAML-based test cases - Write readable, maintainable test definitions
- Parallel execution - Run tests concurrently (8x faster by default)
- Multiple evaluation metrics - Tool accuracy, sequence correctness, output quality, cost, and latency
- LLM-as-judge - Automated output quality assessment
- Cost tracking - Automatic cost calculation based on token usage
- Universal adapters - Works with any HTTP or streaming API
- Rich console output - Beautiful, informative test results
- JSON & HTML reports - Interactive HTML reports with Plotly charts
- Retry logic - Automatic retries with exponential backoff for flaky tests
- Watch mode - Re-run tests automatically on file changes
- Configurable weights - Customize scoring weights globally or per-test
- Statistical mode - Run tests N times, get variance metrics and flakiness scores
- Skills testing - Validate and test Claude Code / OpenAI Codex skills against official Anthropic spec
Installation
# Install (includes skills testing)
pip install evalview
# With HTML reports (Plotly charts)
pip install evalview[reports]
# With watch mode
pip install evalview[watch]
# All optional features
pip install evalview[all]
CLI Reference
evalview quickstart
The fastest way to try EvalView. Creates a demo agent, test case, and runs everything.
evalview run
Run test cases.
evalview run [OPTIONS]
Options:
--pattern TEXT Test case file pattern (default: *.yaml)
-t, --test TEXT Run specific test(s) by name
--diff Compare against golden traces, detect regressions
--verbose Enable verbose logging
--sequential Run tests one at a time (default: parallel)
--max-workers N Max parallel executions (default: 8)
--max-retries N Retry flaky tests N times (default: 0)
--watch Re-run tests on file changes
--html-report PATH Generate interactive HTML report
--summary Compact output with deltas vs last run + regression detection
--coverage Show behavior coverage: tasks, tools, paths, eval dimensions
--judge-model TEXT Model for LLM-as-judge (e.g., gpt-5, sonnet, llama-70b)
--judge-provider TEXT Provider for LLM-as-judge (openai, anthropic, huggingface, gemini, grok, ollama)
Model shortcuts - Use simple names, they auto-resolve:
| Shortcut | Full Model |
|---|---|
| `gpt-5` | `gpt-5` |
| `sonnet` | `claude-sonnet-4-5-20250929` |
| `opus` | `claude-opus-4-5-20251101` |
| `llama-70b` | `meta-llama/Llama-3.1-70B-Instruct` |
| `gemini` | `gemini-3.0` |
| `llama3.2` | `llama3.2` (Ollama) |
# Examples
evalview run --judge-model gpt-5 --judge-provider openai
evalview run --judge-model sonnet --judge-provider anthropic
evalview run --judge-model llama-70b --judge-provider huggingface # Free!
evalview run --judge-model llama3.2 --judge-provider ollama # Free & Local!
evalview expand
Generate test variations from a seed test case.
evalview expand TEST_FILE --count 100 --focus "edge cases"
evalview record
Record agent interactions and auto-generate test cases.
evalview record --interactive
evalview report
Generate report from results.
evalview report .evalview/results/20241118_004830.json --detailed --html report.html
evalview golden
Manage golden traces for regression detection.
# Save a test result as the golden baseline
evalview golden save .evalview/results/xxx.json
evalview golden save result.json --notes "Post-refactor baseline"
evalview golden save result.json --test "specific-test-name"
# List all golden traces
evalview golden list
# Show details of a golden trace
evalview golden show test-name
# Delete a golden trace
evalview golden delete test-name
evalview golden delete test-name --force
Statistical Mode (Variance Testing)
LLMs are non-deterministic. A test that passes once might fail the next run. Statistical mode addresses this by running tests multiple times and using statistical thresholds for pass/fail decisions.
Enable Statistical Mode
Add variance config to your test case:
# tests/test-cases/my-test.yaml
name: "My Agent Test"
input:
query: "Analyze the market trends"
expected:
tools:
- fetch_data
- analyze
thresholds:
min_score: 70
# Statistical mode config
variance:
runs: 10 # Run test 10 times
pass_rate: 0.8 # 80% of runs must pass
min_mean_score: 70 # Average score must be >= 70
max_std_dev: 15 # Score std dev must be <= 15
What You Get
- Pass rate - Percentage of runs that passed
- Score statistics - Mean, std dev, min/max, percentiles, confidence intervals
- Flakiness score - 0 (stable) to 1 (flaky) with category labels
- Contributing factors - Why the test is flaky (score variance, tool inconsistency, etc.)
Example Output
Statistical Evaluation: My Agent Test
PASSED
┌─ Run Summary ─────────────────────────┐
│ Total Runs: 10 │
│ Passed: 8 │
│ Failed: 2 │
│ Pass Rate: 80% (required: 80%) │
└───────────────────────────────────────┘
Score Statistics:
Mean: 79.86 95% CI: [78.02, 81.70]
Std Dev: 2.97 ▂▂▁▁▁ Low variance
Min: 75.5
Max: 84.5
┌─ Flakiness Assessment ────────────────┐
│ Flakiness Score: 0.12 ██░░░░░░░░ │
│ Category: low_variance │
│ Pass Rate: 80% │
└───────────────────────────────────────┘
See examples/statistical-mode-example.yaml for a complete example.
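The numbers in that report are standard summary statistics over the N runs. A minimal sketch of how pass rate, mean, standard deviation, and a 95% confidence interval can be computed (the scores below are made up, and EvalView's flakiness scoring is more involved than this):

```python
import statistics

# Scores from N runs of the same test (illustrative values)
scores = [72.0, 68.5, 75.0, 80.0, 66.0, 78.5, 81.0, 74.0, 79.5, 77.0]
min_score = 70  # per-run pass threshold from the test case

pass_rate = sum(s >= min_score for s in scores) / len(scores)
mean = statistics.mean(scores)
std_dev = statistics.stdev(scores)

# Normal-approximation 95% confidence interval for the mean
margin = 1.96 * std_dev / len(scores) ** 0.5

print(f"Pass rate: {pass_rate:.0%} (required: 80%)")
print(f"Mean: {mean:.2f}  95% CI: [{mean - margin:.2f}, {mean + margin:.2f}]")
print(f"Std Dev: {std_dev:.2f}  Min: {min(scores)}  Max: {max(scores)}")
```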
Evaluation Metrics
| Metric | Weight | Description |
|---|---|---|
| Tool Accuracy | 30% | Checks if expected tools were called |
| Output Quality | 50% | LLM-as-judge evaluation |
| Sequence Correctness | 20% | Validates exact tool call order |
| Cost Threshold | Pass/Fail | Must stay under max_cost |
| Latency Threshold | Pass/Fail | Must complete under max_latency |
Weights are configurable globally or per-test.
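Putting the table together: the score is a weighted blend of the first three metrics, while cost and latency act as hard gates. A simplified sketch using the default weights (not EvalView's actual scoring code; the thresholds come from the test YAML):

```python
WEIGHTS = {"tool_accuracy": 0.30, "output_quality": 0.50, "sequence_correctness": 0.20}

def score_test(metrics, thresholds):
    """metrics: per-dimension scores on a 0-100 scale plus cost/latency;
    thresholds: the test case's min_score / max_cost / max_latency."""
    weighted = sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)

    passed = (
        weighted >= thresholds.get("min_score", 0)
        and metrics["cost"] <= thresholds.get("max_cost", float("inf"))
        and metrics["latency_ms"] <= thresholds.get("max_latency", float("inf"))
    )
    return weighted, passed

score, passed = score_test(
    {"tool_accuracy": 100, "output_quality": 90, "sequence_correctness": 100,
     "cost": 0.0234, "latency_ms": 3400},
    {"min_score": 80, "max_cost": 0.50, "max_latency": 5000},
)
print(f"Score: {score:.1f}/100  {'PASSED' if passed else 'FAILED'}")
```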
CI/CD Integration
EvalView is CLI-first. You can run it locally or add to CI.
GitHub Action (Recommended)
Use the official EvalView GitHub Action for the simplest setup:
name: EvalView Agent Tests
on: [push, pull_request]
jobs:
test-agents:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run EvalView
uses: hidai25/[email protected]
with:
openai-api-key: ${{ secrets.OPENAI_API_KEY }}
max-workers: '4'
fail-on-error: 'true'
Action Inputs
| Input | Description | Default |
|---|---|---|
| `openai-api-key` | OpenAI API key for LLM-as-judge | - |
| `anthropic-api-key` | Anthropic API key (optional) | - |
| `config-path` | Path to config file | `.evalview/config.yaml` |
| `filter` | Filter tests by name pattern | - |
| `max-workers` | Parallel workers | 4 |
| `max-retries` | Retry failed tests | 2 |
| `fail-on-error` | Fail workflow on test failure | true |
| `generate-report` | Generate HTML report | true |
| `python-version` | Python version | 3.11 |
Action Outputs
| Output | Description |
|---|---|
| `results-file` | Path to JSON results |
| `report-file` | Path to HTML report |
| `total-tests` | Total tests run |
| `passed-tests` | Passed count |
| `failed-tests` | Failed count |
| `pass-rate` | Pass rate percentage |
Full Example with PR Comments
name: EvalView Agent Tests
on:
pull_request:
branches: [main]
jobs:
test-agents:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run EvalView
id: evalview
uses: hidai25/[email protected]
with:
openai-api-key: ${{ secrets.OPENAI_API_KEY }}
- name: Upload results
uses: actions/upload-artifact@v4
with:
name: evalview-results
path: |
.evalview/results/*.json
evalview-report.html
- name: Comment on PR
uses: actions/github-script@v7
with:
script: |
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: `## EvalView Results\n\n✅ ${`${{ steps.evalview.outputs.passed-tests }}`}/${`${{ steps.evalview.outputs.total-tests }}`} tests passed (${`${{ steps.evalview.outputs.pass-rate }}`}%)`
});
Manual Setup (Alternative)
If you prefer manual setup:
name: EvalView Agent Tests
on: [push, pull_request]
jobs:
evalview:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.11"
- run: pip install evalview
- run: evalview run --pattern "tests/test-cases/*.yaml"
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
Architecture
evalview/
├── adapters/ # Agent communication (HTTP, OpenAI, Anthropic, etc.)
├── evaluators/ # Evaluation logic (tools, output, cost, latency)
├── reporters/ # Output formatting (console, JSON, HTML)
├── core/ # Types, config, parallel execution
└── cli.py # Click CLI
Guides
| Guide | Description |
|---|---|
| Testing LangGraph Agents in CI | Set up automated testing for LangGraph agents with GitHub Actions |
| Detecting LLM Hallucinations | Catch hallucinations and made-up facts before they reach users |
Further Reading
| Topic | Description |
|---|---|
| Getting Started | 5-minute quickstart guide |
| Framework Support | Supported frameworks and compatibility |
| Cost Tracking | Token usage and cost calculation |
| Debugging Guide | Troubleshooting common issues |
| Adapters | Building custom adapters |
Examples
- LangGraph Integration - Test LangGraph agents
- CrewAI Integration - Test CrewAI agents
- Anthropic Claude - Test Claude API and Claude Agent SDK
- Dify Workflows - Test Dify AI workflows
- Ollama (Local LLMs) - Test with local Llama models + free local evaluation
Using Node.js / Next.js? See @evalview/node for drop-in middleware.
Skills Testing (Claude Code & OpenAI Codex)
Your Skills Are Probably Broken. Claude Is Ignoring Them.
Common symptoms:
- Skills installed but never trigger
- Claude says "I don't have that skill"
- Works locally, breaks in production
- No errors, just... silence
Why it happens: Claude Code has a 15k character budget for skill descriptions. Exceed it and skills aren't loaded. No warning. No error.
EvalView catches this before you waste hours debugging:
30 Seconds: Validate Your Skill
pip install evalview
evalview skill validate ./SKILL.md
That's it. Catches naming errors, missing fields, reserved words, and spec violations.
Try it now with the included example:
evalview skill validate examples/skills/test-skill/SKILL.md
Why Is Claude Ignoring My Skills?
Run the doctor to find out:
evalview skill doctor ~/.claude/skills/
⚠️ Character Budget: 127% OVER - Claude is ignoring 4 of your 24 skills
ISSUE: Character budget exceeded
Claude Code won't see all your skills.
Fix: Set SLASH_COMMAND_TOOL_CHAR_BUDGET=30000 or reduce descriptions
ISSUE: Duplicate skill names
code-reviewer defined in:
- ~/.claude/skills/old/SKILL.md
- ~/.claude/skills/new/SKILL.md
✗ 4 skills are INVISIBLE to Claude - fix now
This is why your skills "don't work." Claude literally can't see them.
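The budget check itself is just arithmetic over your skills' frontmatter descriptions. A rough sketch of the idea (the 15k default and the SLASH_COMMAND_TOOL_CHAR_BUDGET override come from above; the frontmatter scraping here is deliberately naive, and `evalview skill doctor` does the real work):

```python
import os
import re
from pathlib import Path

BUDGET = int(os.environ.get("SLASH_COMMAND_TOOL_CHAR_BUDGET", 15_000))

total = 0
for skill in Path.home().glob(".claude/skills/**/SKILL.md"):
    text = skill.read_text()
    # Naive frontmatter scrape: grab the description line (real parsing is stricter)
    match = re.search(r"^description:\s*(.+)$", text, re.MULTILINE)
    if match:
        total += len(match.group(1))

print(f"Description characters: {total} / {BUDGET} ({total / BUDGET:.0%} of budget)")
if total > BUDGET:
    print("Over budget - Claude Code may silently skip some skills")
```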
2 Minutes: Add Behavior Tests + CI
1. Create a test file next to your SKILL.md:
# tests.yaml
name: my-skill-tests
skill: ./SKILL.md
tests:
- name: basic-test
input: "Your test prompt"
expected:
output_contains: ["expected", "words"]
2. Run locally
echo "ANTHROPIC_API_KEY=your-key" > .env.local
evalview skill test tests.yaml
3. Add to CI — copy examples/skills/test-skill/.github/workflows/skill-tests.yml to your repo
Starter template: See examples/skills/test-skill/ for a complete copy-paste example with GitHub Actions.
Validate Skill Structure
Catch errors before Claude ever sees your skill:
# Validate a single skill
evalview skill validate ./my-skill/SKILL.md
# Validate all skills in a directory
evalview skill validate ~/.claude/skills/ -r
# CI-friendly JSON output
evalview skill validate ./skills/ -r --json
Validates against official Anthropic spec:
- `name`: max 64 chars, lowercase/numbers/hyphens only, no reserved words ("anthropic", "claude")
- `description`: max 1024 chars, non-empty, no XML tags
- Token size (warns if >5k tokens)
- Policy compliance (no prompt injection patterns)
- Best practices (examples, guidelines sections)
━━━ Skill Validation Results ━━━
✓ skills/code-reviewer/SKILL.md
Name: code-reviewer
Tokens: ~2,400
✓ skills/doc-writer/SKILL.md
Name: doc-writer
Tokens: ~1,800
✗ skills/broken/SKILL.md
ERROR [MISSING_DESCRIPTION] Skill description is required
Summary: 2 valid, 1 invalid
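Those name and description rules are simple enough to sanity-check yourself. A minimal sketch of just those two checks (EvalView's validator also covers token size, policy compliance, and best practices):

```python
import re

RESERVED = {"anthropic", "claude"}

def validate_frontmatter(name: str, description: str) -> list[str]:
    """Return a list of violations for the two required SKILL.md fields."""
    errors = []
    if not re.fullmatch(r"[a-z0-9-]{1,64}", name or ""):
        errors.append("name: max 64 chars, lowercase letters/numbers/hyphens only")
    if any(word in (name or "").lower() for word in RESERVED):
        errors.append("name: contains a reserved word")
    if not description:
        errors.append("MISSING_DESCRIPTION: Skill description is required")
    elif len(description) > 1024:
        errors.append("description: exceeds 1024 characters")
    elif re.search(r"<[^>]+>", description):
        errors.append("description: XML tags are not allowed")
    return errors

print(validate_frontmatter("code-reviewer", "Reviews diffs for security issues."))  # []
print(validate_frontmatter("Claude_Helper", ""))  # name + description violations
```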
Test Skill Behavior
Validation catches syntax errors. Behavior tests catch logic errors.
Define what your skill should do, then verify it actually does it:
# tests/code-reviewer.yaml
name: test-code-reviewer
skill: ./skills/code-reviewer/SKILL.md
tests:
- name: detects-sql-injection
input: |
Review this code:
query = f"SELECT * FROM users WHERE id = {user_id}"
expected:
output_contains: ["SQL injection", "parameterized"]
output_not_contains: ["looks good", "no issues"]
- name: approves-safe-code
input: |
Review this code:
query = db.execute("SELECT * FROM users WHERE id = ?", [user_id])
expected:
output_contains: ["secure", "parameterized"]
output_not_contains: ["vulnerability", "injection"]
Run it:
# Option 1: Environment variable
export ANTHROPIC_API_KEY=your-key
# Option 2: Create .env.local file (auto-loaded)
echo "ANTHROPIC_API_KEY=your-key" > .env.local
# Run the tests
evalview skill test tests/code-reviewer.yaml
━━━ Running Skill Tests ━━━
Suite: test-code-reviewer
Skill: ./skills/code-reviewer/SKILL.md
Model: claude-sonnet-4-20250514
Tests: 2
Results:
PASS detects-sql-injection
PASS approves-safe-code
Summary: ✓
Pass rate: 100% (2/2)
Avg latency: 1,240ms
Total tokens: 3,847
Why Test Skills?
You can test skills manually in Claude Code. So why use EvalView?
Manual testing works for development. EvalView is for automation:
| Manual Testing | EvalView |
|---|---|
| Test while you write | Test on every commit |
| You remember to test | CI blocks bad merges |
| Test a few cases | Test 50+ scenarios |
| "It works for me" | Reproducible results |
| Catch bugs after publish | Catch bugs before publish |
Who needs automated skill testing?
- Skill authors publishing to marketplaces
- Enterprise teams rolling out skills to thousands of employees
- Open source maintainers accepting contributions from the community
- Anyone who wants CI/CD for their skills
Skills are code. Code needs tests. EvalView brings the rigor of software testing to the AI skills ecosystem.
Compatible With
| Platform | Status |
|---|---|
| Claude Code | Supported |
| Claude.ai Skills | Supported |
| OpenAI Codex CLI | Same SKILL.md format |
| Custom Skills | Any SKILL.md file |
Like what you see?
If EvalView caught a regression, saved you debugging time, or kept your agent costs in check — give it a ⭐ star to help others discover it.
Roadmap
Shipped:
- [x] Golden traces & regression detection (`evalview run --diff`)
- [x] Tool categories for flexible matching
- [x] Multi-run flakiness detection
- [x] Skills testing (Claude Code, OpenAI Codex)
- [x] MCP server testing (`adapter: mcp`)
- [x] HTML diff reports (`--diff-report`)
Coming Soon:
- [ ] Multi-turn conversation testing
- [ ] Grounded hallucination checking
- [ ] LLM-as-judge for skill guideline compliance
- [ ] Error compounding metrics
Want these? Vote in GitHub Discussions
FAQ
Does EvalView work with LangChain / LangGraph?
Yes. Use the langgraph adapter. See examples/langgraph/.
Does EvalView work with CrewAI?
Yes. Use the crewai adapter. See examples/crewai/.
Does EvalView work with OpenAI Assistants?
Yes. Use the openai-assistants adapter.
Does EvalView work with Anthropic Claude?
Yes. Use the anthropic adapter. See examples/anthropic/.
How much does it cost? EvalView is free and open source. You pay only for LLM API calls (for LLM-as-judge evaluation). Use Ollama for free local evaluation.
Can I use it without an API key?
Yes. Use Ollama for free local LLM-as-judge: evalview run --judge-provider ollama --judge-model llama3.2
Can I run EvalView in CI/CD? Yes. EvalView has a GitHub Action and proper exit codes. See CI/CD Integration.
Does EvalView require a database? No. EvalView runs without any database by default. Results print to console and save as JSON.
How is EvalView different from LangSmith? LangSmith is for tracing/observability. EvalView is for testing. Use both: LangSmith to see what happened, EvalView to block bad behavior before prod.
Can I test for hallucinations? Yes. EvalView has built-in hallucination detection that compares agent output against tool results.
Can I test Claude Code skills?
Yes. Use evalview skill validate for structure checks and evalview skill test for behavior tests. See Skills Testing.
Does EvalView work with OpenAI Codex CLI skills? Yes. Codex CLI uses the same SKILL.md format as Claude Code. Your tests work for both.
Do I need an API key for skill validation?
No. evalview skill validate runs locally without any API calls. Only evalview skill test requires an Anthropic API key.
Contributing
Contributions are welcome! Please open an issue or submit a pull request.
See CONTRIBUTING.md for guidelines.
License
EvalView is open source software licensed under the Apache License 2.0.
Support
- Issues: https://github.com/hidai25/eval-view/issues
- Discussions: https://github.com/hidai25/eval-view/discussions
Affiliations
EvalView is an independent open-source project and is not affiliated with, endorsed by, or sponsored by LangGraph, CrewAI, OpenAI, Anthropic, or any other third party mentioned. All product names, logos, and brands are property of their respective owners.
Ship AI agents with confidence.