feat: Add GEPA prompt optimization (`vf-gepa`)
Description
Note: This feature was originally developed by Zapier for internal use with Verifiers environments. We're excited to contribute it back to the open-source project.
This PR adds GEPA (Genetic-Pareto) integration to Verifiers: an automatic prompt optimization system that improves environment prompts through reflection-based evolution.
GEPA works by:
- Testing current prompts on training examples
- Collecting rich feedback from rubric evaluations
- Using an LLM to reflect on failures and propose improved prompts
- Iteratively refining the prompt based on that reflection (a simplified sketch follows)
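For intuition, here is a deliberately simplified sketch of that loop. It is illustrative only: the helpers (`evaluate`, `reflect`) are placeholders rather than the GEPA API, and it shows greedy refinement where real GEPA maintains a Pareto pool of candidates.

```python
from typing import Callable

# Illustrative-only sketch of a reflection-based prompt optimization loop.
# `evaluate` scores a prompt on training examples and returns (score, rubric feedback);
# `reflect` asks an LLM to propose an improved prompt given that feedback.
def reflective_prompt_search(
    seed_prompt: str,
    evaluate: Callable[[str], tuple[float, str]],
    reflect: Callable[[str, str], str],
    num_candidates: int,
) -> str:
    best_prompt = seed_prompt
    best_score, feedback = evaluate(seed_prompt)
    for _ in range(num_candidates):
        candidate = reflect(best_prompt, feedback)   # propose a revision from feedback
        score, feedback = evaluate(candidate)        # re-score on training examples
        if score > best_score:                       # keep only improvements (greedy)
            best_prompt, best_score = candidate, score
    return best_prompt
```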
New CLI Command: `vf-gepa`

```bash
# Optimize the system prompt with the medium budget (~12 candidates)
vf-gepa wordle --budget medium

# Optimize both the system prompt and tool descriptions
vf-gepa wiki-search --budget heavy --components system_prompt tool_descriptions

# Custom configuration
vf-gepa my-env --max-metric-calls 1000 -n 100 --num-val 30 -m gpt-5-mini
```
Results are saved to ./gepa_results/<env_id>/<run_id>/ including optimized components, original components for comparison, and optimization metrics.
Type of Change
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Documentation update
- [ ] Test improvement
Testing
- [x] All existing tests pass when running `uv run pytest` locally.
- [x] New tests have been added to cover the changes
- [x] Tested on `wordle`, `gsm8k`, and `tool_test` (w/ `--components tool_descriptions`) environments (e.g. `gemini-2.5-flash` on a 5-min GEPA run):
```
================================================================================
GEPA OPTIMIZATION COMPLETE
================================================================================
Best validation score: 0.900
Initial validation score: 0.600
Improvement: 0.300
Total candidates fully explored: 4
```

Second attempt:

```
================================================================================
GEPA OPTIMIZATION COMPLETE
================================================================================
Best validation score: 1.000
Initial validation score: 0.400
Improvement: 0.600
Total candidates fully explored: 2
```
Checklist
- [x] My code follows the style guidelines of this project as outlined in AGENTS.md
- [x] I have performed a self-review of my own code
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] Any dependent changes have been merged and published
Additional Notes
How GEPA Works
The integration consists of several key components:
1. GEPAAdapter (`verifiers/gepa/adapter.py`)
Bridges Verifiers environments with GEPA's optimization protocol (a rough sketch follows this list):
- Component extraction/injection: Extracts optimizable text (system prompts, tool descriptions) and injects optimized versions back into environments
- Evaluation: Runs rollouts and collects scores using the environment's rubric
- Reflective dataset generation: Converts rubric feedback into structured reflection data for GEPA
- Tool-aware proposal: Uses specialized templates for tool description optimization that include tool names and parameter schemas
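As a mental model only (the method names and signatures below are hypothetical, not the real `GEPAAdapter` interface), the adapter's responsibilities look roughly like this:

```python
from dataclasses import dataclass

# Hypothetical sketch of the adapter's responsibilities; not the actual GEPAAdapter API.
@dataclass
class AdapterSketch:
    env: "Environment"  # a Verifiers environment (type name assumed for illustration)

    def extract_components(self) -> dict[str, str]:
        # Pull out the optimizable text, e.g. the system prompt.
        return {"system_prompt": self.env.system_prompt}

    def inject_components(self, components: dict[str, str]) -> None:
        # Write an optimized candidate back into the environment before evaluation.
        self.env.system_prompt = components["system_prompt"]

    def evaluate(self, examples) -> tuple[float, list[str]]:
        # Run rollouts, score them with the environment's rubric, and keep the
        # textual feedback so GEPA's reflection model can see why rollouts failed.
        scores, feedbacks = [], []
        for example in examples:
            state = self.env.rollout(example)                     # hypothetical call
            scores.append(self.env.rubric.score_rollout(state))   # hypothetical signature
            feedbacks.append(self.env.rubric.get_feedback(state))
        return sum(scores) / max(len(scores), 1), feedbacks
```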
2. Budget Modes
Three preset budgets control optimization intensity:
- `light` (~6 candidates): Quick iteration, ~30-60 min
- `medium` (~12 candidates): Balanced exploration, ~1-2 hours
- `heavy` (~18 candidates): Thorough optimization, ~2-4 hours
3. Component Selection
GEPA can optimize multiple components:
```bash
--components system_prompt                    # Default
--components tool_descriptions                # For tool-using environments
--components system_prompt tool_descriptions  # Both
```
When optimizing `tool_descriptions`, each tool's description becomes a separate optimizable component (`tool_0_description`, `tool_1_description`, etc.).
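To make the naming concrete, a flattening step along these lines (the helper and the `tool.description` attribute are assumptions for illustration) yields one component per tool:

```python
# Hypothetical illustration: flatten tool descriptions into separately named components.
def tool_components(tools) -> dict[str, str]:
    # Each tool's description becomes its own optimizable component,
    # keyed tool_0_description, tool_1_description, ...
    return {f"tool_{i}_description": tool.description for i, tool in enumerate(tools)}
```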
Rubric Changes: Feedback Support
A key non-breaking change enables reward functions to return both a score and textual feedback:
Before (still works):

```python
def accuracy(completion, answer, **kwargs) -> float:
    return 1.0 if completion == answer else 0.0
```

New: return both a score and feedback for better GEPA optimization:

```python
def accuracy_with_feedback(completion, answer, **kwargs):
    correct = completion == answer
    return {
        "score": 1.0 if correct else 0.0,
        "feedback": f"Expected: {answer}, Got: {completion}. {'✓ Correct!' if correct else '✗ Incorrect.'}",
    }
```
The feedback is collected via rubric.get_feedback(state) and used by GEPA's reflection model to understand why rollouts succeeded or failed. This enables more targeted prompt improvements.
Changes to Rubric class:
- Added `RewardResult` TypedDict in `types.py` for type-safe `{"score": float, "feedback": str}` returns
- Updated `_parse_reward_result()` to handle both float and dict returns (a sketch of this follows the list)
- Added `get_feedback(state)` method to aggregate feedback from all reward functions
- Feedback is stored in `state["feedbacks"]` during `score_rollout()` and `score_group()`
Experiment Tracking
Built-in support for wandb and MLflow:
Track with wandb:

```bash
vf-gepa my-env --budget medium --use-wandb --wandb-project my-project
```

Track with MLflow:

```bash
vf-gepa my-env --budget medium --use-mlflow --mlflow-tracking-uri http://localhost:5000
```