
feat: Add GEPA prompt optimization (`vf-gepa`)

Open rsalimans47 opened this issue 2 weeks ago • 1 comment

Description

Note: This feature was originally developed by Zapier for internal use with Verifiers environments. We're excited to contribute it back to the open-source project.

This PR adds GEPA (Genetic-Pareto) integration to Verifiers: an automatic prompt optimization system that improves environment prompts through reflection-based evolution.

GEPA works by:

  1. Testing current prompts on training examples
  2. Collecting rich feedback from rubric evaluations
  3. Using an LLM to reflect on failures and propose improved prompts
  4. Iteratively refining the prompts based on those reflections
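
A minimal sketch of that loop in Python, using hypothetical helper names (evaluate, collect_feedback, reflect_and_propose) purely for illustration; the real implementation lives in the gepa package and additionally maintains a Pareto frontier of candidates:

def gepa_loop(env, prompt, train_examples, num_candidates):
    # Sketch of the reflect-and-refine loop described above (helper names are hypothetical).
    best, best_score = prompt, evaluate(env, prompt, train_examples)   # 1. test current prompt
    for _ in range(num_candidates):
        feedback = collect_feedback(env, best, train_examples)         # 2. rich rubric feedback
        candidate = reflect_and_propose(best, feedback)                # 3. LLM reflects and proposes a new prompt
        score = evaluate(env, candidate, train_examples)
        if score > best_score:                                         # 4. keep improvements and iterate
            best, best_score = candidate, score
    return best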

New CLI Command: vf-gepa

Optimize system prompt with medium budget (~12 candidates)

vf-gepa wordle --budget medium

Optimize both system prompt and tool descriptions

vf-gepa wiki-search --budget heavy --components system_prompt tool_descriptions

Custom configuration

vf-gepa my-env --max-metric-calls 1000 -n 100 --num-val 30 -m gpt-5-mini

Results are saved to ./gepa_results/<env_id>/<run_id>/ including optimized components, original components for comparison, and optimization metrics.

Type of Change

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] Documentation update
  • [ ] Test improvement

Testing

  • [x] All existing tests pass when running uv run pytest locally.
  • [x] New tests have been added to cover the changes
  • [x] Tested on the wordle, gsm8k, and tool_test (with --components tool_descriptions) environments

Example output (gemini-2.5-flash on a 5-minute GEPA run):

================================================================================
GEPA OPTIMIZATION COMPLETE
================================================================================
Best validation score: 0.900
Initial validation score: 0.600
Improvement: 0.300
Total candidates fully explored: 4

Second attempt:

================================================================================
GEPA OPTIMIZATION COMPLETE
================================================================================
Best validation score: 1.000
Initial validation score: 0.400
Improvement: 0.600
Total candidates fully explored: 2

Checklist

  • [x] My code follows the style guidelines of this project as outlined in AGENTS.md
  • [x] I have performed a self-review of my own code
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [x] I have made corresponding changes to the documentation
  • [x] My changes generate no new warnings
  • [x] Any dependent changes have been merged and published

Additional Notes

How GEPA Works

The integration consists of several key components:

1. GEPAAdapter (verifiers/gepa/adapter.py)

Bridges Verifiers environments with GEPA's optimization protocol:

  • Component extraction/injection: Extracts optimizable text (system prompts, tool descriptions) and injects optimized versions back into environments
  • Evaluation: Runs rollouts and collects scores using the environment's rubric
  • Reflective dataset generation: Converts rubric feedback into structured reflection data for GEPA
  • Tool-aware proposal: Uses specialized templates for tool description optimization that include tool names and parameter schemas
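
A rough sketch of these responsibilities, with hypothetical method names used only for illustration (the actual interface is defined in verifiers/gepa/adapter.py):

class GEPAAdapterSketch:
    def __init__(self, env, components=("system_prompt",)):
        self.env = env
        self.components = components

    def extract_components(self) -> dict[str, str]:
        # Pull optimizable text (system prompt, tool descriptions) out of the environment.
        ...

    def inject_components(self, optimized: dict[str, str]) -> None:
        # Write optimized text back into the environment before evaluation.
        ...

    def evaluate(self, examples) -> list[float]:
        # Run rollouts and score them with the environment's rubric.
        ...

    def build_reflective_dataset(self, rollouts) -> list[dict]:
        # Convert rubric feedback into structured reflection data for GEPA.
        ...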

2. Budget Modes

Three preset budgets control optimization intensity:

  • light (~6 candidates): Quick iteration, ~30-60 min
  • medium (~12 candidates): Balanced exploration, ~1-2 hours
  • heavy (~18 candidates): Thorough optimization, ~2-4 hours

3. Component Selection

GEPA can optimize multiple components:

--components system_prompt                    # Default
--components tool_descriptions                # For tool-using environments
--components system_prompt tool_descriptions  # Both

When optimizing tool_descriptions, each tool's description becomes a separate optimizable component (tool_0_description, tool_1_description, etc.).
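
As a hedged illustration of that naming scheme (not the adapter's actual code), per-tool components could be built roughly like this, assuming each tool exposes its description via a docstring:

def extract_tool_components(tools) -> dict[str, str]:
    # Each tool's description becomes its own optimizable component,
    # keyed tool_0_description, tool_1_description, ...
    return {
        f"tool_{i}_description": (tool.__doc__ or "")
        for i, tool in enumerate(tools)
    }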

Rubric Changes: Feedback Support

A key non-breaking change enables reward functions to return both a score and textual feedback:

Before (still works):

def accuracy(completion, answer, **kwargs) -> float:
    return 1.0 if completion == answer else 0.0

New: Return feedback for better GEPA optimization

def accuracy_with_feedback(completion, answer, **kwargs):
    correct = completion == answer
    return {
        "score": 1.0 if correct else 0.0,
        "feedback": f"Expected: {answer}, Got: {completion}. {'✓ Correct!' if correct else '✗ Incorrect.'}"
    }

The feedback is collected via rubric.get_feedback(state) and used by GEPA's reflection model to understand why rollouts succeeded or failed. This enables more targeted prompt improvements.
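
A brief usage sketch of that flow (exact Rubric signatures and state keys may differ; reflection_record is a hypothetical name):

# After score_rollout() has populated state["feedbacks"], the aggregated
# feedback can be retrieved and handed to GEPA's reflection model.
feedback = rubric.get_feedback(state)
reflection_record = {"completion": completion, "feedback": feedback}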

Changes to Rubric class:

  • Added RewardResult TypedDict in types.py for type-safe {"score": float, "feedback": str} returns
  • Updated _parse_reward_result() to handle both float and dict returns
  • Added get_feedback(state) method to aggregate feedback from all reward functions
  • Feedback is stored in state["feedbacks"] during score_rollout() and score_group()
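
A minimal sketch of the float-or-dict handling described in this list; the actual method and return shape on Rubric may differ:

from typing import TypedDict, Union

class RewardResult(TypedDict):
    score: float
    feedback: str

def _parse_reward_result(result: Union[float, RewardResult]) -> tuple[float, str]:
    # Sketch only: normalize a reward function's return value into (score, feedback).
    if isinstance(result, dict):
        return float(result["score"]), str(result.get("feedback", ""))
    return float(result), ""

This keeps plain-float reward functions working unchanged while letting new functions attach feedback.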

Experiment Tracking

Built-in support for wandb and MLflow:

Track with wandb:

vf-gepa my-env --budget medium --use-wandb --wandb-project my-project

Track with MLflow:

vf-gepa my-env --budget medium --use-mlflow --mlflow-tracking-uri http://localhost:5000


CLA assistant check
All committers have signed the CLA.
