feat: Add GEPA prompt optimization (`vf-gepa`)
Description
Note: This feature was originally developed by Zapier for internal use with Verifiers environments. We're excited to contribute it back to the open-source project.
This PR adds GEPA (Genetic-Pareto) integration to Verifiers: an automatic prompt optimization system that improves environment prompts through reflection-based evolution.
GEPA works by:
- Testing current prompts on training examples
- Collecting rich feedback from rubric evaluations
- Using an LLM to reflect on failures and propose improved prompts
- Iteratively refining the prompt based on that reflection (a simplified sketch follows)
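For intuition, here is a deliberately simplified sketch of that loop. It is illustrative only: the helpers (`evaluate`, `reflect`) are placeholders rather than the GEPA API, and it shows greedy refinement where real GEPA maintains a Pareto pool of candidates.

```python
from typing import Callable

# Illustrative-only sketch of a reflection-based prompt optimization loop.
# `evaluate` scores a prompt on training examples and returns (score, rubric feedback);
# `reflect` asks an LLM to propose an improved prompt given that feedback.
def reflective_prompt_search(
    seed_prompt: str,
    evaluate: Callable[[str], tuple[float, str]],
    reflect: Callable[[str, str], str],
    num_candidates: int,
) -> str:
    best_prompt = seed_prompt
    best_score, feedback = evaluate(seed_prompt)
    for _ in range(num_candidates):
        candidate = reflect(best_prompt, feedback)   # propose a revision from feedback
        score, feedback = evaluate(candidate)        # re-score on training examples
        if score > best_score:                       # keep only improvements (greedy)
            best_prompt, best_score = candidate, score
    return best_prompt
```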
New CLI Command: `vf-gepa`

```bash
# Optimize the system prompt with the medium budget (~12 candidates)
vf-gepa wordle --budget medium

# Optimize both the system prompt and tool descriptions
vf-gepa wiki-search --budget heavy --components system_prompt tool_descriptions

# Custom configuration
vf-gepa my-env --max-metric-calls 1000 -n 100 --num-val 30 -m gpt-5-mini
```
Results are saved to ./gepa_results/<env_id>/<run_id>/ including optimized components, original components for comparison, and optimization metrics.
Type of Change
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Documentation update
- [ ] Test improvement
Testing
- [x] All existing tests pass when running `uv run pytest` locally.
- [x] New tests have been added to cover the changes
- [x] Tested on `wordle`, `gsm8k`, and `tool_test` (w/ `--components tool_descriptions`) environments (e.g. `gemini-2.5-flash` on a 5-min GEPA run):
```
================================================================================
GEPA OPTIMIZATION COMPLETE
================================================================================
Best validation score: 0.900
Initial validation score: 0.600
Improvement: 0.300
Total candidates fully explored: 4
```

Second attempt:

```
================================================================================
GEPA OPTIMIZATION COMPLETE
================================================================================
Best validation score: 1.000
Initial validation score: 0.400
Improvement: 0.600
Total candidates fully explored: 2
```
Checklist
- [x] My code follows the style guidelines of this project as outlined in AGENTS.md
- [x] I have performed a self-review of my own code
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] Any dependent changes have been merged and published
Additional Notes
How GEPA Works
The integration consists of several key components:
1. GEPAAdapter (`verifiers/gepa/adapter.py`)
Bridges Verifiers environments with GEPA's optimization protocol (a rough sketch follows this list):
- Component extraction/injection: Extracts optimizable text (system prompts, tool descriptions) and injects optimized versions back into environments
- Evaluation: Runs rollouts and collects scores using the environment's rubric
- Reflective dataset generation: Converts rubric feedback into structured reflection data for GEPA
- Tool-aware proposal: Uses specialized templates for tool description optimization that include tool names and parameter schemas
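As a mental model only (the method names and signatures below are hypothetical, not the real `GEPAAdapter` interface), the adapter's responsibilities look roughly like this:

```python
from dataclasses import dataclass

# Hypothetical sketch of the adapter's responsibilities; not the actual GEPAAdapter API.
@dataclass
class AdapterSketch:
    env: "Environment"  # a Verifiers environment (type name assumed for illustration)

    def extract_components(self) -> dict[str, str]:
        # Pull out the optimizable text, e.g. the system prompt.
        return {"system_prompt": self.env.system_prompt}

    def inject_components(self, components: dict[str, str]) -> None:
        # Write an optimized candidate back into the environment before evaluation.
        self.env.system_prompt = components["system_prompt"]

    def evaluate(self, examples) -> tuple[float, list[str]]:
        # Run rollouts, score them with the environment's rubric, and keep the
        # textual feedback so GEPA's reflection model can see why rollouts failed.
        scores, feedbacks = [], []
        for example in examples:
            state = self.env.rollout(example)                     # hypothetical call
            scores.append(self.env.rubric.score_rollout(state))   # hypothetical signature
            feedbacks.append(self.env.rubric.get_feedback(state))
        return sum(scores) / max(len(scores), 1), feedbacks
```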
2. Budget Modes
Three preset budgets control optimization intensity:
- `light` (~6 candidates): Quick iteration, ~30-60 min
- `medium` (~12 candidates): Balanced exploration, ~1-2 hours
- `heavy` (~18 candidates): Thorough optimization, ~2-4 hours
3. Component Selection
GEPA can optimize multiple components:
```bash
--components system_prompt                    # Default
--components tool_descriptions                # For tool-using environments
--components system_prompt tool_descriptions  # Both
```
When optimizing `tool_descriptions`, each tool's description becomes a separate optimizable component (`tool_0_description`, `tool_1_description`, etc.).
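To make the naming concrete, a flattening step along these lines (the helper and the `tool.description` attribute are assumptions for illustration) yields one component per tool:

```python
# Hypothetical illustration: flatten tool descriptions into separately named components.
def tool_components(tools) -> dict[str, str]:
    # Each tool's description becomes its own optimizable component,
    # keyed tool_0_description, tool_1_description, ...
    return {f"tool_{i}_description": tool.description for i, tool in enumerate(tools)}
```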
Rubric Changes: Feedback Support
A key non-breaking change enables reward functions to return both a score and textual feedback:
Before (still works):

```python
def accuracy(completion, answer, **kwargs) -> float:
    return 1.0 if completion == answer else 0.0
```

New: return both a score and feedback for better GEPA optimization:

```python
def accuracy_with_feedback(completion, answer, **kwargs):
    correct = completion == answer
    return {
        "score": 1.0 if correct else 0.0,
        "feedback": f"Expected: {answer}, Got: {completion}. {'✓ Correct!' if correct else '✗ Incorrect.'}",
    }
```
The feedback is collected via rubric.get_feedback(state) and used by GEPA's reflection model to understand why rollouts succeeded or failed. This enables more targeted prompt improvements.
Changes to Rubric class:
- Added `RewardResult` TypedDict in `types.py` for type-safe `{"score": float, "feedback": str}` returns
- Updated `_parse_reward_result()` to handle both float and dict returns (a sketch of this follows the list)
- Added `get_feedback(state)` method to aggregate feedback from all reward functions
- Feedback is stored in `state["feedbacks"]` during `score_rollout()` and `score_group()`
Experiment Tracking
Built-in support for wandb and MLflow:
Track with wandb:

```bash
vf-gepa my-env --budget medium --use-wandb --wandb-project my-project
```

Track with MLflow:

```bash
vf-gepa my-env --budget medium --use-mlflow --mlflow-tracking-uri http://localhost:5000
```