Add AG-UI Protocol Integration for Agent Evaluation
Summary
This PR adds support for evaluating agents that use the AG-UI (Agent User Interaction) protocol, enabling real-time evaluation of streaming agent interactions, including tool calls, multi-turn conversations, and full event-based message reconstruction.
What is AG-UI?
AG-UI is an event-based protocol for streaming agent-to-UI communication that uses typed events for messages, tool calls, and state synchronization. Popular agent frameworks supporting AG-UI include:
- LangGraph (LangChain)
- Google ADK (Agent Development Kit)
- Pydantic AI
- Mastra
🎯 Key Features
- **Core Event Processing** - Converts core AG-UI events into Ragas messages.
- **AG-UI Endpoint Integration** - Because all AG-UI-compliant agents emit a common event stream, they can be invoked directly as part of the eval.
- **Multi-Turn Conversation Support** - Supports multi-turn evals and tool-call evals.
📁 Files Added
Integration Code:
- `src/ragas/integrations/ag_ui.py` (1,283 lines) - Complete AG-UI integration
Tests:
- `tests/unit/integrations/test_ag_ui.py` (1,186 lines) - 33 tests covering all features
Examples:
- `examples/ragas_examples/ag_ui_agent_evals/` - Complete runnable example
  - `evals.py` (314 lines) - Evaluation script with two scenarios
  - `README.md` (314 lines) - Comprehensive documentation
  - `test_data/scientist_biographies.csv` - Factual correctness test cases
  - `test_data/weather_tool_calls.csv` - Tool call evaluation test cases
🧪 Testing
33 comprehensive unit tests covering:
- ✅ Basic event conversion and streaming reconstruction
- ✅ Metadata preservation across event types
- ✅ Tool call parsing and association with AI messages
- ✅ Multi-turn conversation handling
- ✅ Chunk event processing (both text and tool calls)
- ✅ Message snapshot conversion with type-based checking
- ✅ Error handling (malformed JSON, incomplete sequences, orphaned events)
- ✅ Role mapping (user → HumanMessage, assistant → AIMessage)
- ✅ FastAPI endpoint integration with mocked SSE responses
- ✅ MultiTurnSample support with conversation appending
- ✅ Retroactive tool call attachment for validation compliance
Test Coverage: All edge cases including invalid JSON, missing messages, event ordering issues, and validation requirements.
🔧 Technical Implementation
Key Classes:
- `AGUIEventCollector`: Stateful event accumulation with streaming reconstruction
  - Caches AG-UI imports for performance
  - Tracks context (`run_id`, `thread_id`, `step`) for metadata
  - Handles both streaming triads and chunk events
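For orientation, here is a trimmed-down sketch of the collector's state (illustrative only; field names approximate the real class in `src/ragas/integrations/ag_ui.py`):

```python
# Illustrative only: an approximation of the collector's internal state, not
# the actual class shipped in src/ragas/integrations/ag_ui.py.
from dataclasses import dataclass, field
from typing import List, Optional, Union

from ragas.messages import AIMessage, HumanMessage, ToolCall, ToolMessage


@dataclass
class EventCollectorSketch:
    # Context captured from lifecycle events, attached to messages as metadata.
    run_id: Optional[str] = None
    thread_id: Optional[str] = None
    step: Optional[str] = None

    # In-flight buffers for the message / tool call currently being streamed.
    text_chunks: List[str] = field(default_factory=list)
    pending_tool_calls: List[ToolCall] = field(default_factory=list)

    # Completed Ragas messages, in conversation order.
    messages: List[Union[HumanMessage, AIMessage, ToolMessage]] = field(
        default_factory=list
    )
```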
Event Processing Flow:
- Lifecycle events update context (run_id, thread_id, step)
- Text message events accumulate content chunks
- Tool call events accumulate args and create ToolCall objects
- Message end events create Ragas messages with pending tool calls
- Tool result events ensure the preceding AIMessage has `tool_calls` (a validation requirement)
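As a schematic of the text-message part of this flow (plain dicts stand in for the typed AG-UI event objects; event and field names are abbreviated):

```python
# Schematic only: plain dicts stand in for typed AG-UI events; the real
# integration consumes the event objects emitted by the agent's event stream.
from ragas.messages import AIMessage

events = [
    {"type": "TEXT_MESSAGE_START", "message_id": "m1"},
    {"type": "TEXT_MESSAGE_CONTENT", "message_id": "m1", "delta": "It is "},
    {"type": "TEXT_MESSAGE_CONTENT", "message_id": "m1", "delta": "sunny in Paris."},
    {"type": "TEXT_MESSAGE_END", "message_id": "m1"},
]

chunks: list[str] = []
messages: list[AIMessage] = []

for event in events:
    if event["type"] == "TEXT_MESSAGE_START":
        chunks = []                               # a new streamed message begins
    elif event["type"] == "TEXT_MESSAGE_CONTENT":
        chunks.append(event["delta"])             # accumulate streamed text
    elif event["type"] == "TEXT_MESSAGE_END":
        # The finished triad becomes a single Ragas AIMessage.
        messages.append(AIMessage(content="".join(chunks)))

assert messages[0].content == "It is sunny in Paris."
```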
Multi-Turn Processing:
- Converts Ragas messages → AG-UI messages for request payload
- Sends to endpoint and collects AG-UI events
- Converts events → new Ragas messages (AIMessage, ToolMessage only)
- Appends to conversation for iterative evaluation
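Roughly, one round trip looks like this (the HTTP/SSE plumbing is elided and the agent reply below is a stand-in):

```python
# Sketch of one multi-turn round trip; the POST to the AG-UI endpoint and the
# event parsing are elided (represented by the stand-in reply below).
from ragas.messages import AIMessage, HumanMessage, ToolMessage

conversation = [HumanMessage(content="What's the weather in Paris?")]

# 1. Ragas messages -> AG-UI-style role/content dicts for the request body.
role_for = {HumanMessage: "user", AIMessage: "assistant", ToolMessage: "tool"}
request_messages = [
    {"role": role_for[type(m)], "content": m.content} for m in conversation
]

# 2-3. (elided) POST request_messages to the endpoint, collect the streamed
#      AG-UI events, and convert them back into Ragas messages.
new_messages = [AIMessage(content="It's 18°C and sunny in Paris.")]  # stand-in

# 4. Append only the agent-side messages so the next turn, and the metrics,
#    see the full conversation history.
conversation.extend(new_messages)
```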
📖 Documentation
- Comprehensive docstrings for all public functions
- Module-level examples for common use cases
- Complete README in examples directory with:
  - Setup instructions linking to AG-UI quickstart
  - Usage examples with all CLI options
  - Expected output formats
  - Troubleshooting guide
  - Metric interpretation guide
🎓 References
- AG-UI Documentation: https://docs.ag-ui.com
- AG-UI Quickstart: https://docs.ag-ui.com/quickstart/applications
- Compatible frameworks: LangGraph, Google ADK, Pydantic AI, Mastra
Ready for review! All tests passing ✅ (33/33)
Looks great @contextablemark Thanks for the PR 🙌🏼
Please check the formatting stuff which fails the CI. Run `make run-ci` locally to check it all. Would you also mind adding a docs page for the integration in here as well?
Sure... sounds good. I also see the code quality issues that need addressing that are coming out in the check builds.
@anistark Please re-review. `make run-ci` should be passing now and I added "How-to" docs along with the .ipynb (and associated generated .md). Please let me know if anything else is needed.
@anistark It appears that there were a couple of additional issues that crept in once I added the examples. The build should be clean now.
Thanks for the update @contextablemark
We're refactoring our metrics approach from `LangchainLLMWrapper` to work with `InstructorLLM` via `llm_factory`. While your code update works with both, the doc shows the earlier approach. We've yet to write a detailed guide on the migration, but since we're adding this at such a stage, it would be great to have it in the newer structure to avoid it being updated again by next week. :)
Check out more info on the implementation changes in the new metrics collections approach: `/src/ragas/metrics/collections/`
Sure... I had tried using it initially when I saw the deprecation warning message, but had some issues - I'll take another look.
@anistark Looking into the refactoring raised some additional issues/questions regarding other steps in the workflow:

- **Support in the core evaluator:** It seems that `ragas.evaluate` doesn’t recognize collections metrics yet and still expects legacy `Metric` subclasses. Is there a plan in the works to bring collections metrics into the core evaluator?
- **Blended metrics:** Along similar lines, if a workflow needs both legacy and collections metrics, is there a recommended “bridge” pattern, or should we avoid mixing the two until the execution pipeline handles both families?
- **Manual evaluation path:** In the interim, is it acceptable to call `metric.ascore(...)` manually for the Instructor-based metrics and stitch the results together (i.e., handling all of the orchestration directly, roughly as sketched below)?
- **Documentation guidance:** Should any documentation mention the manual evaluation as an interim approach, hinting at something else to come?
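For the manual evaluation path, something along these lines is what I had in mind (rough sketch only; the keyword arguments are a guess, and `metric` would be whichever already-configured collections metric applies):

```python
# Rough sketch only: argument names may differ per metric; `metric` is assumed
# to be an already-configured collections metric instance.
import asyncio


async def score_row(metric, user_input: str, response: str):
    # Collections metrics expose an async ascore(...) entry point.
    return await metric.ascore(user_input=user_input, response=response)


# Usage (stitching results together manually):
# scores = [asyncio.run(score_row(metric, row["question"], row["answer"]))
#           for row in rows]
```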
Just trying to figure out whether it makes sense to wait until next week if some changes are imminent that will make the overall implementation easier.
- Mark
> **Support in the core evaluator:** It seems that `ragas.evaluate` doesn’t recognize collections metrics yet and still expects legacy `Metric` subclasses. Is there a plan in the works to bring collections metrics into the core evaluator?

`evaluate` will be deprecated once all metrics are migrated to collections.

> **Blended metrics:** Along similar lines, if a workflow needs both legacy and collections metrics, is there a recommended “bridge” pattern, or should we avoid mixing the two until the execution pipeline handles both families?

I think we can focus on collections going forward. We'll support the legacy `evaluate` till a certain version (undecided) and then remove it completely.

> **Manual evaluation path:** In the interim, is it acceptable to call `metric.ascore(...)` manually for the Instructor-based metrics and stitch the results together (i.e., handling all of the orchestration directly)?

While doing it manually is fine, it's better to align with the rest of it so we don't have to make changes again in a couple of weeks.

> **Documentation guidance:** Should any documentation mention the manual evaluation as an interim approach, hinting at something else to come?

If you want to do it manually, then yes. Otherwise, not required.
@anistark Thanks for the answers to my questions. I'm starting to think that my integration may be attempting to do too much; in particular, the `evaluate_ag_ui_agent` method has at its core `ragas.evaluate`, which is going away. And if this PR is reflective of the intended direction of the overall project, I may need to rethink my examples.
Do you think it would make sense for me to wait for the "dust to settle" and revisit the topic next week after more of the changes have been merged? If so, I'll move this PR back to draft.
Thanks, Mark
> Do you think it would make sense for me to wait for the "dust to settle" and revisit the topic next week after more of the changes have been merged? If so, I'll move this PR back to draft.

Sure, if that's what you want. Can park it till all migrations are done, with docs update.
Thanks... I'll keep an eye on the situation
Hi @anistark - I've updated my examples to use the collection metrics where appropriate. Please take another look.
Reverting to draft - the .ipynb notebook still needs to be updated.
@contextablemark We've released v0.4.0
Please check out and update your PR accordingly, so we can get it merged soon. :)
@anistark I've made some changes that hopefully bring the integration into alignment with 0.4.0. Please let me know what you think!
@anistark I did some refactoring and simplification to (I think) bring the integration in line with the "@experiment" paradigm (removing one third of the code in the process). Please take another look when you get a chance.
@anistark Any additional suggestions?
🎉