Add AG-UI Protocol Integration for Agent Evaluation
Summary
This PR adds support for evaluating agents that use the AG-UI (Agent User Interaction) protocol, enabling real-time evaluation of streaming agent interactions, including tool calls, multi-turn conversations, and full event-based message reconstruction.
What is AG-UI?
AG-UI is an event-based protocol for streaming agent-to-UI communication that uses typed events for messages, tool calls, and state synchronization. Popular agent frameworks supporting AG-UI include:
- LangGraph (LangChain)
- Google ADK (Agent Development Kit)
- Pydantic AI
- Mastra
🎯 Key Features
- **Core Event Processing** - Converts core AG-UI events into Ragas messages.
- **AG-UI Endpoint Integration** - Because all AG-UI-compliant agents emit a common event stream, they can be invoked directly as part of the eval.
- **Multi-Turn Conversation Support** - Supports multi-turn evals and tool-call evals.
📁 Files Added
Integration Code:
- `src/ragas/integrations/ag_ui.py` (1,283 lines) - Complete AG-UI integration
Tests:
- `tests/unit/integrations/test_ag_ui.py` (1,186 lines) - 33 tests covering all features
Examples:
- `examples/ragas_examples/ag_ui_agent_evals/` - Complete runnable example
  - `evals.py` (314 lines) - Evaluation script with two scenarios
  - `README.md` (314 lines) - Comprehensive documentation
  - `test_data/scientist_biographies.csv` - Factual correctness test cases
  - `test_data/weather_tool_calls.csv` - Tool call evaluation test cases
🧪 Testing
33 comprehensive unit tests covering:
- ✅ Basic event conversion and streaming reconstruction
- ✅ Metadata preservation across event types
- ✅ Tool call parsing and association with AI messages
- ✅ Multi-turn conversation handling
- ✅ Chunk event processing (both text and tool calls)
- ✅ Message snapshot conversion with type-based checking
- ✅ Error handling (malformed JSON, incomplete sequences, orphaned events)
- ✅ Role mapping (user → HumanMessage, assistant → AIMessage)
- ✅ FastAPI endpoint integration with mocked SSE responses
- ✅ MultiTurnSample support with conversation appending
- ✅ Retroactive tool call attachment for validation compliance
Test Coverage: All edge cases including invalid JSON, missing messages, event ordering issues, and validation requirements.
🔧 Technical Implementation
Key Classes:
- `AGUIEventCollector`: Stateful event accumulation with streaming reconstruction
  - Caches AG-UI imports for performance
  - Tracks context (`run_id`, `thread_id`, `step`) for metadata
  - Handles both streaming triads and chunk events
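For orientation, here is a trimmed-down sketch of the collector's state (illustrative only; field names approximate the real class in `src/ragas/integrations/ag_ui.py`):

```python
# Illustrative only: an approximation of the collector's internal state, not
# the actual class shipped in src/ragas/integrations/ag_ui.py.
from dataclasses import dataclass, field
from typing import List, Optional, Union

from ragas.messages import AIMessage, HumanMessage, ToolCall, ToolMessage


@dataclass
class EventCollectorSketch:
    # Context captured from lifecycle events, attached to messages as metadata.
    run_id: Optional[str] = None
    thread_id: Optional[str] = None
    step: Optional[str] = None

    # In-flight buffers for the message / tool call currently being streamed.
    text_chunks: List[str] = field(default_factory=list)
    pending_tool_calls: List[ToolCall] = field(default_factory=list)

    # Completed Ragas messages, in conversation order.
    messages: List[Union[HumanMessage, AIMessage, ToolMessage]] = field(
        default_factory=list
    )
```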
Event Processing Flow:
- Lifecycle events update context (run_id, thread_id, step)
- Text message events accumulate content chunks
- Tool call events accumulate args and create ToolCall objects
- Message end events create Ragas messages with pending tool calls
- Tool result events ensure the preceding AIMessage has `tool_calls` (a validation requirement)
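As a schematic of the text-message part of this flow (plain dicts stand in for the typed AG-UI event objects; event and field names are abbreviated):

```python
# Schematic only: plain dicts stand in for typed AG-UI events; the real
# integration consumes the event objects emitted by the agent's event stream.
from ragas.messages import AIMessage

events = [
    {"type": "TEXT_MESSAGE_START", "message_id": "m1"},
    {"type": "TEXT_MESSAGE_CONTENT", "message_id": "m1", "delta": "It is "},
    {"type": "TEXT_MESSAGE_CONTENT", "message_id": "m1", "delta": "sunny in Paris."},
    {"type": "TEXT_MESSAGE_END", "message_id": "m1"},
]

chunks: list[str] = []
messages: list[AIMessage] = []

for event in events:
    if event["type"] == "TEXT_MESSAGE_START":
        chunks = []                               # a new streamed message begins
    elif event["type"] == "TEXT_MESSAGE_CONTENT":
        chunks.append(event["delta"])             # accumulate streamed text
    elif event["type"] == "TEXT_MESSAGE_END":
        # The finished triad becomes a single Ragas AIMessage.
        messages.append(AIMessage(content="".join(chunks)))

assert messages[0].content == "It is sunny in Paris."
```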
Multi-Turn Processing:
- Converts Ragas messages → AG-UI messages for request payload
- Sends to endpoint and collects AG-UI events
- Converts events → new Ragas messages (AIMessage, ToolMessage only)
- Appends to conversation for iterative evaluation
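Roughly, one round trip looks like this (the HTTP/SSE plumbing is elided and the agent reply below is a stand-in):

```python
# Sketch of one multi-turn round trip; the POST to the AG-UI endpoint and the
# event parsing are elided (represented by the stand-in reply below).
from ragas.messages import AIMessage, HumanMessage, ToolMessage

conversation = [HumanMessage(content="What's the weather in Paris?")]

# 1. Ragas messages -> AG-UI-style role/content dicts for the request body.
role_for = {HumanMessage: "user", AIMessage: "assistant", ToolMessage: "tool"}
request_messages = [
    {"role": role_for[type(m)], "content": m.content} for m in conversation
]

# 2-3. (elided) POST request_messages to the endpoint, collect the streamed
#      AG-UI events, and convert them back into Ragas messages.
new_messages = [AIMessage(content="It's 18°C and sunny in Paris.")]  # stand-in

# 4. Append only the agent-side messages so the next turn, and the metrics,
#    see the full conversation history.
conversation.extend(new_messages)
```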
📖 Documentation
- Comprehensive docstrings for all public functions
- Module-level examples for common use cases
- Complete README in examples directory with:
  - Setup instructions linking to AG-UI quickstart
  - Usage examples with all CLI options
  - Expected output formats
  - Troubleshooting guide
  - Metric interpretation guide
🎓 References
- AG-UI Documentation: https://docs.ag-ui.com
- AG-UI Quickstart: https://docs.ag-ui.com/quickstart/applications
- Compatible frameworks: LangGraph, Google ADK, Pydantic AI, Mastra
Ready for review! All tests passing ✅ (33/33)
Looks great @contextablemark Thanks for the PR 🙌🏼
Please check the formatting stuff which fails the CI. Run `make run-ci` locally to check it all. Would you also mind adding a docs page for the integration in here as well?
Sure... sounds good. I also see the code quality issues that need addressing that are coming out in the check builds.
@anistark Please re-review. `make run-ci` should be passing now and I added "How-to" docs along with the .ipynb (and associated generated .md). Please let me know if anything else is needed.
@anistark It appears that there were a couple of additional issues that crept in once I added the examples. The build should be clean now.
Thanks for the update @contextablemark
We're refactoring our metrics approach from `LangchainLLMWrapper` to work with `InstructorLLM` via `llm_factory`. While your code update works with both, the doc shows the earlier approach. We've yet to write a detailed guide on the migration, but since we're adding this at such a stage, it would be great to have it in the newer structure to avoid it being updated again by next week. :)
Check out more info on the implementation changes in the new metrics collections approach: `/src/ragas/metrics/collections/`
Sure... I had tried using it initially when I saw the deprecation warning message, but had some issues - I'll take another look.
@anistark Looking into the refactoring raised some additional issues/questions regarding other steps in the workflow:

- **Support in the core evaluator:** It seems that `ragas.evaluate` doesn’t recognize collections metrics yet and still expects legacy `Metric` subclasses. Is there a plan in the works to bring collections metrics into the core evaluator?
- **Blended metrics:** Along similar lines, if a workflow needs both legacy and collections metrics, is there a recommended “bridge” pattern, or should we avoid mixing the two until the execution pipeline handles both families?
- **Manual evaluation path:** In the interim, is it acceptable to call `metric.ascore(...)` manually for the Instructor-based metrics and stitch the results together (i.e., handling all of the orchestration directly, roughly as sketched below)?
- **Documentation guidance:** Should any documentation mention the manual evaluation as an interim approach, hinting at something else to come?
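For the manual evaluation path, something along these lines is what I had in mind (rough sketch only; the keyword arguments are a guess, and `metric` would be whichever already-configured collections metric applies):

```python
# Rough sketch only: argument names may differ per metric; `metric` is assumed
# to be an already-configured collections metric instance.
import asyncio


async def score_row(metric, user_input: str, response: str):
    # Collections metrics expose an async ascore(...) entry point.
    return await metric.ascore(user_input=user_input, response=response)


# Usage (stitching results together manually):
# scores = [asyncio.run(score_row(metric, row["question"], row["answer"]))
#           for row in rows]
```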
Just trying to figure out whether it makes sense to wait until next week if some changes are imminent that will make the overall implementation easier.
- Mark
> **Support in the core evaluator:** It seems that `ragas.evaluate` doesn’t recognize collections metrics yet and still expects legacy `Metric` subclasses. Is there a plan in the works to bring collections metrics into the core evaluator?

`evaluate` will be deprecated once all metrics are migrated to collections.

> **Blended metrics:** Along similar lines, if a workflow needs both legacy and collections metrics, is there a recommended “bridge” pattern, or should we avoid mixing the two until the execution pipeline handles both families?

I think we can focus on collections going forward. We'll support the legacy `evaluate` till a certain version (undecided) and then remove it completely.

> **Manual evaluation path:** In the interim, is it acceptable to call `metric.ascore(...)` manually for the Instructor-based metrics and stitch the results together (i.e., handling all of the orchestration directly)?

While doing it manually is fine, it's better to align with the rest of it so we don't have to make changes again in a couple of weeks.

> **Documentation guidance:** Should any documentation mention the manual evaluation as an interim approach, hinting at something else to come?

If you want to do it manually, then yes. Otherwise, not required.
@anistark Thanks for the answers to my questions. I'm starting to think that my integration may be attempting to do too much; in particular, the `evaluate_ag_ui_agent` method has at its core `ragas.evaluate`, which is going away. And if this PR is reflective of the intended direction of the overall project, I may need to rethink my examples.
Do you think it would make sense for me to wait for the "dust to settle" and revisit the topic next week after more of the changes have been merged? If so, I'll move this PR back to draft.
Thanks, Mark
> Do you think it would make sense for me to wait for the "dust to settle" and revisit the topic next week after more of the changes have been merged? If so, I'll move this PR back to draft.

Sure, if that's what you want. Can park it till all migrations are done, with docs update.
Thanks... I'll keep an eye on the situation
Hi @anistark - I've updated my examples to use the collection metrics where appropriate. Please take another look.
Reverting to draft - the .ipynb notebook still needs to be updated.
@contextablemark We've released v0.4.0
Please check out and update your PR accordingly, so we can get it merged soon. :)
@anistark I've made some changes that hopefully bring the integration into alignment with 0.4.0. Please let me know what you think!
@anistark I did some refactoring and simplification to (I think) bring the integration in line with the "@experiment" paradigm (removing one third of the code in the process). Please take another look when you get a chance.
@anistark Any additional suggestions?
🎉