fix: tool responses printing for MultimodalConversableAgent handles multimodal content format
Why are these changes needed?
Fix for Autogen MultimodalConversableAgent Tool Response Issue
Problem
When using MultimodalConversableAgent with tool functions, the agent wraps tool responses in a multimodal format:
[{'type': 'text', 'text': 'actual response text'}]
However, the ToolResponseEvent in autogen expects content to be a simple string, causing validation errors:
ValidationError: Input should be a valid string [type=string_type, input_value=[{'type': 'text', 'text':...}], input_type=list]
Root Cause
The MultimodalConversableAgent formats all responses (including tool responses) in a multimodal format to support both text and images. But the event validation system expects tool responses to be simple strings.
Solution
Modified agent_events.py in the create_received_event_model function for "tool" responses to extract text from the multimodal format:
if role == "tool":
# Handle multimodal content format - extract text if content is a list of dicts
content = event.get('content')
if isinstance(content, list) and len(content) > 0 and isinstance(content[0], dict):
# Extract text from multimodal format [{'type': 'text', 'text': '...'}]
text_parts = []
for item in content:
if isinstance(item, dict) and item.get('type') == 'text':
text_parts.append(item.get('text', ''))
event['content'] = ''.join(text_parts)
This extracts the actual text content from the multimodal format before creating the ToolResponseEvent.
Related issue number
N/A
Checks
- [x] I've included any doc changes needed for https://docs.ag2.ai/. See https://docs.ag2.ai/latest/docs/contributor-guide/documentation/ to build and test documentation locally.
- [x] I've added tests (if relevant) corresponding to the changes introduced in this PR.
- [x] I've made sure all auto checks have passed.
@Parth220 could you run pre-commit? Thanks! https://docs.ag2.ai/latest/docs/contributor-guide/pre-commit/
Codecov Report
:x: Patch coverage is 14.28571% with 6 lines in your changes missing coverage. Please review.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| autogen/events/agent_events.py | 14.28% | 5 Missing and 1 partial :warning: |
| Files with missing lines | Coverage Δ | |
|---|---|---|
| autogen/events/agent_events.py | 95.98% <14.28%> (-1.66%) |
:arrow_down: |
... and 36 files with indirect coverage changes
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
- :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.
Code Review for PR #2043
Summary
This PR addresses a validation error when using MultimodalConversableAgent with tool functions. The agent wraps tool responses in multimodal format [{'type': 'text', 'text': '...'}], but ToolResponseEvent expects a simple string, causing Pydantic validation errors.
Changes Review
1. autogen/events/agent_events.py (Lines 245-253)
Added: Multimodal content extraction for tool responses
if role == "tool":
# Handle multimodal content format - extract text if content is a list of dicts
content = event.get('content')
if isinstance(content, list) and len(content) > 0 and isinstance(content[0], dict):
# Extract text from multimodal format [{'type': 'text', 'text': '...'}]
text_parts = []
for item in content:
if isinstance(item, dict) and item.get('type') == 'text':
text_parts.append(item.get('text', ''))
event['content'] = ''.join(text_parts)
Issues Found:
-
⚠️ Mutating Input Dictionary: The code directly modifies the
eventdictionary (event['content'] = ...), which could cause unexpected side effects if the same event object is used elsewhere. Consider usingdeepcopyor creating a new dict. -
⚠️ Empty String Fallback: Using
item.get('text', '')could silently hide missing text content. Consider logging a warning or raising an error for malformed content. -
⚠️ Incomplete Type Checking: The check only verifies the first element is a dict but assumes all elements are. If the list contains mixed types, this could fail.
-
⚠️ Silent Content Loss: Non-text items (like images in
{'type': 'image_url', ...}) are silently dropped. This might be intentional for tool responses, but should be documented.
2. autogen/agentchat/contrib/multimodal_conversable_agent.py (Lines 116-124)
Added: Tool response unpacking logic
# Fix tool response format for OpenAI API
fixed_messages = []
for msg in messages_with_b64_img:
if isinstance(msg, dict) and msg.get("role") == "tool" and "tool_responses" in msg:
# Unpack tool_responses to individual tool messages with tool_call_id
for tool_response in msg["tool_responses"]:
fixed_messages.append(tool_response)
else:
fixed_messages.append(msg)
Issues Found:
-
✅ Good: This properly unpacks tool responses for OpenAI API compatibility.
-
⚠️ Variable Naming:
messages_with_b64_imgis misleading since it's now being modified to also handle tool responses. Consider renaming to something likeprocessed_messages. -
⚠️ Line 128: The change from
messages[-1].pop('context', None)tomessages[-1].pop('context', None) if messages else Noneis good for preventing index errors, but this could be an unrelated fix that should be mentioned in the PR description.
Code Quality Assessment
Strengths:
- ✅ Addresses a real bug with clear symptom description
- ✅ Minimal, focused changes
- ✅ Includes helpful inline comments
- ✅ Properly handles empty lists in the safety check
Concerns:
- ❌ Mutating input data: Direct modification of the
eventdict is a code smell - ❌ No error handling: Malformed multimodal content could cause silent failures
- ⚠️ Mixed concerns: The multimodal_conversable_agent.py changes address two different issues
Test Coverage Issues
Major Concern: No Tests Added
The PR description states:
"I've added tests (if relevant) corresponding to the changes introduced in this PR."
However, no test files were modified in this PR. Codecov reports 14.28% patch coverage with 6 lines missing coverage.
Required Tests:
- ✅ Test case for tool response with multimodal format:
[{'type': 'text', 'text': 'result'}] - ✅ Test case for tool response with mixed content types
- ✅ Test case for empty multimodal list
- ✅ Test case for malformed multimodal content (missing 'text' key)
- ✅ Test case for unpacking tool_responses in MultimodalConversableAgent
- ✅ Integration test showing end-to-end tool call with MultimodalConversableAgent
Suggested Test Location:
-
test/events/test_agent_events.py- Add test toTestToolResponseEventclass -
test/agentchat/test_multimodal_integration.py- Add integration test
Security Concerns
Low Risk, but consider:
- Input validation: No checks for malicious content in multimodal payloads
- Type confusion: Mixed types in content list could cause unexpected behavior
Performance Considerations
- Minor overhead: Additional loop through content items for tool responses
-
Memory: Creating new
fixed_messageslist could be optimized for large message histories - Overall impact: Negligible for typical use cases
Recommendations
High Priority (Should Fix)
- Add comprehensive tests to achieve reasonable coverage (target 80%+)
- Avoid mutating input: Create a copy of the event dict before modification
- Add error handling: Validate multimodal content structure and provide clear error messages
Medium Priority (Should Consider)
- Document behavior: Add docstring explaining multimodal content extraction
- Type hints: Add type annotations to the new code sections
- Logging: Add debug logging for content transformation
Low Priority (Nice to Have)
- Refactor: Extract multimodal content parsing into a separate utility function
- Edge cases: Handle empty strings, None values, and nested structures
Suggested Code Improvements
For agent_events.py:
if role == "tool":
# Handle multimodal content format - extract text if content is a list of dicts
content = event.get('content')
if isinstance(content, list) and len(content) > 0:
# Extract text from multimodal format [{'type': 'text', 'text': '...'}]
text_parts = []
for item in content:
if not isinstance(item, dict):
continue # Skip non-dict items
if item.get('type') == 'text':
text_value = item.get('text')
if text_value is not None:
text_parts.append(str(text_value))
# Only modify if we found text content
if text_parts:
event = event.copy() # Don't mutate the original
event['content'] = ''.join(text_parts)
return ToolResponseEvent(**event, sender=sender.name, recipient=recipient.name, uuid=uuid)
Conclusion
Overall Assessment: Needs Work ⚠️
The PR addresses a legitimate issue with a reasonable approach, but has significant gaps:
- ❌ Critical: Missing test coverage - only 14.28%
- ⚠️ Important: Direct mutation of input data
- ⚠️ Important: No error handling for malformed content
- ℹ️ Minor: Code could be more robust and maintainable
Recommendation: Request changes
- Add comprehensive tests before merging
- Fix the input mutation issue
- Add basic error handling
Estimated Effort to Fix: 2-3 hours
- 1-2 hours for comprehensive tests
- 30 min for code improvements
- 30 min for documentation
Additional Notes
- The changes are merged from multiple branches, make sure all tests pass
- Consider adding a notebook example demonstrating the fix
- Pre-commit hooks need to be run (mentioned in comments)
Let me know if you need help writing the tests or making the recommended improvements!
Review generated by Claude Code
🔍 Claude Code Review - Re-review of PR #2043
Summary
This PR addresses a validation error when using MultimodalConversableAgent with tool functions. The agent wraps tool responses in multimodal format [{'type': 'text', 'text': '...'}], but ToolResponseEvent expects a simple string, causing Pydantic validation errors.
Previous Review Reference: A Claude review was posted on October 30, 2025. This is a re-review based on the latest changes.
✅ What's Working Well
- Clear Problem Definition: The PR description clearly articulates the issue with good examples
- Focused Solution: The fix is minimal and targeted at the specific problem
- Handles Edge Cases: The code checks for list type and dict structure before processing
-
Safe Processing: Only extracts text from items with
type == 'text', ignoring other content types
🚨 Critical Issues
1. Input Mutation Anti-Pattern
Location: autogen/events/agent_events.py:253
The code directly modifies the input event dictionary:
event['content'] = ''.join(text_parts)
Problem: This mutates shared state and can cause unexpected side effects if the event dict is used elsewhere.
Fix: Use copy or create a new dict:
from copy import copy
event = copy(event)
event['content'] = ''.join(text_parts)
2. Missing Test Coverage ❌
CodeCov Report: Only 14.28% patch coverage with 6 lines missing coverage
Despite the PR checklist claiming "I've added tests", no test files were modified. The new multimodal content extraction logic in agent_events.py:245-253 is completely untested.
Required Tests:
- ✅ Tool response with multimodal format:
[{'type': 'text', 'text': 'result'}] - ✅ Tool response with multiple text parts
- ✅ Tool response with mixed content types (text + image)
- ✅ Tool response with empty list
- ✅ Tool response with malformed content (missing 'text' key)
- ✅ Integration test: End-to-end tool call with MultimodalConversableAgent
Suggested Test Location: Add to test/events/test_agent_events.py in the TestToolResponseEvent class.
3. Incomplete Type Checking
Location: autogen/events/agent_events.py:247
if isinstance(content, list) and len(content) > 0 and isinstance(content[0], dict):
Problem: Only checks if the first element is a dict, but assumes all elements are dicts in the loop.
Fix:
for item in content:
if not isinstance(item, dict):
continue # Skip non-dict items gracefully
if item.get('type') == 'text':
# ...
⚠️ Important Issues
4. Silent Data Loss
Location: autogen/events/agent_events.py:252
text_parts.append(item.get('text', ''))
Problem: Using .get('text', '') silently returns empty string if 'text' key is missing. This masks data quality issues.
Recommendation: Consider logging a warning or being more explicit:
text_value = item.get('text')
if text_value is not None:
text_parts.append(str(text_value))
# Or: log a warning if text is missing
5. Inconsistent Variable Naming
Location: autogen/agentchat/contrib/multimodal_conversable_agent.py:138
messages_with_b64_img = message_formatter_pil_to_b64(...)
# ... later used for tool response fixing
Problem: Variable name messages_with_b64_img is misleading since it now handles both image formatting AND tool response unpacking.
Suggestion: Rename to processed_messages or add a comment explaining the dual purpose.
6. Unrelated Fix Mixed In
Location: autogen/agentchat/contrib/multimodal_conversable_agent.py:152
context=messages[-1].pop("context", None) if messages else None
Problem: This defensive check (if messages) wasn't mentioned in the PR description and appears to be fixing an unrelated potential IndexError.
Recommendation: Either mention this in the PR description or extract to a separate commit.
💡 Code Quality Suggestions
7. Extract to Helper Function
The multimodal content extraction logic (lines 245-253) could be extracted into a reusable helper function:
def extract_text_from_multimodal_content(content: Any) -> str:
"""Extract text from multimodal content format.
Args:
content: Content that may be in multimodal format [{'type': 'text', 'text': '...'}]
Returns:
Concatenated text content as string
"""
if not isinstance(content, list) or len(content) == 0:
return content
text_parts = []
for item in content:
if isinstance(item, dict) and item.get('type') == 'text':
text_value = item.get('text')
if text_value is not None:
text_parts.append(str(text_value))
return ''.join(text_parts) if text_parts else content
This would improve:
- Testability (unit test the extraction logic separately)
- Reusability (if other code needs similar extraction)
- Readability (clearer intent)
8. Add Type Hints
The modified functions lack type hints for the new code paths. Consider adding them for better IDE support and documentation.
9. Add Debug Logging
Consider adding debug-level logging to help diagnose multimodal content issues in production:
import logging
logger = logging.getLogger(__name__)
if role == "tool":
content = event.get('content')
if isinstance(content, list) and len(content) > 0:
logger.debug(f"Extracting text from multimodal tool response with {len(content)} items")
# ... extraction logic
🔒 Security Assessment
Status: ✅ No security concerns identified
- No injection vulnerabilities introduced
- No unsafe deserialization
- Content type validation is present
- No credential exposure
🏎️ Performance Assessment
Status: ✅ Negligible performance impact
- Additional loop through content items only for tool responses with multimodal content
- Most tool responses are simple strings and skip this processing
- Memory overhead is minimal (one new list for
text_parts)
📋 Checklist Status
Based on PR description claims vs. actual changes:
- [x] Doc changes included (N/A for this change)
- [❌] Tests added - CLAIMED BUT NOT PRESENT ⚠️
- [x] Auto checks passed (after pre-commit)
🎯 Recommendations
Must Fix Before Merge:
- Add comprehensive tests to achieve at least 80% coverage of new code paths
-
Fix input mutation - use
copy()to avoid modifying the originaleventdict - Improve type checking - don't assume all list items are dicts
Should Fix:
- Address silent data loss with
.get('text', '') - Update PR description to mention the
if messagesdefensive fix - Consider extracting to a helper function for better testability
Nice to Have:
- Add type hints for new code
- Add debug logging for production debugging
- Rename
messages_with_b64_imgto reflect dual purpose
📝 Suggested Code Fix
Here's a recommended implementation for agent_events.py:244-254:
if role == "tool":
# Handle multimodal content format - extract text if content is a list of dicts
content = event.get('content')
if isinstance(content, list) and len(content) > 0:
# Extract text from multimodal format [{'type': 'text', 'text': '...'}]
text_parts = []
for item in content:
if not isinstance(item, dict):
continue # Skip non-dict items
if item.get('type') == 'text':
text_value = item.get('text')
if text_value is not None:
text_parts.append(str(text_value))
if text_parts:
# Create a shallow copy to avoid mutating the original
from copy import copy
event = copy(event)
event['content'] = ''.join(text_parts)
return ToolResponseEvent(**event, sender=sender.name, recipient=recipient.name, uuid=uuid)
🎓 Testing Example
Add this test to test/events/test_agent_events.py:
def test_tool_response_multimodal_content(
uuid: UUID, sender: ConversableAgent, recipient: ConversableAgent
) -> None:
"""Test that tool responses with multimodal content are properly extracted."""
# Test multimodal format [{'type': 'text', 'text': '...'}]
event = {
"role": "tool",
"tool_responses": [
{"tool_call_id": "call_123", "role": "tool",
"content": [{"type": "text", "text": "Result: Success"}]}
],
"content": [{"type": "text", "text": "Result: Success"}],
}
actual = create_received_event_model(uuid=uuid, event=event, sender=sender, recipient=recipient)
assert isinstance(actual, ToolResponseEvent)
# Content should be extracted to string
assert actual.content == "Result: Success"
assert isinstance(actual.content, str)
🏁 Final Verdict
Status: ⚠️ CHANGES REQUESTED
The PR addresses a real bug with a reasonable approach, but has significant gaps:
- Critical: Missing test coverage (only 14.28%)
- Important: Input mutation anti-pattern
- Important: Incomplete type checking
Estimated Effort to Address: 2-3 hours
- 1-2 hours for comprehensive tests
- 30 min for code improvements
- 30 min for review updates
Review generated by Claude Code (Sonnet 4.5) on 2025-12-15