ag2 icon indicating copy to clipboard operation
ag2 copied to clipboard

fix: tool responses printing for MultimodalConversableAgent handles multimodal content format

Open Parth220 opened this issue 5 months ago • 4 comments

Why are these changes needed?

Fix for Autogen MultimodalConversableAgent Tool Response Issue

Problem

When using MultimodalConversableAgent with tool functions, the agent wraps tool responses in a multimodal format:

[{'type': 'text', 'text': 'actual response text'}]

However, the ToolResponseEvent in autogen expects content to be a simple string, causing validation errors:

ValidationError: Input should be a valid string [type=string_type, input_value=[{'type': 'text', 'text':...}], input_type=list]

Root Cause

The MultimodalConversableAgent formats all responses (including tool responses) in a multimodal format to support both text and images. But the event validation system expects tool responses to be simple strings.

Solution

Modified agent_events.py in the create_received_event_model function for "tool" responses to extract text from the multimodal format:

if role == "tool":
    # Handle multimodal content format - extract text if content is a list of dicts
    content = event.get('content')
    if isinstance(content, list) and len(content) > 0 and isinstance(content[0], dict):
        # Extract text from multimodal format [{'type': 'text', 'text': '...'}]
        text_parts = []
        for item in content:
            if isinstance(item, dict) and item.get('type') == 'text':
                text_parts.append(item.get('text', ''))
        event['content'] = ''.join(text_parts)

This extracts the actual text content from the multimodal format before creating the ToolResponseEvent.

Related issue number

N/A

Checks

  • [x] I've included any doc changes needed for https://docs.ag2.ai/. See https://docs.ag2.ai/latest/docs/contributor-guide/documentation/ to build and test documentation locally.
  • [x] I've added tests (if relevant) corresponding to the changes introduced in this PR.
  • [x] I've made sure all auto checks have passed.

Parth220 avatar Aug 18 '25 02:08 Parth220

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Aug 18 '25 02:08 CLAassistant

@Parth220 could you run pre-commit? Thanks! https://docs.ag2.ai/latest/docs/contributor-guide/pre-commit/

qingyun-wu avatar Aug 19 '25 19:08 qingyun-wu

Codecov Report

:x: Patch coverage is 14.28571% with 6 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
autogen/events/agent_events.py 14.28% 5 Missing and 1 partial :warning:
Files with missing lines Coverage Δ
autogen/events/agent_events.py 95.98% <14.28%> (-1.66%) :arrow_down:

... and 36 files with indirect coverage changes

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

codecov[bot] avatar Aug 30 '25 20:08 codecov[bot]

Code Review for PR #2043

Summary

This PR addresses a validation error when using MultimodalConversableAgent with tool functions. The agent wraps tool responses in multimodal format [{'type': 'text', 'text': '...'}], but ToolResponseEvent expects a simple string, causing Pydantic validation errors.


Changes Review

1. autogen/events/agent_events.py (Lines 245-253)

Added: Multimodal content extraction for tool responses

if role == "tool":
    # Handle multimodal content format - extract text if content is a list of dicts
    content = event.get('content')
    if isinstance(content, list) and len(content) > 0 and isinstance(content[0], dict):
        # Extract text from multimodal format [{'type': 'text', 'text': '...'}]
        text_parts = []
        for item in content:
            if isinstance(item, dict) and item.get('type') == 'text':
                text_parts.append(item.get('text', ''))
        event['content'] = ''.join(text_parts)

Issues Found:

  1. ⚠️ Mutating Input Dictionary: The code directly modifies the event dictionary (event['content'] = ...), which could cause unexpected side effects if the same event object is used elsewhere. Consider using deepcopy or creating a new dict.

  2. ⚠️ Empty String Fallback: Using item.get('text', '') could silently hide missing text content. Consider logging a warning or raising an error for malformed content.

  3. ⚠️ Incomplete Type Checking: The check only verifies the first element is a dict but assumes all elements are. If the list contains mixed types, this could fail.

  4. ⚠️ Silent Content Loss: Non-text items (like images in {'type': 'image_url', ...}) are silently dropped. This might be intentional for tool responses, but should be documented.

2. autogen/agentchat/contrib/multimodal_conversable_agent.py (Lines 116-124)

Added: Tool response unpacking logic

# Fix tool response format for OpenAI API
fixed_messages = []
for msg in messages_with_b64_img:
    if isinstance(msg, dict) and msg.get("role") == "tool" and "tool_responses" in msg:
        # Unpack tool_responses to individual tool messages with tool_call_id
        for tool_response in msg["tool_responses"]:
            fixed_messages.append(tool_response)
    else:
        fixed_messages.append(msg)

Issues Found:

  1. ✅ Good: This properly unpacks tool responses for OpenAI API compatibility.

  2. ⚠️ Variable Naming: messages_with_b64_img is misleading since it's now being modified to also handle tool responses. Consider renaming to something like processed_messages.

  3. ⚠️ Line 128: The change from messages[-1].pop('context', None) to messages[-1].pop('context', None) if messages else None is good for preventing index errors, but this could be an unrelated fix that should be mentioned in the PR description.


Code Quality Assessment

Strengths:

  • ✅ Addresses a real bug with clear symptom description
  • ✅ Minimal, focused changes
  • ✅ Includes helpful inline comments
  • ✅ Properly handles empty lists in the safety check

Concerns:

  • Mutating input data: Direct modification of the event dict is a code smell
  • No error handling: Malformed multimodal content could cause silent failures
  • ⚠️ Mixed concerns: The multimodal_conversable_agent.py changes address two different issues

Test Coverage Issues

Major Concern: No Tests Added

The PR description states:

"I've added tests (if relevant) corresponding to the changes introduced in this PR."

However, no test files were modified in this PR. Codecov reports 14.28% patch coverage with 6 lines missing coverage.

Required Tests:

  1. ✅ Test case for tool response with multimodal format: [{'type': 'text', 'text': 'result'}]
  2. ✅ Test case for tool response with mixed content types
  3. ✅ Test case for empty multimodal list
  4. ✅ Test case for malformed multimodal content (missing 'text' key)
  5. ✅ Test case for unpacking tool_responses in MultimodalConversableAgent
  6. ✅ Integration test showing end-to-end tool call with MultimodalConversableAgent

Suggested Test Location:

  • test/events/test_agent_events.py - Add test to TestToolResponseEvent class
  • test/agentchat/test_multimodal_integration.py - Add integration test

Security Concerns

Low Risk, but consider:

  • Input validation: No checks for malicious content in multimodal payloads
  • Type confusion: Mixed types in content list could cause unexpected behavior

Performance Considerations

  • Minor overhead: Additional loop through content items for tool responses
  • Memory: Creating new fixed_messages list could be optimized for large message histories
  • Overall impact: Negligible for typical use cases

Recommendations

High Priority (Should Fix)

  1. Add comprehensive tests to achieve reasonable coverage (target 80%+)
  2. Avoid mutating input: Create a copy of the event dict before modification
  3. Add error handling: Validate multimodal content structure and provide clear error messages

Medium Priority (Should Consider)

  1. Document behavior: Add docstring explaining multimodal content extraction
  2. Type hints: Add type annotations to the new code sections
  3. Logging: Add debug logging for content transformation

Low Priority (Nice to Have)

  1. Refactor: Extract multimodal content parsing into a separate utility function
  2. Edge cases: Handle empty strings, None values, and nested structures

Suggested Code Improvements

For agent_events.py:

if role == "tool":
    # Handle multimodal content format - extract text if content is a list of dicts
    content = event.get('content')
    if isinstance(content, list) and len(content) > 0:
        # Extract text from multimodal format [{'type': 'text', 'text': '...'}]
        text_parts = []
        for item in content:
            if not isinstance(item, dict):
                continue  # Skip non-dict items
            if item.get('type') == 'text':
                text_value = item.get('text')
                if text_value is not None:
                    text_parts.append(str(text_value))
        # Only modify if we found text content
        if text_parts:
            event = event.copy()  # Don't mutate the original
            event['content'] = ''.join(text_parts)
    return ToolResponseEvent(**event, sender=sender.name, recipient=recipient.name, uuid=uuid)

Conclusion

Overall Assessment: Needs Work ⚠️

The PR addresses a legitimate issue with a reasonable approach, but has significant gaps:

  1. Critical: Missing test coverage - only 14.28%
  2. ⚠️ Important: Direct mutation of input data
  3. ⚠️ Important: No error handling for malformed content
  4. ℹ️ Minor: Code could be more robust and maintainable

Recommendation: Request changes

  • Add comprehensive tests before merging
  • Fix the input mutation issue
  • Add basic error handling

Estimated Effort to Fix: 2-3 hours

  • 1-2 hours for comprehensive tests
  • 30 min for code improvements
  • 30 min for documentation

Additional Notes

  • The changes are merged from multiple branches, make sure all tests pass
  • Consider adding a notebook example demonstrating the fix
  • Pre-commit hooks need to be run (mentioned in comments)

Let me know if you need help writing the tests or making the recommended improvements!


Review generated by Claude Code

github-actions[bot] avatar Oct 30 '25 18:10 github-actions[bot]

🔍 Claude Code Review - Re-review of PR #2043

Summary

This PR addresses a validation error when using MultimodalConversableAgent with tool functions. The agent wraps tool responses in multimodal format [{'type': 'text', 'text': '...'}], but ToolResponseEvent expects a simple string, causing Pydantic validation errors.

Previous Review Reference: A Claude review was posted on October 30, 2025. This is a re-review based on the latest changes.


✅ What's Working Well

  1. Clear Problem Definition: The PR description clearly articulates the issue with good examples
  2. Focused Solution: The fix is minimal and targeted at the specific problem
  3. Handles Edge Cases: The code checks for list type and dict structure before processing
  4. Safe Processing: Only extracts text from items with type == 'text', ignoring other content types

🚨 Critical Issues

1. Input Mutation Anti-Pattern

Location: autogen/events/agent_events.py:253

The code directly modifies the input event dictionary:

event['content'] = ''.join(text_parts)

Problem: This mutates shared state and can cause unexpected side effects if the event dict is used elsewhere.

Fix: Use copy or create a new dict:

from copy import copy
event = copy(event)
event['content'] = ''.join(text_parts)

2. Missing Test Coverage

CodeCov Report: Only 14.28% patch coverage with 6 lines missing coverage

Despite the PR checklist claiming "I've added tests", no test files were modified. The new multimodal content extraction logic in agent_events.py:245-253 is completely untested.

Required Tests:

  • ✅ Tool response with multimodal format: [{'type': 'text', 'text': 'result'}]
  • ✅ Tool response with multiple text parts
  • ✅ Tool response with mixed content types (text + image)
  • ✅ Tool response with empty list
  • ✅ Tool response with malformed content (missing 'text' key)
  • ✅ Integration test: End-to-end tool call with MultimodalConversableAgent

Suggested Test Location: Add to test/events/test_agent_events.py in the TestToolResponseEvent class.

3. Incomplete Type Checking

Location: autogen/events/agent_events.py:247

if isinstance(content, list) and len(content) > 0 and isinstance(content[0], dict):

Problem: Only checks if the first element is a dict, but assumes all elements are dicts in the loop.

Fix:

for item in content:
    if not isinstance(item, dict):
        continue  # Skip non-dict items gracefully
    if item.get('type') == 'text':
        # ...

⚠️ Important Issues

4. Silent Data Loss

Location: autogen/events/agent_events.py:252

text_parts.append(item.get('text', ''))

Problem: Using .get('text', '') silently returns empty string if 'text' key is missing. This masks data quality issues.

Recommendation: Consider logging a warning or being more explicit:

text_value = item.get('text')
if text_value is not None:
    text_parts.append(str(text_value))
# Or: log a warning if text is missing

5. Inconsistent Variable Naming

Location: autogen/agentchat/contrib/multimodal_conversable_agent.py:138

messages_with_b64_img = message_formatter_pil_to_b64(...)
# ... later used for tool response fixing

Problem: Variable name messages_with_b64_img is misleading since it now handles both image formatting AND tool response unpacking.

Suggestion: Rename to processed_messages or add a comment explaining the dual purpose.

6. Unrelated Fix Mixed In

Location: autogen/agentchat/contrib/multimodal_conversable_agent.py:152

context=messages[-1].pop("context", None) if messages else None

Problem: This defensive check (if messages) wasn't mentioned in the PR description and appears to be fixing an unrelated potential IndexError.

Recommendation: Either mention this in the PR description or extract to a separate commit.


💡 Code Quality Suggestions

7. Extract to Helper Function

The multimodal content extraction logic (lines 245-253) could be extracted into a reusable helper function:

def extract_text_from_multimodal_content(content: Any) -> str:
    """Extract text from multimodal content format.
    
    Args:
        content: Content that may be in multimodal format [{'type': 'text', 'text': '...'}]
    
    Returns:
        Concatenated text content as string
    """
    if not isinstance(content, list) or len(content) == 0:
        return content
    
    text_parts = []
    for item in content:
        if isinstance(item, dict) and item.get('type') == 'text':
            text_value = item.get('text')
            if text_value is not None:
                text_parts.append(str(text_value))
    
    return ''.join(text_parts) if text_parts else content

This would improve:

  • Testability (unit test the extraction logic separately)
  • Reusability (if other code needs similar extraction)
  • Readability (clearer intent)

8. Add Type Hints

The modified functions lack type hints for the new code paths. Consider adding them for better IDE support and documentation.

9. Add Debug Logging

Consider adding debug-level logging to help diagnose multimodal content issues in production:

import logging
logger = logging.getLogger(__name__)

if role == "tool":
    content = event.get('content')
    if isinstance(content, list) and len(content) > 0:
        logger.debug(f"Extracting text from multimodal tool response with {len(content)} items")
        # ... extraction logic

🔒 Security Assessment

Status:No security concerns identified

  • No injection vulnerabilities introduced
  • No unsafe deserialization
  • Content type validation is present
  • No credential exposure

🏎️ Performance Assessment

Status:Negligible performance impact

  • Additional loop through content items only for tool responses with multimodal content
  • Most tool responses are simple strings and skip this processing
  • Memory overhead is minimal (one new list for text_parts)

📋 Checklist Status

Based on PR description claims vs. actual changes:

  • [x] Doc changes included (N/A for this change)
  • [❌] Tests added - CLAIMED BUT NOT PRESENT ⚠️
  • [x] Auto checks passed (after pre-commit)

🎯 Recommendations

Must Fix Before Merge:

  1. Add comprehensive tests to achieve at least 80% coverage of new code paths
  2. Fix input mutation - use copy() to avoid modifying the original event dict
  3. Improve type checking - don't assume all list items are dicts

Should Fix:

  1. Address silent data loss with .get('text', '')
  2. Update PR description to mention the if messages defensive fix
  3. Consider extracting to a helper function for better testability

Nice to Have:

  1. Add type hints for new code
  2. Add debug logging for production debugging
  3. Rename messages_with_b64_img to reflect dual purpose

📝 Suggested Code Fix

Here's a recommended implementation for agent_events.py:244-254:

if role == "tool":
    # Handle multimodal content format - extract text if content is a list of dicts
    content = event.get('content')
    if isinstance(content, list) and len(content) > 0:
        # Extract text from multimodal format [{'type': 'text', 'text': '...'}]
        text_parts = []
        for item in content:
            if not isinstance(item, dict):
                continue  # Skip non-dict items
            if item.get('type') == 'text':
                text_value = item.get('text')
                if text_value is not None:
                    text_parts.append(str(text_value))
        
        if text_parts:
            # Create a shallow copy to avoid mutating the original
            from copy import copy
            event = copy(event)
            event['content'] = ''.join(text_parts)
    
    return ToolResponseEvent(**event, sender=sender.name, recipient=recipient.name, uuid=uuid)

🎓 Testing Example

Add this test to test/events/test_agent_events.py:

def test_tool_response_multimodal_content(
    uuid: UUID, sender: ConversableAgent, recipient: ConversableAgent
) -> None:
    """Test that tool responses with multimodal content are properly extracted."""
    # Test multimodal format [{'type': 'text', 'text': '...'}]
    event = {
        "role": "tool",
        "tool_responses": [
            {"tool_call_id": "call_123", "role": "tool", 
             "content": [{"type": "text", "text": "Result: Success"}]}
        ],
        "content": [{"type": "text", "text": "Result: Success"}],
    }
    
    actual = create_received_event_model(uuid=uuid, event=event, sender=sender, recipient=recipient)
    assert isinstance(actual, ToolResponseEvent)
    
    # Content should be extracted to string
    assert actual.content == "Result: Success"
    assert isinstance(actual.content, str)

🏁 Final Verdict

Status: ⚠️ CHANGES REQUESTED

The PR addresses a real bug with a reasonable approach, but has significant gaps:

  1. Critical: Missing test coverage (only 14.28%)
  2. Important: Input mutation anti-pattern
  3. Important: Incomplete type checking

Estimated Effort to Address: 2-3 hours

  • 1-2 hours for comprehensive tests
  • 30 min for code improvements
  • 30 min for review updates

Review generated by Claude Code (Sonnet 4.5) on 2025-12-15

github-actions[bot] avatar Dec 15 '25 20:12 github-actions[bot]