[Bug]: Cannot use MultimodalConversableAgent with Anthropic models
Describe the bug
MultimodalConversableAgent doesn't support Anthropic models.
Steps to reproduce
If you run the following:

import os

import autogen
from autogen.agentchat.contrib.multimodal_conversable_agent import MultimodalConversableAgent  # for GPT-4V
prompt = """What's the breed of this dog?
<img https://th.bing.com/th/id/R.422068ce8af4e15b0634fe2540adea7a?rik=y4OcXBE%2fqutDOw&pid=ImgRaw&r=0>."""
assistant = MultimodalConversableAgent(
"assistant",
llm_config={
"config_list": [
{
# Choose your model name.
"model": "claude-3-5-sonnet-20240620",
# You need to provide your API key here.
"api_key": os.getenv("ANTHROPIC_API_KEY"),
"api_type": "anthropic",
}
],
"cache_seed": None,
},
)
user_proxy = autogen.UserProxyAgent(
"user_proxy",
human_input_mode="NEVER",
code_execution_config={
"work_dir": "coding",
"use_docker": False,
},
is_termination_msg=lambda x: x.get("content", "") and x.get("content", "").rstrip().endswith("TERMINATE"),
max_consecutive_auto_reply=1,
)
user_proxy.initiate_chat(assistant, message=prompt)
you will get the following error:

BadRequestError                           Traceback (most recent call last)
Cell In[16], line 34
      7 assistant = MultimodalConversableAgent(
      8     "assistant",
      9     llm_config={
   (...)
     20     },
     21 )
     23 user_proxy = autogen.UserProxyAgent(
     24     "user_proxy",
     25     human_input_mode="NEVER",
   (...)
     31     max_consecutive_auto_reply=1,
     32 )
---> 34 user_proxy.initiate_chat(assistant, message=prompt)

File ~/src/intuned-playwriter/.venv/lib/python3.11/site-packages/autogen/agentchat/conversable_agent.py:1019, in ConversableAgent.initiate_chat(self, recipient, clear_history, silent, cache, max_turns, summary_method, summary_args, message, **kwargs)
   1017 else:
   1018     msg2send = self.generate_init_message(message, **kwargs)
-> 1019 self.send(msg2send, recipient, silent=silent)
   1020 summary = self._summarize_chat(
   1021     summary_method,
   1022     summary_args,
   1023     recipient,
...
(...)
File ~/src/intuned-playwriter/.venv/lib/python3.11/site-packages/anthropic/_base_client.py
   1053     stream_cls=stream_cls,
   1054 )

BadRequestError: Error code: 400 - {'type': 'error', 'error': {'type': 'invalid_request_error', 'message': "messages.0.content.1: Input tag 'image_url' found using 'type' does not match any of the expected tags: 'text', 'image', 'tool_use', 'tool_result'"}}
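For context, the 400 comes from the shape of the message content: MultimodalConversableAgent emits OpenAI-style "image_url" content parts, while Anthropic's Messages API only accepts "text", "image", "tool_use", and "tool_result" content blocks, with images supplied as base64-encoded source objects. Roughly (the URL and data below are placeholders):

# OpenAI-style content part produced by MultimodalConversableAgent (rejected by Anthropic)
{"type": "image_url", "image_url": {"url": "https://example.com/dog.jpg"}}

# Content block shape the Anthropic Messages API expects instead
{"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": "<base64-encoded image bytes>"}}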
Model Used
claude-3-5-sonnet-20240620
Expected Behavior
It should work as it does for gpt-4o.
Screenshots and logs
Additional Information
No response
Hey @faisalil, unfortunately the Anthropic client class in AutoGen doesn't support multi-modality. If you're able to contribute, please feel free to create a Pull Request. I'm happy to help test it :).
@faisalil, I'll mark this as closed for now. If you're able to contribute, please feel free to create a Pull Request and I can assist you with it.
Hi @faisalil, I was coming up against the same issue. You can get vision to work with Anthropic by converting the content part of the message containing the image into a format that the Anthropic API accepts. In anthropic.py, replace the function oai_messages_to_anthropic_messages with the three functions below (incidentally, written in full by the new Sonnet 3.5). This also fixes an error I hit where the code that reformats the system prompt raised a type error because it expected a string but was given a dictionary. I haven't tested it thoroughly, but it has been working fine so far with local images; I have not tested it with remote URLs.
Note: these helpers also rely on re, requests, and base64; add those imports at the top of anthropic.py if they are not already present.

import base64
import re

import requests


def process_image_content(content_item: Dict[str, Any]) -> Dict[str, Any]:
    """Process an OpenAI image content item into Claude format."""
    if content_item['type'] != 'image_url':
        return content_item

    url = content_item['image_url']['url']
    try:
        # Handle data URLs
        if url.startswith('data:'):
            data_url_pattern = r'data:image/([a-zA-Z]+);base64,(.+)'
            match = re.match(data_url_pattern, url)
            if match:
                media_type, base64_data = match.groups()
                return {
                    'type': 'image',
                    'source': {
                        'type': 'base64',
                        'media_type': f'image/{media_type}',
                        'data': base64_data,
                    },
                }
        # Handle remote URLs
        else:
            response = requests.get(url)
            response.raise_for_status()
            content_type = response.headers.get('content-type', 'image/jpeg')
            return {
                'type': 'image',
                'source': {
                    'type': 'base64',
                    'media_type': content_type,
                    'data': base64.b64encode(response.content).decode('utf-8'),
                },
            }
    except Exception as e:
        print(f"Error processing image URL: {e}")

    # Return original content if image processing fails
    return content_item
def process_message_content(message: Dict[str, Any]) -> Union[str, List[Dict[str, Any]]]:
    """Process message content, handling both string and list formats with images."""
    content = message.get("content", "")

    # Handle empty content
    if content == "":
        return content

    # If content is already a string, return as is
    if isinstance(content, str):
        return content

    # Handle list content (mixed text and images)
    if isinstance(content, list):
        processed_content = []
        for item in content:
            if item['type'] == 'text':
                processed_content.append({
                    'type': 'text',
                    'text': item['text'],
                })
            elif item['type'] == 'image_url':
                processed_content.append(process_image_content(item))
        return processed_content

    return content
def oai_messages_to_anthropic_messages(params: Dict[str, Any]) -> list[dict[str, Any]]:
    """Convert messages from OAI format to Anthropic format.

    We correct for any specific role orders and types, etc.
    Handles regular messages, tool calls, and images.
    """
    has_tools = "tools" in params
    processed_messages = []
    user_continue_message = {"content": "Please continue.", "role": "user"}
    assistant_continue_message = {"content": "Please continue.", "role": "assistant"}

    tool_use_messages = 0
    tool_result_messages = 0
    last_tool_use_index = -1
    last_tool_result_index = -1

    for message in params["messages"]:
        if message["role"] == "system":
            content = process_message_content(message)
            if isinstance(content, list):
                # For system messages with images, concatenate only the text portions
                text_content = " ".join(
                    item.get('text', '') for item in content if item.get('type') == 'text'
                )
                params["system"] = params.get("system", "") + (" " if "system" in params else "") + text_content
            else:
                params["system"] = params.get("system", "") + (" " if "system" in params else "") + content
        else:
            expected_role = "user" if len(processed_messages) % 2 == 0 else "assistant"

            if "tool_calls" in message:
                # Existing tool call handling
                tool_uses = []
                tool_names = []
                for tool_call in message["tool_calls"]:
                    tool_uses.append(
                        ToolUseBlock(
                            type="tool_use",
                            id=tool_call["id"],
                            name=tool_call["function"]["name"],
                            input=json.loads(tool_call["function"]["arguments"]),
                        )
                    )
                    if has_tools:
                        tool_use_messages += 1
                    tool_names.append(tool_call["function"]["name"])

                if expected_role == "user":
                    processed_messages.append(user_continue_message)

                if has_tools:
                    processed_messages.append({"role": "assistant", "content": tool_uses})
                    last_tool_use_index = len(processed_messages) - 1
                else:
                    processed_messages.append(
                        {
                            "role": "assistant",
                            "content": f"Some internal function(s) that could be used: [{', '.join(tool_names)}]",
                        }
                    )
            elif "tool_call_id" in message:
                # Existing tool result handling
                if has_tools:
                    tool_result = {
                        "type": "tool_result",
                        "tool_use_id": message["tool_call_id"],
                        "content": message["content"],
                    }
                    if last_tool_result_index == len(processed_messages) - 1:
                        processed_messages[-1]["content"].append(tool_result)
                    else:
                        if expected_role == "assistant":
                            processed_messages.append(assistant_continue_message)
                        processed_messages.append({"role": "user", "content": [tool_result]})
                        last_tool_result_index = len(processed_messages) - 1

                    tool_result_messages += 1
                else:
                    processed_messages.append(
                        {"role": "user", "content": f"Running the function returned: {message['content']}"}
                    )
            elif message["content"] == "":
                pass  # Ignoring empty messages
            else:
                if expected_role != message["role"]:
                    processed_messages.append(
                        user_continue_message if expected_role == "user" else assistant_continue_message
                    )
                # Process the message content for images
                processed_content = process_message_content(message)
                processed_message = message.copy()
                processed_message["content"] = processed_content
                processed_messages.append(processed_message)

    # Handle unmatched tool_use/tool_result
    if has_tools and tool_use_messages != tool_result_messages:
        processed_messages[last_tool_use_index] = assistant_continue_message

    # Remove name field from messages
    for message in processed_messages:
        if "name" in message:
            message.pop("name", None)

    # Ensure last message is from user
    if processed_messages[-1]["role"] != "user":
        processed_messages.append(user_continue_message)

    return processed_messages
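To see the conversion in isolation, here is a minimal sketch (the data URL is a placeholder, not a real image) that runs a single OpenAI-style content item through process_image_content:

# Minimal sketch: convert one OpenAI-style image content item to the Claude format.
# The data URL below is a placeholder, not a real image.
oai_item = {
    "type": "image_url",
    "image_url": {"url": "data:image/png;base64,iVBORw0KGgo="},
}

claude_item = process_image_content(oai_item)
print(claude_item)
# Expected shape:
# {'type': 'image', 'source': {'type': 'base64', 'media_type': 'image/png', 'data': 'iVBORw0KGgo='}}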
@jk10001, thanks so much for doing that and I'm sure it will be helpful for others. Are you able to put this into a PR? Then I can assist with testing it.
Part of the PR will be updating and adding tests in test/oai/test_anthropic.py. If possible, a basic update of the documentation to note or show the new multi-modal capability would also be great.
@marklysze, no worries. I'm pretty new to this. What is needed for the test file?
Hey @jk10001, sure, you can have a look at the current test file for Anthropic; essentially each function has a matching test: https://github.com/microsoft/autogen/blob/0.2/test/oai/test_anthropic.py
Some tests may need to be updated based on your code. Feel free to create a PR if you wanted to proceed with getting your code in.
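As a rough starting point (not part of the existing suite, and the import path is an assumption), a test exercising the new image handling could look something like this:

# Hypothetical pytest sketch for the image-conversion path; adjust the import
# to wherever the helpers end up living in anthropic.py.
from autogen.oai.anthropic import oai_messages_to_anthropic_messages


def test_image_url_converted_to_anthropic_image_block():
    params = {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image."},
                    {
                        "type": "image_url",
                        # Placeholder data URL; a real test might embed a tiny valid PNG.
                        "image_url": {"url": "data:image/png;base64,iVBORw0KGgo="},
                    },
                ],
            }
        ]
    }

    converted = oai_messages_to_anthropic_messages(params)
    image_block = converted[0]["content"][1]

    assert image_block["type"] == "image"
    assert image_block["source"]["type"] == "base64"
    assert image_block["source"]["media_type"] == "image/png"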