[Bug]: Cannot use MultimodalConversableAgent with Anthropic models
Describe the bug
MultimodalConversableAgent doesn't support Anthropic models.
Steps to reproduce
If you run the following:

import os

import autogen
from autogen.agentchat.contrib.multimodal_conversable_agent import MultimodalConversableAgent  # for GPT-4V
prompt = """What's the breed of this dog?
<img https://th.bing.com/th/id/R.422068ce8af4e15b0634fe2540adea7a?rik=y4OcXBE%2fqutDOw&pid=ImgRaw&r=0>."""
assistant = MultimodalConversableAgent(
"assistant",
llm_config={
"config_list": [
{
# Choose your model name.
"model": "claude-3-5-sonnet-20240620",
# You need to provide your API key here.
"api_key": os.getenv("ANTHROPIC_API_KEY"),
"api_type": "anthropic",
}
],
"cache_seed": None,
},
)
user_proxy = autogen.UserProxyAgent(
"user_proxy",
human_input_mode="NEVER",
code_execution_config={
"work_dir": "coding",
"use_docker": False,
},
is_termination_msg=lambda x: x.get("content", "") and x.get("content", "").rstrip().endswith("TERMINATE"),
max_consecutive_auto_reply=1,
)
user_proxy.initiate_chat(assistant, message=prompt)
you will get the following error:

BadRequestError                           Traceback (most recent call last)
Cell In[16], line 34
      7 assistant = MultimodalConversableAgent(
      8     "assistant",
      9     llm_config={
   (...)
     20     },
     21 )
     23 user_proxy = autogen.UserProxyAgent(
     24     "user_proxy",
     25     human_input_mode="NEVER",
   (...)
     31     max_consecutive_auto_reply=1,
     32 )
---> 34 user_proxy.initiate_chat(assistant, message=prompt)

File ~/src/intuned-playwriter/.venv/lib/python3.11/site-packages/autogen/agentchat/conversable_agent.py:1019, in ConversableAgent.initiate_chat(self, recipient, clear_history, silent, cache, max_turns, summary_method, summary_args, message, **kwargs)
   1017 else:
   1018     msg2send = self.generate_init_message(message, **kwargs)
-> 1019 self.send(msg2send, recipient, silent=silent)
   1020 summary = self._summarize_chat(
   1021     summary_method,
   1022     summary_args,
   1023     recipient,
...
(...)
File ~/src/intuned-playwriter/.venv/lib/python3.11/site-packages/anthropic/_base_client.py
   1053     stream_cls=stream_cls,
   1054 )

BadRequestError: Error code: 400 - {'type': 'error', 'error': {'type': 'invalid_request_error', 'message': "messages.0.content.1: Input tag 'image_url' found using 'type' does not match any of the expected tags: 'text', 'image', 'tool_use', 'tool_result'"}}
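For context, the 400 comes from the shape of the message content: MultimodalConversableAgent emits OpenAI-style "image_url" content parts, while Anthropic's Messages API only accepts "text", "image", "tool_use", and "tool_result" content blocks, with images supplied as base64-encoded source objects. Roughly (the URL and data below are placeholders):

# OpenAI-style content part produced by MultimodalConversableAgent (rejected by Anthropic)
{"type": "image_url", "image_url": {"url": "https://example.com/dog.jpg"}}

# Content block shape the Anthropic Messages API expects instead
{"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": "<base64-encoded image bytes>"}}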
Model Used
claude-3-5-sonnet-20240620
Expected Behavior
It should work as it does for gpt-4o.
Screenshots and logs
Additional Information
No response
Hey @faisalil, unfortunately the Anthropic client class in AutoGen doesn't support multi-modality. If you're able to contribute, please feel free to create a Pull Request. I'm happy to help test it :).
@faisalil, I'll mark this as closed for now. If you're able to contribute, please feel free to create a Pull Request and I can assist you with it.
Hi @faisalil, I was coming up against the same issue. You can get vision to work with Anthropic by converting the content part of the message containing the image into a format that the Anthropic API accepts. In anthropic.py, replace the function oai_messages_to_anthropic_messages with the three functions below (incidentally, written in full by the new Sonnet 3.5). This also fixes an error I hit where the code that reformats the system prompt raised a type error because it expected a string but was given a dictionary. I haven't tested it thoroughly, but it has been working fine so far with local images; I have not tested it with remote URLs.
Note: these helpers also rely on re, requests, and base64; add those imports at the top of anthropic.py if they are not already present.

import base64
import re

import requests


def process_image_content(content_item: Dict[str, Any]) -> Dict[str, Any]:
    """Process an OpenAI image content item into Claude format."""
    if content_item['type'] != 'image_url':
        return content_item

    url = content_item['image_url']['url']
    try:
        # Handle data URLs
        if url.startswith('data:'):
            data_url_pattern = r'data:image/([a-zA-Z]+);base64,(.+)'
            match = re.match(data_url_pattern, url)
            if match:
                media_type, base64_data = match.groups()
                return {
                    'type': 'image',
                    'source': {
                        'type': 'base64',
                        'media_type': f'image/{media_type}',
                        'data': base64_data,
                    },
                }
        # Handle remote URLs
        else:
            response = requests.get(url)
            response.raise_for_status()
            content_type = response.headers.get('content-type', 'image/jpeg')
            return {
                'type': 'image',
                'source': {
                    'type': 'base64',
                    'media_type': content_type,
                    'data': base64.b64encode(response.content).decode('utf-8'),
                },
            }
    except Exception as e:
        print(f"Error processing image URL: {e}")

    # Return original content if image processing fails
    return content_item
def process_message_content(message: Dict[str, Any]) -> Union[str, List[Dict[str, Any]]]:
    """Process message content, handling both string and list formats with images."""
    content = message.get("content", "")

    # Handle empty content
    if content == "":
        return content

    # If content is already a string, return as is
    if isinstance(content, str):
        return content

    # Handle list content (mixed text and images)
    if isinstance(content, list):
        processed_content = []
        for item in content:
            if item['type'] == 'text':
                processed_content.append({
                    'type': 'text',
                    'text': item['text'],
                })
            elif item['type'] == 'image_url':
                processed_content.append(process_image_content(item))
        return processed_content

    return content
def oai_messages_to_anthropic_messages(params: Dict[str, Any]) -> list[dict[str, Any]]:
    """Convert messages from OAI format to Anthropic format.

    We correct for any specific role orders and types, etc.
    Handles regular messages, tool calls, and images.
    """
    has_tools = "tools" in params
    processed_messages = []
    user_continue_message = {"content": "Please continue.", "role": "user"}
    assistant_continue_message = {"content": "Please continue.", "role": "assistant"}

    tool_use_messages = 0
    tool_result_messages = 0
    last_tool_use_index = -1
    last_tool_result_index = -1

    for message in params["messages"]:
        if message["role"] == "system":
            content = process_message_content(message)
            if isinstance(content, list):
                # For system messages with images, concatenate only the text portions
                text_content = " ".join(
                    item.get('text', '') for item in content if item.get('type') == 'text'
                )
                params["system"] = params.get("system", "") + (" " if "system" in params else "") + text_content
            else:
                params["system"] = params.get("system", "") + (" " if "system" in params else "") + content
        else:
            expected_role = "user" if len(processed_messages) % 2 == 0 else "assistant"

            if "tool_calls" in message:
                # Existing tool call handling
                tool_uses = []
                tool_names = []
                for tool_call in message["tool_calls"]:
                    tool_uses.append(
                        ToolUseBlock(
                            type="tool_use",
                            id=tool_call["id"],
                            name=tool_call["function"]["name"],
                            input=json.loads(tool_call["function"]["arguments"]),
                        )
                    )
                    if has_tools:
                        tool_use_messages += 1
                    tool_names.append(tool_call["function"]["name"])

                if expected_role == "user":
                    processed_messages.append(user_continue_message)

                if has_tools:
                    processed_messages.append({"role": "assistant", "content": tool_uses})
                    last_tool_use_index = len(processed_messages) - 1
                else:
                    processed_messages.append(
                        {
                            "role": "assistant",
                            "content": f"Some internal function(s) that could be used: [{', '.join(tool_names)}]",
                        }
                    )
            elif "tool_call_id" in message:
                # Existing tool result handling
                if has_tools:
                    tool_result = {
                        "type": "tool_result",
                        "tool_use_id": message["tool_call_id"],
                        "content": message["content"],
                    }
                    if last_tool_result_index == len(processed_messages) - 1:
                        processed_messages[-1]["content"].append(tool_result)
                    else:
                        if expected_role == "assistant":
                            processed_messages.append(assistant_continue_message)
                        processed_messages.append({"role": "user", "content": [tool_result]})
                        last_tool_result_index = len(processed_messages) - 1

                    tool_result_messages += 1
                else:
                    processed_messages.append(
                        {"role": "user", "content": f"Running the function returned: {message['content']}"}
                    )
            elif message["content"] == "":
                pass  # Ignoring empty messages
            else:
                if expected_role != message["role"]:
                    processed_messages.append(
                        user_continue_message if expected_role == "user" else assistant_continue_message
                    )
                # Process the message content for images
                processed_content = process_message_content(message)
                processed_message = message.copy()
                processed_message["content"] = processed_content
                processed_messages.append(processed_message)

    # Handle unmatched tool_use/tool_result
    if has_tools and tool_use_messages != tool_result_messages:
        processed_messages[last_tool_use_index] = assistant_continue_message

    # Remove name field from messages
    for message in processed_messages:
        if "name" in message:
            message.pop("name", None)

    # Ensure last message is from user
    if processed_messages[-1]["role"] != "user":
        processed_messages.append(user_continue_message)

    return processed_messages
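To see the conversion in isolation, here is a minimal sketch (the data URL is a placeholder, not a real image) that runs a single OpenAI-style content item through process_image_content:

# Minimal sketch: convert one OpenAI-style image content item to the Claude format.
# The data URL below is a placeholder, not a real image.
oai_item = {
    "type": "image_url",
    "image_url": {"url": "data:image/png;base64,iVBORw0KGgo="},
}

claude_item = process_image_content(oai_item)
print(claude_item)
# Expected shape:
# {'type': 'image', 'source': {'type': 'base64', 'media_type': 'image/png', 'data': 'iVBORw0KGgo='}}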
@jk10001, thanks so much for doing that and I'm sure it will be helpful for others. Are you able to put this into a PR? Then I can assist with testing it.
Part of the PR will be updating and adding tests in test/oai/test_anthropic.py. If possible, a basic update of the documentation to note or show the new multi-modal capability would also be great.
@marklysze, no worries. I'm pretty new to this. What is needed for the test file?
Hey @jk10001, sure, you can have a look at the current test file for Anthropic; essentially each function has a matching test: https://github.com/microsoft/autogen/blob/0.2/test/oai/test_anthropic.py
Some tests may need to be updated based on your code. Feel free to create a PR if you wanted to proceed with getting your code in.
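As a rough starting point (not part of the existing suite, and the import path is an assumption), a test exercising the new image handling could look something like this:

# Hypothetical pytest sketch for the image-conversion path; adjust the import
# to wherever the helpers end up living in anthropic.py.
from autogen.oai.anthropic import oai_messages_to_anthropic_messages


def test_image_url_converted_to_anthropic_image_block():
    params = {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image."},
                    {
                        "type": "image_url",
                        # Placeholder data URL; a real test might embed a tiny valid PNG.
                        "image_url": {"url": "data:image/png;base64,iVBORw0KGgo="},
                    },
                ],
            }
        ]
    }

    converted = oai_messages_to_anthropic_messages(params)
    image_block = converted[0]["content"][1]

    assert image_block["type"] == "image"
    assert image_block["source"]["type"] == "base64"
    assert image_block["source"]["media_type"] == "image/png"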