Support multimodal (GPT-4V) directly in OpenAIClient
This PR is part of #1975.
As [Major Update 1] in "Multimodal Orchestration", it enables the OpenAIClient to load images and communicate with OpenAI directly.
Hence, any ConversableAgent can use GPT-4V through its llm_config; the multimodal client reads the images before sending them to OpenAI.
RISK: To resolve a dependency issue and avoid a circular import, "img_utils" is moved from the contrib folder to the main autogen folder, which may cause errors. The testing workflow is also updated.
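For illustration, a vision-capable entry in the config list might look like the sketch below; the exact model name and the environment-variable handling are assumptions for the example, not requirements of this PR:

```python
import os

# Illustrative only: a config_list entry pointing at a vision-capable OpenAI model.
config_list_4v = [
    {
        "model": "gpt-4-vision-preview",
        "api_key": os.environ["OPENAI_API_KEY"],
    }
]
```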
Why are these changes needed?
Related issue number
Checks
- [ ] I've included any doc changes needed for https://microsoft.github.io/autogen/. See https://microsoft.github.io/autogen/docs/Contribute#documentation to build and test documentation locally.
- [X] I've added tests (if relevant) corresponding to the changes introduced in this PR.
- [X] I've made sure all auto checks have passed.
Codecov Report
Attention: Patch coverage is 83.01887% with 18 lines in your changes missing coverage. Please review.
Project coverage is 48.76%. Comparing base (989c182) to head (e5205a9). Report is 357 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #2026 +/- ##
===========================================
+ Coverage 37.83% 48.76% +10.93%
===========================================
Files 77 78 +1
Lines 7766 7796 +30
Branches 1663 1809 +146
===========================================
+ Hits 2938 3802 +864
+ Misses 4579 3678 -901
- Partials 249 316 +67
| Flag | Coverage Δ | |
|---|---|---|
| unittest | 14.40% <15.09%> (?) | |
| unittests | 47.70% <83.01%> (+9.88%) | :arrow_up: |
Flags with carried forward coverage won't be shown.
So as you know I'm a big fan of this PR -- but am super nervous about introducing a prompting DSL that diverges from other prompting template DSLs (e.g., Guidance), overlaps with HTML, cannot be disabled, and has no escaping conventions.
Anyhow I like everything else in this PR, and the DSL stuff appears already to have been merged... So I am willing to accept when you think it's ready.
Yes, this PR is unrelated to the HTML-style parser. I will create another PR to disable HTML parsing by default after this one. I just modified ConversableAgent to accept OpenAI-style multimodal input; see this example (also added to the notebook).
```python
import autogen
from autogen import ConversableAgent

# Here is another way to define images, without using the HTML tag, using the OpenAI format directly.
# config_list_4v is assumed to be defined earlier (e.g., loaded with autogen.config_list_from_json).
image_agent = ConversableAgent(
    name="image-explainer",
    max_consecutive_auto_reply=10,
    llm_config={"config_list": config_list_4v, "temperature": 0.5, "max_tokens": 300, "cache_seed": 42},
)

user_proxy = autogen.UserProxyAgent(
    name="User_proxy",
    system_message="A human admin.",
    human_input_mode="NEVER",  # Try between ALWAYS or NEVER
    max_consecutive_auto_reply=0,
    code_execution_config={
        "use_docker": False
    },  # Please set use_docker=True if docker is available to run the generated code. Using docker is safer than running the generated code directly.
)

user_proxy.initiate_chat(
    image_agent,
    message=[
        {"type": "text", "text": "What's the breed of this dog?"},
        {
            "type": "image_url",
            "image_url": {"url": "https://th.bing.com/th/id/R.422068ce8af4e15b0634fe2540adea7a?rik=y4OcXBE%2fqutDOw&pid=ImgRaw&r=0"},
        },
    ],
)
```
I was still waiting for others to formalize the "Message" in autogen, but there are many blockers for that. Maybe I should just go ahead and create a simple ImageMessage class.
Guidance and Gemini also use special tokens for images, such as "<|_image:xxxxx|>", which has less chance of colliding with HTML tags.
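To make the idea concrete, here is a minimal sketch of what such an ImageMessage class could look like. It is purely hypothetical: the class name, fields, and the to_openai() helper are assumptions for illustration, not part of this PR.

```python
from dataclasses import dataclass


@dataclass
class ImageMessage:
    """Hypothetical sketch of a typed image message (not part of this PR)."""

    url: str  # http(s) URL, local path, or base64 data URI
    detail: str = "auto"  # OpenAI vision detail hint: "low", "high", or "auto"

    def to_openai(self) -> dict:
        # Convert to the OpenAI multimodal content-part format.
        return {"type": "image_url", "image_url": {"url": self.url, "detail": self.detail}}
```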
Yes, I like that tag format much better... it doesn't collide with HTML. Out of curiosity, do you know whether those tokens are treated specially in their tokenization?
@afourney Yes, usually images are treated separately in the model. There is often an "image encoder model" (separate from the LLM) that processes the image, and this "encoder" can be thought of as a "tokenizer". Typically, the process converts the image URI string back to an image, cuts the image into $k$ patches, encodes the $k$ patches into $k$ tokens, and then places them alongside the text inputs (concatenation of the tokens).
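For intuition only, here is a minimal NumPy sketch of that patch-then-embed idea. The patch size, embedding dimension, and the random projection are illustrative assumptions; real vision encoders (e.g., a ViT) learn this projection and add positional information.

```python
import numpy as np


def image_to_tokens(image: np.ndarray, patch: int = 16, dim: int = 64) -> np.ndarray:
    """Illustrative sketch: split an image into patches and project each patch
    to an embedding ("image token"). Patch size and dims are arbitrary here."""
    h, w, c = image.shape
    h, w = h - h % patch, w - w % patch  # crop so patches tile evenly
    patches = (
        image[:h, :w]
        .reshape(h // patch, patch, w // patch, patch, c)
        .transpose(0, 2, 1, 3, 4)
        .reshape(-1, patch * patch * c)  # k flattened patches
    )
    projection = np.random.default_rng(0).normal(size=(patch * patch * c, dim))
    return patches @ projection  # shape (k, dim): k image "tokens"


image = np.random.rand(224, 224, 3)
image_tokens = image_to_tokens(image)  # these would be concatenated with the text tokens
print(image_tokens.shape)  # (196, 64) for a 14x14 grid of 16x16 patches
```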