Support multimodal (GPT-4V) directly in OpenAIClient
This PR is part of #1975.
As [Major Update 1] in "Multimodal Orchestration", it enables the OpenAIClient to load images and communicate with OpenAI directly.
Hence, any ConversableAgent can use GPT-4V through its llm_config; the multimodal client reads the images before sending them to OpenAI.
RISK: To resolve a dependency issue and avoid a circular import, "img_utils" is moved from the contrib folder to the main autogen folder, which may cause errors. The testing workflow is also updated.
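For illustration, a vision-capable entry in the config list might look like the sketch below; the exact model name and the environment-variable handling are assumptions for the example, not requirements of this PR:

```python
import os

# Illustrative only: a config_list entry pointing at a vision-capable OpenAI model.
config_list_4v = [
    {
        "model": "gpt-4-vision-preview",
        "api_key": os.environ["OPENAI_API_KEY"],
    }
]
```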
Why are these changes needed?
Related issue number
Checks
- [ ] I've included any doc changes needed for https://microsoft.github.io/autogen/. See https://microsoft.github.io/autogen/docs/Contribute#documentation to build and test documentation locally.
- [X] I've added tests (if relevant) corresponding to the changes introduced in this PR.
- [X] I've made sure all auto checks have passed.
Codecov Report
Attention: Patch coverage is 83.01887% with 18 lines in your changes missing coverage. Please review.
Project coverage is 48.76%. Comparing base (989c182) to head (e5205a9). Report is 357 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #2026 +/- ##
===========================================
+ Coverage 37.83% 48.76% +10.93%
===========================================
Files 77 78 +1
Lines 7766 7796 +30
Branches 1663 1809 +146
===========================================
+ Hits 2938 3802 +864
+ Misses 4579 3678 -901
- Partials 249 316 +67
| Flag | Coverage Δ | |
|---|---|---|
| unittest | 14.40% <15.09%> (?) | |
| unittests | 47.70% <83.01%> (+9.88%) | :arrow_up: |
Flags with carried forward coverage won't be shown.
So as you know I'm a big fan of this PR -- but am super nervous about introducing a prompting DSL that diverges from other prompting template DSLs (e.g., Guidance), overlaps with HTML, cannot be disabled, and has no escaping conventions.
Anyhow I like everything else in this PR, and the DSL stuff appears already to have been merged... So I am willing to accept when you think it's ready.
Yes, this PR is unrelated to the HTML-style parser. I will create another PR to disable HTML parsing by default after this one. I just modified ConversableAgent to accept OpenAI-style multimodal input; see this example (also added to the notebook).
```python
import autogen
from autogen import ConversableAgent

# Here is another way to define images, without using the HTML tag, using the OpenAI format directly.
# config_list_4v is assumed to be defined earlier (e.g., loaded with autogen.config_list_from_json).
image_agent = ConversableAgent(
    name="image-explainer",
    max_consecutive_auto_reply=10,
    llm_config={"config_list": config_list_4v, "temperature": 0.5, "max_tokens": 300, "cache_seed": 42},
)

user_proxy = autogen.UserProxyAgent(
    name="User_proxy",
    system_message="A human admin.",
    human_input_mode="NEVER",  # Try between ALWAYS or NEVER
    max_consecutive_auto_reply=0,
    code_execution_config={
        "use_docker": False
    },  # Please set use_docker=True if docker is available to run the generated code. Using docker is safer than running the generated code directly.
)

user_proxy.initiate_chat(
    image_agent,
    message=[
        {"type": "text", "text": "What's the breed of this dog?"},
        {
            "type": "image_url",
            "image_url": {"url": "https://th.bing.com/th/id/R.422068ce8af4e15b0634fe2540adea7a?rik=y4OcXBE%2fqutDOw&pid=ImgRaw&r=0"},
        },
    ],
)
```
I was still waiting for others to formalize the "Message" in autogen, but there are many blockers for that. Maybe I should just go ahead and create a simple ImageMessage class.
Guidance and Gemini also use special tokens for images, such as "<|_image:xxxxx|>", which has less chance of colliding with HTML tags.
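To make the idea concrete, here is a minimal sketch of what such an ImageMessage class could look like. It is purely hypothetical: the class name, fields, and the to_openai() helper are assumptions for illustration, not part of this PR.

```python
from dataclasses import dataclass


@dataclass
class ImageMessage:
    """Hypothetical sketch of a typed image message (not part of this PR)."""

    url: str  # http(s) URL, local path, or base64 data URI
    detail: str = "auto"  # OpenAI vision detail hint: "low", "high", or "auto"

    def to_openai(self) -> dict:
        # Convert to the OpenAI multimodal content-part format.
        return {"type": "image_url", "image_url": {"url": self.url, "detail": self.detail}}
```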
Yes, I like that tag format much better... it doesn't collide with HTML. Out of curiosity, do you know whether those tokens are treated specially in their tokenization?
@afourney Yes, usually images are treated separately in the model. There is often an "image encoder model" (separate from the LLM) that processes the image, and this "encoder" can be thought of as a "tokenizer". Typically, the process converts the image URI string back to an image, cuts the image into $k$ patches, encodes the $k$ patches into $k$ tokens, and then places them alongside the text inputs (concatenation of the tokens).
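For intuition only, here is a minimal NumPy sketch of that patch-then-embed idea. The patch size, embedding dimension, and the random projection are illustrative assumptions; real vision encoders (e.g., a ViT) learn this projection and add positional information.

```python
import numpy as np


def image_to_tokens(image: np.ndarray, patch: int = 16, dim: int = 64) -> np.ndarray:
    """Illustrative sketch: split an image into patches and project each patch
    to an embedding ("image token"). Patch size and dims are arbitrary here."""
    h, w, c = image.shape
    h, w = h - h % patch, w - w % patch  # crop so patches tile evenly
    patches = (
        image[:h, :w]
        .reshape(h // patch, patch, w // patch, patch, c)
        .transpose(0, 2, 1, 3, 4)
        .reshape(-1, patch * patch * c)  # k flattened patches
    )
    projection = np.random.default_rng(0).normal(size=(patch * patch * c, dim))
    return patches @ projection  # shape (k, dim): k image "tokens"


image = np.random.rand(224, 224, 3)
image_tokens = image_to_tokens(image)  # these would be concatenated with the text tokens
print(image_tokens.shape)  # (196, 64) for a 14x14 grid of 16x16 patches
```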