
[Roadmap] Multimodal Agent Roadmap

Open zechengz opened this issue 11 months ago • 4 comments

Required prerequisites

  • [X] I have searched the Issue Tracker and Discussions that this hasn't already been reported. (+1 or comment there if it has.)
  • [X] Consider asking first in a Discussion.

Motivation

Currently, large multimodal models (LMMs) are gradually replacing large language models (LLMs). Unlike LLMs, LMMs accept inputs in multiple modalities, and some can even produce multimodal outputs. The additional modalities make LMMs more flexible and let them perform a wider range of tasks, so utilizing multimodal models in agents could substantially enhance the CAMEL agent's capabilities. Notable recent LMMs include GPT-4V, Gemini, and Claude 3. This feature request focuses mainly on GPT-4V, but we need to keep the interface general enough for other kinds of LMMs.

Solution

Basic Multimodal Agent (with GPT-4V):

  • Enable image input by adding image_url to the CAMEL agent's input_message, where image_url can be a URL to an image or base64-encoded image data. This may require modifying BaseMessage.
  • Agent memory needs to support image storage; some kinds of memory may not. The default ChatHistoryMemory should work well.
  • Update OpenAITokenCounter to count image tokens as well.
  • Add image-related examples such as OCR or object detection to verify the agent with the image modality.
    • Requires adding new image-related prompts in the prompts folder.
  • Brainstorm more interesting examples in which the user and assistant can use the image modality collaboratively.
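To make the OpenAITokenCounter item above concrete, here is a sketch of image-token accounting following the formula OpenAI published for GPT-4V (a flat 85 tokens for low detail; otherwise 85 base tokens plus 170 per 512 px tile after resizing). The function name is hypothetical, and small images below 768 px are not special-cased, so treat this as a sketch rather than the final implementation:

```python
import math

def count_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Sketch of GPT-4V image token accounting, per OpenAI's published formula."""
    if detail == "low":
        return 85  # low-detail images cost a flat 85 tokens
    # 1) fit within a 2048 x 2048 square, preserving aspect ratio
    if max(width, height) > 2048:
        scale = 2048 / max(width, height)
        width, height = int(width * scale), int(height * scale)
    # 2) scale so the shortest side is 768 px
    scale = 768 / min(width, height)
    width, height = int(width * scale), int(height * scale)
    # 3) 170 tokens per 512 px tile, plus a fixed 85-token base
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return tiles * 170 + 85
```

For example, a 1024x1024 image in high detail is resized to 768x768, which is 4 tiles: 4 * 170 + 85 = 765 tokens.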

Advanced Multimodal Agent (GPT-4V):

  • Enable image modality in EmbodiedAgent and create some interesting examples.

Multimodal Agent with different LMMs:

  • Support other LMMs such as Claude 3 and Gemini in addition to GPT-4V.
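Keeping the interface general across LMMs mostly means adapting one stored image to each provider's message schema. A hedged sketch (the helper names are hypothetical; the payload shapes follow the OpenAI chat-completions and Anthropic Messages vision formats as documented at the time):

```python
def to_openai_content(text: str, b64: str, media_type: str = "image/png") -> list:
    # OpenAI vision format: base64 image embedded as a data URL
    return [
        {"type": "text", "text": text},
        {"type": "image_url",
         "image_url": {"url": f"data:{media_type};base64,{b64}"}},
    ]

def to_anthropic_content(text: str, b64: str, media_type: str = "image/png") -> list:
    # Anthropic Messages format: explicit base64 source block
    return [
        {"type": "text", "text": text},
        {"type": "image",
         "source": {"type": "base64", "media_type": media_type, "data": b64}},
    ]
```

The same (text, base64 image) pair stored in the agent's message can then be serialized per backend at call time, so BaseMessage itself stays provider-agnostic.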

Alternatives

No response

Additional context

No response

zechengz avatar Mar 07 '24 09:03 zechengz

An important modification will be supporting multimodality in BaseMessage, which is our primary data-exchange format. It may require a lot of code changes to refactor it.

dandansamax avatar Mar 07 '24 12:03 dandansamax

@dandansamax IMO it depends on how we want to store the images. If we just store the images in base64-encoded format (which is also a string format), then we may not need too many changes. We can discuss the details of image storage offline.

zechengz avatar Mar 08 '24 10:03 zechengz


@zechengz I agree, using base64 for image storage seems promising. However, I'm concerned about potential slowdowns in image-editing workflows, and we still need to modify the BaseMessage structure to differentiate between image and text content. Let's delve into this further in a meeting.
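A minimal illustration of the trade-off being discussed: base64 keeps an image as a plain string, so it fits the existing string-based message fields, at the cost of roughly 4/3 size inflation and an encode/decode step on every edit. Helper names here are hypothetical:

```python
import base64

def image_to_base64(image_bytes: bytes) -> str:
    # base64 keeps the payload as a plain string, so it can live
    # inside the existing string-based BaseMessage fields
    return base64.b64encode(image_bytes).decode("ascii")

def base64_to_image(data: str) -> bytes:
    # decoding is required before any image manipulation,
    # which is the slowdown concern for editing-heavy workflows
    return base64.b64decode(data)

raw = bytes(range(256)) * 4  # stand-in for real PNG/JPEG bytes
encoded = image_to_base64(raw)
assert base64_to_image(encoded) == raw  # lossless round trip
```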

dandansamax avatar Mar 08 '24 16:03 dandansamax

Discussed with @dandansamax offline; in general we will:

  • Modify the BaseMessage
    • Add image: Optional[PIL.Image.Image]
      • We store the decoded image because we need image stats such as the image size
    • Focus on base64 only, not image URLs
    • Some memories only support text; we can detect this and raise an error
  • See the previous multimodal prompt PR: https://github.com/camel-ai/camel/pull/320
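The decisions above could be sketched roughly as follows. All names are hypothetical; `image` would be a `PIL.Image.Image` in practice, typed here as `Any` to keep the sketch dependency-free:

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class MultimodalMessage:
    # hypothetical shape for an image-aware BaseMessage; storing the
    # decoded image object keeps stats like image.size cheap to read
    role_name: str
    content: str
    image: Optional[Any] = None  # PIL.Image.Image when present

class TextOnlyMemoryError(TypeError):
    """Raised when an image message is written to a text-only memory."""

def write_to_memory(memory_supports_images: bool, msg: MultimodalMessage) -> None:
    # detect unsupported storage up front and fail loudly,
    # per the decision above
    if msg.image is not None and not memory_supports_images:
        raise TextOnlyMemoryError("this memory backend stores text only")
```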

zechengz avatar Mar 11 '24 17:03 zechengz