[Roadmap] Multimodal Agent Roadmap
Required prerequisites
- [X] I have searched the Issue Tracker and Discussions that this hasn't already been reported. (+1 or comment there if it has.)
- [X] Consider asking first in a Discussion.
Motivation
Currently, large multimodal models (LMMs) are gradually replacing large language models (LLMs). Unlike LLMs, LMMs accept inputs in multiple modalities, and some models even produce outputs in multiple modalities. The extra modalities give LMMs more flexibility and let them perform a wider range of tasks, so utilizing multimodal models in agents will potentially enhance the Camel Agent's capability. Recent notable LMMs include GPT-4V, Gemini, and Claude 3. In this feature request we mainly focus on GPT-4V, but the interface should be kept general enough for other kinds of LMMs.
Solution
Basic Multimodal Agent (with GPT-4V):
- Enable and add `image_url` to the camel agent's `input_message`, where `image_url` can be a URL to an image or base64 encoded image data. May need to modify `BaseMessage` (see the payload sketch after this list).
- Agent `memory` needs to support image storage; some kinds of `memory` may not support it. The default `ChatHistoryMemory` should work well.
- Update `OpenAITokenCounter` to also count image tokens (see the token-counting sketch after this list).
- Add image related examples such as OCR or object detection to verify the agent with the image modality.
  - Requires adding new image related prompts to the `prompts` folder.
- Brainstorm more interesting examples where the user and assistant can utilize the image modality in a collaborative way.
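The first bullet corresponds to the message payload GPT-4V consumes. Below is a minimal sketch of that payload, assuming the OpenAI chat-completions vision format; `build_user_content` and the example URL are illustrative, not existing camel code.

```python
def build_user_content(text: str, image_url: str) -> list:
    """Build OpenAI vision-style message content from text plus one image.

    `image_url` may be an http(s) URL or a base64 data URI of the form
    "data:image/png;base64,<encoded bytes>".
    """
    return [
        {"type": "text", "text": text},
        {"type": "image_url", "image_url": {"url": image_url}},
    ]


message = {
    "role": "user",
    "content": build_user_content(
        "What objects are in this image?",
        "https://example.com/cat.png",
    ),
}
```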
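For the `OpenAITokenCounter` item, here is a sketch of image token accounting based on the tiling rule in OpenAI's vision documentation (85 base tokens plus 170 per 512 px tile in "high" detail, after the documented downscaling). The function name and its integration point are assumptions, not the existing counter API.

```python
import math


def count_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate GPT-4V token cost for one image of the given pixel size."""
    if detail == "low":
        # Low-detail images cost a flat 85 tokens regardless of size.
        return 85
    # High detail: first fit the image within a 2048 x 2048 square...
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # ...then shrink so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Each 512 x 512 tile costs 170 tokens, plus a flat 85 base tokens.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 170 * tiles + 85
```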
Advanced Multimodal Agent (GPT-4V):
- Enable the image modality in `EmbodiedAgent` and create some interesting examples.
Multimodal Agent with different LMMs:
- Support Claude 3, Gemini, and other LMMs beyond GPT-4V.
Alternatives
No response
Additional context
No response
An important modification will be supporting multimodality in `BaseMessage`, which is our primary data exchange format. Refactoring it may require a lot of code changes.
@dandansamax IMO it depends on how we want to store the images. If we just store the images in base64 encoded format (which is also a string format), then we may not need too many changes. We can discuss more details on how we want to store the images offline.
@zechengz I agree, using base64 for image storage seems promising. However, I'm concerned about potential slowdowns in image editing processes, and we still need to modify the `BaseMessage` structure to differentiate between image and text content. Let's delve into this further in a meeting.
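For reference, a minimal sketch of the base64 option under discussion: the image round-trips through a plain Python string, so it could be carried in string-valued fields without structural changes. The helper names are illustrative.

```python
import base64
import io

from PIL import Image


def image_to_base64(img: Image.Image) -> str:
    """Serialize a PIL image to a plain base64 string (PNG bytes)."""
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode("utf-8")


def base64_to_image(data: str) -> Image.Image:
    """Restore a PIL image from its base64 string form."""
    return Image.open(io.BytesIO(base64.b64decode(data)))
```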
Discussed with @dandansamax offline; in general we will:
- Modify `BaseMessage`:
  - Add `image: Optional[PIL.Image.Image]`
    - We store the image object because we need some image stats such as the image size.
  - Just focus on base64 and not image URLs.
  - Some memory only supports text; we can detect this and raise an error (see the sketch after this list).
- See the previous multimodal prompt PR: https://github.com/camel-ai/camel/pull/320
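A minimal sketch of this plan, with illustrative class and method names (the real `BaseMessage` and memory classes live elsewhere in the codebase): the message carries an optional PIL image, and a text-only memory detects it and raises an error.

```python
from dataclasses import dataclass
from typing import List, Optional

from PIL import Image


@dataclass
class BaseMessage:
    role_name: str
    content: str
    # Stored as a decoded PIL image so stats such as size are available.
    image: Optional[Image.Image] = None


class TextOnlyMemory:
    """Illustrative memory backend that cannot persist images."""

    def __init__(self) -> None:
        self._records: List[BaseMessage] = []

    def write(self, msg: BaseMessage) -> None:
        # Detect unsupported image content and fail loudly, per the plan.
        if msg.image is not None:
            raise ValueError(
                f"{type(self).__name__} does not support image storage; "
                f"got an image of size {msg.image.size}"
            )
        self._records.append(msg)
```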