
Native Multimodal support

Open · JacobFV opened this issue 1 year ago

Feature request

Define corresponding primitive structures and interfaces for images, audio, and video as has already been done for text.

Currently we have this base Document class:

from pydantic import BaseModel, Field

class Document(BaseModel):
    """Interface for interacting with a document."""

    page_content: str
    metadata: dict = Field(default_factory=dict)

Ideally, we would lift the modality-agnostic features into a superclass:

class Object(BaseModel):
    """Interface for interacting with data in any modality."""
    metadata: dict = Field(default_factory=dict)

    class Config:
        arbitrary_types_allowed = True  # needed for the np.ndarray fields below

class Document(Object):
    """Interface for interacting with a document."""
    page_content: str

and then define Image, Audio, and Video structures for the corresponding modalities:

import numpy as np

class Image(Object):
    """Interface for interacting with an image."""
    image: np.ndarray

class Audio(Object):
    """Interface for interacting with an audio clip."""
    audio: np.ndarray

class Video(Object):
    """Interface for interacting with a video clip."""
    video: np.ndarray

class CaptionedVideo(Video, Document):
    """Video with captions."""

class SoundVideo(Video, Audio):
    """Video with sound."""

class CaptionedSoundVideo(Video, Audio, Document):
    """Video with captions and sound."""

(Perhaps Document should be renamed Text to stay consistent with the other modality type names.)
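
For concreteness, here's a quick usage sketch of the hierarchy above; the sample shapes and file name are invented for illustration:

import numpy as np

frames = np.zeros((120, 256, 256, 3), dtype=np.uint8)  # 120 RGB frames, 256x256
waveform = np.zeros(16000 * 5, dtype=np.float32)       # 5 seconds of 16 kHz mono audio

clip = CaptionedSoundVideo(
    video=frames,
    audio=waveform,
    page_content="A narrator describes the scene.",
    metadata={"source": "clip.mp4", "fps": 24},
)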

And also define corresponding model abstractions and implementations:

├── audio_models
│   ├── __init__.py
[...]
├── input.py
├── image_models
│   ├── __init__.py
[...]
├── llms
│   ├── __init__.py
│   ├── ai21.py
[...]
│   └── writer.py
[...]
├── video_models
│   ├── __init__.py
[...]

And somewhere in the schema, we'd add a BaseModel (or similar, renamed to avoid a pydantic collision!) that BaseLanguageModel, BaseVisionLanguageModel, BaseVisionModel, etc. would all inherit from.
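
A rough sketch of what that shared root might look like; every name and the generate signature here are illustrative placeholders (BaseLanguageModel below is a stand-in, not the existing LangChain class):

from abc import ABC, abstractmethod

class BaseModalityModel(ABC):
    """Hypothetical shared root for models of any modality.

    Named to avoid colliding with pydantic's BaseModel.
    """

    @abstractmethod
    def generate(self, inputs: list[Object]) -> list[Object]:
        """Map a batch of modality objects to output objects."""

class BaseLanguageModel(BaseModalityModel):
    """Text in, text out."""

class BaseVisionModel(BaseModalityModel):
    """Images in, images or embeddings out."""

class BaseVisionLanguageModel(BaseModalityModel):
    """Mixed image/text in, text out (e.g. LLaVA-style models)."""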

I'm not sure how many top-level modules could be abstracted up to Object without concern for the model's modality. This would be a major refactor and probably needs some planning. I'd be happy to participate in the conversation and development.

Motivation

  1. LLaVA, CLAP, BARK, etc. The Cambrian explosion is spreading beyond language-only models. Today that means vision-language and audio-language models; tomorrow it may mean all three modalities or more.
  2. I've got my really awesome AGIAgent, but it can only process text. I'd like a way to just swap out a few modules so it can process images instead of, or in addition to, text input.
  3. LangChain abstractions are great. I wish they existed in the image dev space.
  4. LangChain can reach a larger audience with multimodal models.

Your contribution

I will contribute to the conversation and development.

JacobFV · May 07 '23

Also, MultiChain may be a more appropriate name if this is implemented.

JacobFV · May 07 '23

I had started discussing this on the LangChainJS side as well (hwchase17/langchainjs#1628), and we now have an embeddings model that could have used it (Google has a multimodal embeddings model that the JS side has implemented using additional methods: hwchase17/langchainjs#2007).

My thoughts are that we should have:

  • An abstract parent class (that contains the metadata). Call it AbstractDocument for now.
  • The Document we have now becomes a subclass.
  • A BlobDocument that just contains raw bytes, possibly with an additional field for the contentType (or that field could live in the abstract parent).
  • A ListDocument that is an ordered list of AbstractDocuments, so you can represent mixed text and images, alternate representations, or all the other ways contentType is used today.

This way, everything that handles a Document today continues to do so, and we can start moving things over to take AbstractDocument as they become able to handle more complicated inputs. I don't think we need specific classes for every possible media type, but specific implementations could make sure they're able to handle the contentType. A TextDocument is a reasonable exception.
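
A minimal sketch of that hierarchy, assuming pydantic as in the earlier snippets; the field names are my guesses at the proposal, not a settled API:

from typing import List
from pydantic import BaseModel, Field

class AbstractDocument(BaseModel):
    """Shared parent: the metadata lives here."""
    metadata: dict = Field(default_factory=dict)

class Document(AbstractDocument):
    """The existing text document, unchanged for current callers."""
    page_content: str

class BlobDocument(AbstractDocument):
    """Raw bytes plus a MIME-style content type."""
    blob: bytes
    content_type: str  # e.g. "image/png", "audio/wav"

class ListDocument(AbstractDocument):
    """An ordered list of parts, e.g. interleaved text and images."""
    documents: List[AbstractDocument]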

I considered redefining Document to take additional fields for blob and list, with the rule that one, and only one, of them must be defined. But that ends up having issues with backwards compatibility.

It is arguable that we should rename Document to TextDocument or something, and that the abstract class should be Document. I like this idea, but I think it risks breaking a lot of non-library code.

(cc: @jacoblee93 @nfcampos @eyurtsev @hwchase17)

afirstenberg · Jul 25 '23

Yeah, it'd definitely be breaking. I dunno what the roadmap looks like, but maybe for 0.1?

JacobFV · Jul 25 '23

Hi, @JacobFV! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, this issue is a feature request to add native multimodal support to the project. There has been some discussion about the implementation, including suggestions for defining primitive structures and interfaces for images, audio, and video, and for abstracting modality-agnostic features into a superclass. There was also mention of renaming the existing Document class. However, the issue remains unresolved.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution to the LangChain project! Let us know if you have any further questions or concerns.

dosubot[bot] · Nov 06 '23

Hey dosubot!

Let's let a human drop the ax on this issue, as it could have a significant impact on the LangChain project. @baskaryan @rlancemartin

JacobFV · Nov 06 '23

@baskaryan Could you please help @JacobFV with this issue? They have indicated that it is still relevant and could have a significant impact on the LangChain project. Thank you!

dosubot[bot] · Nov 06 '23

Hi, @JacobFV,

I'm helping the LangChain team manage their backlog and am marking this issue as stale. The issue you raised proposes adding support for native multimodal features to the LangChain project, including defining primitive structures and interfaces for images, audio, and video, and abstracting modality-agnostic features to a superclass. There has been discussion around the implementation, including suggestions for renaming the existing Document class. However, the issue remains unresolved, and there is a request for further input from maintainers such as @baskaryan and @rlancemartin.

Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you!

dosubot[bot] · Feb 07 '24

You can close it. I'm working on a competing framework: FraneChain for images and BrainChain for full multimodal.

JacobFV · Feb 07 '24

Thanks! We've added images in messages and are exploring ways to support other modalities as well.

jacoblee93 · Feb 07 '24

@jacoblee93 Where can I find code or documentation on images in messages?

JacobFV · Mar 28 '24

Examples are on individual integration pages; see also:

https://js.langchain.com/docs/integrations/chat/openai#multimodal-messages
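
(That link covers LangChain JS. On the Python side, the analogous pattern passes a list of typed content parts on a message; a sketch against a recent langchain-openai release, with a placeholder image URL:)

from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI

# OpenAI-style multimodal message: content is a list of typed parts
# rather than a plain string.
message = HumanMessage(
    content=[
        {"type": "text", "text": "What is shown in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
    ]
)

model = ChatOpenAI(model="gpt-4o")  # any vision-capable chat model works here
response = model.invoke([message])
print(response.content)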

jacoblee93 · Mar 28 '24