Multimodal support
We will start with support for text generation from text+image input.
Then we will also focus on basic support for multimodal retrieval/RAG.
WIP draft: https://github.com/deepset-ai/haystack/pull/9167
Improved POC: #9246
Adding my thoughts here because, for me, this is one of the top priorities for the next evolution of our Haystack-based pipelines:
The ability to embed images, store them in a document store (alongside other text documents), and retrieve them. Vision-RAG can help solve a lot of issues with complex files. The idea is to use a multimodal embedding model such as Cohere Embed v4 that can embed both images and text into the same embedding space.
Then, at retrieval time, the retriever gets the top-k=1 image and feeds it to a vision-capable LLM (as an ImageContent ChatMessage) together with the user query (top-k=1 because LLMs are bad with multiple image inputs). This technique solves RAG for graphs, pictures, handwriting, and scanned documents, although it's quite expensive (images consume more tokens).
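As a rough illustration of the shared embedding space (the model name, parameter names, and response shape below are assumptions based on Cohere's public Python SDK, not Haystack code):

```python
import base64

import cohere

co = cohere.ClientV2()  # assumes the CO_API_KEY environment variable is set

# Embed v4 accepts images as base64 data URIs (assumption from Cohere's SDK docs)
with open("invoice_page_1.png", "rb") as f:
    image_uri = "data:image/png;base64," + base64.b64encode(f.read()).decode("utf-8")

image_emb = co.embed(
    model="embed-v4.0",
    input_type="image",
    embedding_types=["float"],
    images=[image_uri],
).embeddings.float[0]

query_emb = co.embed(
    model="embed-v4.0",
    input_type="search_query",
    embedding_types=["float"],
    texts=["What is the total amount on the invoice?"],
).embeddings.float[0]

# Both vectors live in the same space, so the usual similarity search over a
# document store works for text and image documents alike.
```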
In Haystack terms we need:
- Image Embedder (Cohere or self-hosted)
- Possibility to write an image to the document store, as the `content` key or another one. An image will not always be coupled to text.
- Make the retriever understand that if it retrieves text it should return `content`, but if it's an image it should return the encoded image content? It's not clear to me, but we need a way to check whether a retrieved document is an image or text so that it can be fed accordingly to a ChatMessage (in the prompt for text, as an ImageContent part for images); a rough sketch of this is below.
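A minimal sketch of that last point, assuming the `ImageContent` dataclass and `ChatMessage.from_user(content_parts=...)` from Haystack's multimodal work, and assuming the image bytes and MIME type were stored in the document's `meta` at write time (an illustrative convention, not an established schema):

```python
from haystack.dataclasses import ChatMessage, Document, ImageContent


def to_chat_message(query: str, retrieved: list[Document]) -> ChatMessage:
    """Build a user message from the query plus the retrieved documents.

    Text documents go into the prompt as plain text; image documents are
    attached as ImageContent parts. Whether a document is an image is decided
    here via an (assumed) `meta["mime_type"]` set at indexing time.
    """
    parts: list = [query]
    for doc in retrieved:
        mime = doc.meta.get("mime_type", "text/plain")
        if mime.startswith("image/"):
            # assumed convention: base64-encoded image stored in meta at write time
            parts.append(ImageContent(base64_image=doc.meta["base64_image"], mime_type=mime))
        else:
            parts.append(doc.content or "")
    return ChatMessage.from_user(content_parts=parts)
```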
@julian-risch @anakin87
Just curious. Images in Haystack pipelines are a highly requested feature, as the underlying OpenAI APIs already support this.
Is there a projected estimate for when the basic functionality can make it into a GA Haystack release?
Hello @lohit8846, we started adding multimodal experimental features in https://github.com/deepset-ai/haystack-experimental.
- More details here: https://github.com/deepset-ai/haystack-experimental/discussions/302
- 📓 Introduction to Multimodal Text Generation notebook
Regarding the timeline for integration into Haystack, we initially assumed a 3-month experimentation period, so the main features could be integrated in August. The timeline could turn out longer or shorter, though.
In any case, if you try the experimental features, please give us feedback in the related GitHub discussion.
Following the introduction of image support in Haystack 2.16.0, we added image support for a larger number of model providers (Amazon Bedrock, Anthropic, Azure, Google, Hugging Face API, Meta Llama API, Mistral, Nvidia, Ollama, OpenAI, OpenRouter, STACKIT) with the 2.17.0 release, so this issue can be closed. There is one open, optional follow-up issue here.
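For anyone landing here, a minimal text+image generation example with the released feature, assuming Haystack >= 2.16, the `ImageContent.from_file_path` helper, and an `OPENAI_API_KEY` in the environment (file name and model name are just examples):

```python
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage, ImageContent

# Load a local image; from_file_path handles base64 encoding and the MIME type.
image = ImageContent.from_file_path("chart.png")

# A user message mixing a text part and an image part.
message = ChatMessage.from_user(content_parts=["Describe what this chart shows.", image])

generator = OpenAIChatGenerator(model="gpt-4o-mini")  # any vision-capable model
result = generator.run(messages=[message])
print(result["replies"][0].text)
```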