Multimodal support
We will start with support for text generation from text+image input.
Then we will also focus on basic support for multimodal retrieval/RAG.
WIP draft: https://github.com/deepset-ai/haystack/pull/9167
Improved POC: #9246
Adding my thoughts here because, for me, this is one of the top priorities for the next evolution of our Haystack-based pipelines:
The ability to embed images, store them in a document store (alongside other text documents), and retrieve them. Vision-RAG can help solve a lot of issues with complex files. The idea is to use a multimodal embedding model such as Cohere Embed v4 that can embed both images and text into the same embedding space.
Then, at retrieval time, the retriever gets the top-k=1 image and feeds it to a vision-capable LLM (as an ImageContent ChatMessage) together with the user query (top-k=1 because LLMs are bad with multiple image inputs). This technique solves RAG for graphs, pictures, handwriting, and scanned documents, although it's quite expensive (images consume more tokens).
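As a rough illustration of the shared embedding space (the model name, parameter names, and response shape below are assumptions based on Cohere's public Python SDK, not Haystack code):

```python
import base64

import cohere

co = cohere.ClientV2()  # assumes the CO_API_KEY environment variable is set

# Embed v4 accepts images as base64 data URIs (assumption from Cohere's SDK docs)
with open("invoice_page_1.png", "rb") as f:
    image_uri = "data:image/png;base64," + base64.b64encode(f.read()).decode("utf-8")

image_emb = co.embed(
    model="embed-v4.0",
    input_type="image",
    embedding_types=["float"],
    images=[image_uri],
).embeddings.float[0]

query_emb = co.embed(
    model="embed-v4.0",
    input_type="search_query",
    embedding_types=["float"],
    texts=["What is the total amount on the invoice?"],
).embeddings.float[0]

# Both vectors live in the same space, so the usual similarity search over a
# document store works for text and image documents alike.
```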
In Haystack terms we need:
- Image Embedder (Cohere or self-hosted)
- Possibility to write an image to the document store, as the `content` key or another one. An image will not always be coupled to text.
- Make the retriever understand that if it retrieves text it should return `content`, but if it's an image it should return the encoded image content? It's not clear to me, but we need a way to check whether a retrieved document is an image or text so that it can be fed accordingly to a ChatMessage (in the prompt for text, as an ImageContent part for images); a rough sketch of this is below.
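A minimal sketch of that last point, assuming the `ImageContent` dataclass and `ChatMessage.from_user(content_parts=...)` from Haystack's multimodal work, and assuming the image bytes and MIME type were stored in the document's `meta` at write time (an illustrative convention, not an established schema):

```python
from haystack.dataclasses import ChatMessage, Document, ImageContent


def to_chat_message(query: str, retrieved: list[Document]) -> ChatMessage:
    """Build a user message from the query plus the retrieved documents.

    Text documents go into the prompt as plain text; image documents are
    attached as ImageContent parts. Whether a document is an image is decided
    here via an (assumed) `meta["mime_type"]` set at indexing time.
    """
    parts: list = [query]
    for doc in retrieved:
        mime = doc.meta.get("mime_type", "text/plain")
        if mime.startswith("image/"):
            # assumed convention: base64-encoded image stored in meta at write time
            parts.append(ImageContent(base64_image=doc.meta["base64_image"], mime_type=mime))
        else:
            parts.append(doc.content or "")
    return ChatMessage.from_user(content_parts=parts)
```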
@julian-risch @anakin87
Just curious. Images in Haystack pipelines are a highly requested feature, as the underlying OpenAI APIs already support this.
Is there a projected estimate for when the basic functionality can make it into a GA Haystack release?
Hello @lohit8846, we started adding multimodal experimental features in https://github.com/deepset-ai/haystack-experimental.
- More details here: https://github.com/deepset-ai/haystack-experimental/discussions/302
- 📓 Introduction to Multimodal Text Generation notebook
Regarding the timeline for integration into Haystack, we initially assumed a 3-month experimentation period, so the main features could be integrated in August. The timeline could turn out longer or shorter, though.
In any case, if you try the experimental features, please give us feedback in the related GitHub discussion.
Following the introduction of image support in Haystack 2.16.0, we added image support for a larger number of model providers (Amazon Bedrock, Anthropic, Azure, Google, Hugging Face API, Meta Llama API, Mistral, Nvidia, Ollama, OpenAI, OpenRouter, STACKIT) with the 2.17.0 release, so this issue can be closed. There is one open, optional follow-up issue here.
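For anyone landing here, a minimal text+image generation example with the released feature, assuming Haystack >= 2.16, the `ImageContent.from_file_path` helper, and an `OPENAI_API_KEY` in the environment (file name and model name are just examples):

```python
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage, ImageContent

# Load a local image; from_file_path handles base64 encoding and the MIME type.
image = ImageContent.from_file_path("chart.png")

# A user message mixing a text part and an image part.
message = ChatMessage.from_user(content_parts=["Describe what this chart shows.", image])

generator = OpenAIChatGenerator(model="gpt-4o-mini")  # any vision-capable model
result = generator.run(messages=[message])
print(result["replies"][0].text)
```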