
Multimodal support

julian-risch opened this issue 9 months ago

We will start with support for text generation from text+image input.

Then we will also focus on basic support for multimodal retrieval/RAG.

julian-risch avatar Mar 05 '25 11:03 julian-risch

WIP draft: https://github.com/deepset-ai/haystack/pull/9167

julian-risch avatar Apr 07 '25 15:04 julian-risch

Improved POC: #9246

anakin87 avatar Apr 17 '25 09:04 anakin87

Adding my thoughts here because for me this is a top priority for the next evolution of our Haystack-based pipelines:

The ability to embed images, store them, and retrieve them from a document store (alongside other text documents). Vision-RAG can help solve a lot of issues for complex files. The idea is to use a multimodal embedding model such as Cohere Embed v4 that can embed both images and text in the same embedding space.

Then, at retrieval, the retriever gets the top-k=1 image and feeds it to a vision-capable LLM (as an ImageContent ChatMessage) together with the user query (top-k=1 because LLMs handle multiple image inputs poorly). This technique solves RAG for graphs, pictures, handwriting, and scanned documents, although it is expensive (images consume more tokens than text).

In Haystack terms we need:

  • Image Embedder (Cohere or self hosted)
  • The possibility to write images to a document store, either under the content key or a separate one. An image will not always be coupled to text.
  • Make the retriever aware that for text documents it should return the content, while for images it should return the encoded image content. It's not entirely clear to me, but we need a way to check whether a retrieved document is an image or text so it can be fed accordingly into a ChatMessage (in the prompt for text, as an ImageContent part for images).
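A minimal sketch of the dispatch step described in the last bullet: split retrieved documents into text for the prompt and base64 image parts for the model. The `Document` dataclass, the `mime_type` field, and `build_message_parts` are illustrative names, not the actual Haystack API.

```python
from dataclasses import dataclass

@dataclass
class Document:
    content: str                    # plain text, or base64-encoded image bytes
    mime_type: str = "text/plain"   # e.g. "image/png" for an image document

def build_message_parts(query: str, docs: list[Document]) -> list[dict]:
    """Turn retrieved documents into chat-message parts:
    text documents go into the prompt, images become image parts."""
    text_chunks = [d.content for d in docs if not d.mime_type.startswith("image/")]
    parts: list[dict] = [
        {"type": "text", "text": query + "\n\nContext:\n" + "\n".join(text_chunks)}
    ]
    for d in docs:
        if d.mime_type.startswith("image/"):
            parts.append(
                {"type": "image", "base64": d.content, "mime_type": d.mime_type}
            )
    return parts
```

With top-k=1 retrieval the image list holds at most one entry, matching the suggestion above that a single image works best with current vision LLMs.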

lambda-science avatar May 21 '25 08:05 lambda-science

@julian-risch @anakin87

Just curious: images in Haystack pipelines are a highly requested feature, as the underlying OpenAI APIs already support this.

Is there a projected estimate for when the basic functionality can make it into the GA Haystack releases?

lohit8846 avatar Jun 03 '25 02:06 lohit8846

Hello @lohit8846, we started adding multimodal experimental features in https://github.com/deepset-ai/haystack-experimental.

Regarding the integration timeline for Haystack: we initially assumed a 3-month experimentation period, so the main features could be integrated in August. The actual timeline could be longer or shorter, though.

In any case, if you try the experimental features, please give us feedback in the related GitHub discussion.

anakin87 avatar Jun 03 '25 08:06 anakin87

Following the introduction of image support in Haystack 2.16.0, we added image support to a larger number of model providers (Amazon Bedrock, Anthropic, Azure, Google, Hugging Face API, Meta Llama API, Mistral, Nvidia, Ollama, OpenAI, OpenRouter, STACKIT) when releasing 2.17.0, and this issue can be closed. There is one open, optional follow-up issue here.
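As a rough illustration of what multimodal message support maps to at the provider level, here is a sketch of building an OpenAI-style chat message with an inline base64 image, using only the standard library. The helper name is hypothetical; Haystack's ChatMessage/ImageContent abstractions hide this provider-specific wire format.

```python
import base64

def openai_image_message(text: str, image_bytes: bytes,
                         mime_type: str = "image/png") -> dict:
    """Build an OpenAI Chat Completions user message combining text
    with an inline base64-encoded image (data URL form)."""
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url",
             "image_url": {"url": f"data:{mime_type};base64,{b64}"}},
        ],
    }
```

Each provider added in 2.17.0 expects its own variant of this structure, which is why a unified ImageContent part in ChatMessage is the useful abstraction.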

julian-risch avatar Sep 19 '25 13:09 julian-risch