dify Add support multimodal RAG (with ColPali/ColQwen modal embedding)

Self Checks

[X] I have searched for existing issues search for existing issues, including closed ones.
[X] I confirm that I am using English to submit this report (我已阅读并同意 Language Policy).
[X] [FOR CHINESE USERS] 请务必使用英文提交 Issue，否则会被关闭。谢谢！:）
[X] Please do not modify this template :) and fill in all the required fields.

1. Is this request related to a challenge you're experiencing? Tell me about your story.

Dify currently supports only text embedding.
When working with files that contain many images, charts, and tables, these elements need to be converted to text using OCR (Optical Character Recognition) and Document Layout Detection tools.
Colpali, a multi-modal embedding model, demonstrates impressive Retrieval-Augmented Generation (RAG) performance compared to traditional RAG models.

2. Additional context or comments

This my demo. The retrieves images (after converting pages in a file into images) related to a query and utilizes multimodal LLMs to generate answers. The results are highly impressive.

3. Can you help us with this feature?

[X] I am interested in contributing to this feature.

Nov 22 '24 10:11 k1eNdn

@k1endn, how did you build the chatbot in your demo?

Apr 28 '25 07:04 SkalaFrost

@SkalaFrost Yes, it looks like this

Apr 28 '25 10:04 k1eNdn

@k1endn , Can you guide me on how to do it?

Apr 29 '25 03:04 SkalaFrost

Did you contribute to fix this? I don't understand how it was closed

Jun 04 '25 11:06 nPeppon

amazing, can you share us in details how you do this? Showing how you created your workflow, is there any custom tools or plugins that you used...

Jun 12 '25 06:06 xzneozx96

@xzneozx96 I made many modifications in version 0.6.8. Now I'm reviewing the latest version of the code to see if it's possible to develop a multi-modal RAG plugin.

Jun 12 '25 09:06 k1eNdn

@xzneozx96 I made many modifications in version 0.6.8. Now I'm reviewing the latest version of the code to see if it's possible to develop a multi-modal RAG plugin.

amazing, looking forward to it, however, can u share how to make it work behind the scene? Currently, even though Dify RAG system is powerful but it's still text-only, unable to retrieve both text and images/videos (even just url) to be included in the LLM's response. Example: when you ask about a location, you would expect the AI to show some images or videos about this location, seem built-in RAG system of Dify is not able to do this

Jun 13 '25 06:06 xzneozx96

Hi, @k1eNdn. I'm Dosu, and I'm helping the Dify team manage their backlog and am marking this issue as stale.

Issue Summary:

You requested adding multimodal Retrieval-Augmented Generation (RAG) support in Dify using ColPali/ColQwen embeddings to handle images, charts, and tables beyond OCR text conversion.
You shared a demo chatbot and mentioned extensive modifications in version 0.6.8, currently reviewing the latest code to develop a multimodal RAG plugin.
Other users have shown interest in guidance and implementation details for this feature.
The discussion reflects a demand for Dify to support retrieval of both text and multimedia content in responses.
The issue remains unresolved with no recent updates.

Next Steps:

Please let me know if this issue is still relevant to the latest version of Dify by commenting here to keep the discussion open.
Otherwise, I will automatically close this issue in 15 days.

Thank you for your understanding and contribution!

Aug 31 '25 16:08 dosubot[bot]