Multimodal way to retrieve documents
Hello, thank you for the amazing library! I referred to the RAG implementation notebooks from Hugging Face that use this library and loved them. I am wondering whether there is a way to feed in not just a text prompt but also an image, in order to retrieve multimodal documents or images.
To be specific: the user query would contain text as well as an image, both fed into ColPali, which retrieves the top-k most similar documents; those documents, together with the prompt and the image, would then be passed to the VLM to generate an output.
I am not sure whether byaldi supports that integration, or even whether ColPali does, or if there is some other way to do it. I hope this is the right place to raise this issue or discussion!
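For what it's worth, ColPali scores documents with ColBERT-style late interaction (MaxSim), so one way to picture the request is a multi-vector query that contains both text-token and image-patch embeddings. The sketch below is purely illustrative with random NumPy arrays standing in for real ColPali embeddings; the idea of simply concatenating the two sets of query vectors is my assumption, not a documented byaldi or colpali API.

```python
import numpy as np

rng = np.random.default_rng(0)

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    # Late-interaction (ColBERT/ColPali-style) scoring:
    # for each query vector, take its max similarity over all document
    # vectors, then sum those maxima into a single relevance score.
    sims = query_vecs @ doc_vecs.T          # (n_query_vecs, n_doc_vecs)
    return float(sims.max(axis=1).sum())

dim = 128                                    # ColPali uses 128-dim multi-vectors
text_vecs = rng.normal(size=(8, dim))        # stand-in: query text-token embeddings
image_vecs = rng.normal(size=(32, dim))      # stand-in: query image-patch embeddings

# ASSUMPTION: a multimodal query might just concatenate both vector sets.
query_vecs = np.concatenate([text_vecs, image_vecs], axis=0)

# Score a toy corpus of document-page embeddings and keep the top-k.
docs = [rng.normal(size=(64, dim)) for _ in range(10)]
scores = [maxsim_score(query_vecs, d) for d in docs]
top_k = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:3]
print(top_k)
```

Whether the real ColPali checkpoint produces useful image-patch embeddings on the *query* side is exactly the open question here; the sketch only shows that the scoring math accepts such a query without modification.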
Please refer to #71, where your request is discussed.