Multimodal way to retrieve documents
Hello, thank you for the amazing library! I referred to the RAG implementation notebooks from Hugging Face that use this library and loved them. I am wondering whether there is a way to feed in not just a text prompt but also an image, in order to retrieve multimodal documents or images.
To be specific: the user query would contain text as well as an image, both fed into ColPali, which retrieves the top-k most similar documents; those documents, together with the prompt and the image, would then be passed to the VLM to generate an output.
I am not sure whether byaldi supports that integration, or even whether ColPali does, or if there is some other way to do it. I hope this is the right place to raise this issue or discussion!
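For what it's worth, ColPali scores documents with ColBERT-style late interaction (MaxSim), so one way to picture the request is a multi-vector query that contains both text-token and image-patch embeddings. The sketch below is purely illustrative with random NumPy arrays standing in for real ColPali embeddings; the idea of simply concatenating the two sets of query vectors is my assumption, not a documented byaldi or colpali API.

```python
import numpy as np

rng = np.random.default_rng(0)

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    # Late-interaction (ColBERT/ColPali-style) scoring:
    # for each query vector, take its max similarity over all document
    # vectors, then sum those maxima into a single relevance score.
    sims = query_vecs @ doc_vecs.T          # (n_query_vecs, n_doc_vecs)
    return float(sims.max(axis=1).sum())

dim = 128                                    # ColPali uses 128-dim multi-vectors
text_vecs = rng.normal(size=(8, dim))        # stand-in: query text-token embeddings
image_vecs = rng.normal(size=(32, dim))      # stand-in: query image-patch embeddings

# ASSUMPTION: a multimodal query might just concatenate both vector sets.
query_vecs = np.concatenate([text_vecs, image_vecs], axis=0)

# Score a toy corpus of document-page embeddings and keep the top-k.
docs = [rng.normal(size=(64, dim)) for _ in range(10)]
scores = [maxsim_score(query_vecs, d) for d in docs]
top_k = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:3]
print(top_k)
```

Whether the real ColPali checkpoint produces useful image-patch embeddings on the *query* side is exactly the open question here; the sketch only shows that the scoring math accepts such a query without modification.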
Please refer to #71, where your request is discussed.