[Feature Request] Multimodal RAG support for custom retrievers
Problem Description
Agno doesn't support multimodal RAG out of the box. Users can implement custom retrievers, but the current implementation doesn't handle non-text content types such as images, audio, and video. This makes it difficult to build comprehensive retrieval pipelines that can process and retrieve multimodal information seamlessly.
Proposed Solution
Implement multimodal RAG support for custom retrievers that can extract and process various media types (images, audio, video) from the knowledge base. The solution should:
- Provide a standardized interface for custom retrievers to access these media elements
- Allow the extracted media to be added to dedicated object lists on the agent based on media type (images, audio, videos)
- Support vector embeddings for these media types to enable similarity search
- Ensure the retriever can return both the document context and the associated media elements in search results
The implementation should maintain backward compatibility with existing text-based retrieval systems while extending functionality to handle multimodal content.
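To make the proposed interface concrete, here is a minimal sketch of what such a custom retriever could return. Everything beyond the general shape is an assumption rather than Agno's current API: the `multimodal_retriever` name, the `media` key in the result dicts, and the `agent.knowledge.vector_db.search(...)` call are hypothetical, while the `(agent, query, num_documents, **kwargs)` signature roughly follows Agno's existing custom-retriever convention.

```python
from typing import Any, Optional

# Illustrative sketch only -- not Agno's actual API. The "media" key and the
# vector_db.search(...) call are assumptions about one possible design.
def multimodal_retriever(
    agent: Any,
    query: str,
    num_documents: Optional[int] = None,
    **kwargs: Any,
) -> Optional[list[dict]]:
    """Return text context plus any media associated with each hit."""
    # Assumed: the knowledge base exposes a vector store whose search()
    # returns document-like hits with `content` and `meta_data` attributes.
    hits = agent.knowledge.vector_db.search(query=query, limit=num_documents or 5)

    results: list[dict] = []
    for hit in hits:
        meta = hit.meta_data or {}
        results.append(
            {
                "content": hit.content,  # text context, as today
                # Proposed extension: media grouped by type so the agent can
                # route each group into its images/audio/videos object lists.
                "media": {
                    "images": meta.get("images", []),
                    "audio": meta.get("audio", []),
                    "videos": meta.get("videos", []),
                },
            }
        )
    return results
```

An agent-side hook would then move each `media` group into the agent's dedicated image/audio/video lists, and media-type-aware embeddings (e.g., CLIP-style image embeddings) would let the media itself participate in similarity search rather than merely ride along with the text.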
Alternatives Considered
I've explored a few workarounds but none provide a complete solution:
- Add non-text content as metadata on the current `agno.document` setup. This approach keeps the document content text-only while storing images, audio, or video as metadata. While this maintains backward compatibility, it creates an artificial separation between text and the other modalities, making it difficult to treat all content types with equal importance in retrieval (a sketch follows this section).
- Generalize the `agno.document` class to work for all modalities, i.e., let its member variable `content` hold any modality (text, image, audio, video). This approach would require significant refactoring of existing code that assumes content is text-based, potentially breaking compatibility with current implementations.
Both approaches present challenges in determining whether non-text content should be treated as primary content or supplementary metadata, making it difficult to choose a consistent approach that works well across different use cases.
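For concreteness, the first workaround looks roughly like this; the import path and the `image_urls` metadata key are assumptions, since Agno's `Document` schema doesn't define a media convention today.

```python
from agno.document import Document  # assumed import path for Agno's Document

# Workaround 1 sketch: keep `content` text-only and stash media in metadata.
# "image_urls" is a hypothetical key, not part of Agno's schema.
doc = Document(
    content="Text description of an architecture diagram goes here.",
    meta_data={
        "image_urls": ["https://example.com/pipeline.png"],
        "source": "architecture-notes.md",
    },
)

# The drawback noted above: only `content` gets embedded, so the image never
# participates in similarity search -- it is just carried along as metadata.
```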
Additional context
Would you like to work on this?
[ ] Yes, I'd love to work on it!
[x] I'm open to collaborating but need guidance.
[ ] No, I'm just sharing the idea.
Hey @DavidAkinpelu, Thank you so much for raising this — really appreciate you taking the time to share your thoughts! I’ll add this to our community wishlist and bring it up with the team internally.
Also, since Agno is open source, if you feel excited about it, feel free to give it a go and take a crack at building it! We'd absolutely love to see your contribution and are always happy to support community-led efforts 🙌
Hey @monali7-d, I am open to contributing. There are a few ways to go about this, as I mentioned above. Let me know which approach the team prefers and I will make a PR.
Hey @DavidAkinpelu, Thanks again for sharing this! We're adding it to our roadmap. If you're still interested, feel free to give it a go — we'd love to see what you come up with. And of course, if you run into any issues or have questions, don’t hesitate to reach out. We're here to help!