camel
[Feature Request] Multi-modal RAG (Retrieval-Augmented Generation)
Required prerequisites
- [X] I have searched the Issue Tracker and Discussions that this hasn't already been reported. (+1 or comment there if it has.)
- [X] Consider asking first in a Discussion.
Motivation
The current RAG can only retrieve text. It cannot retrieve or extract information from images, audio, or other modalities.
Solution
For images:
- [ ] Integrate a VLM-based image-text matching embedding model (CLIP)
- [ ] Add image retrieval to the RAG function
For other modalities: TBC
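
To make the image items more concrete, here is a minimal sketch of how text-to-image retrieval over a shared embedding space could look. It is not the existing CAMEL API: `ImageRetriever`, `ImageRecord`, and `toy_embed` are hypothetical names, and `toy_embed` is only a keyword-counting stand-in for a real CLIP image/text encoder so that the example runs without model weights. The assumption is that the real encoders would map images and text queries into the same vector space, as CLIP does.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class ImageRecord:
    """Hypothetical storage unit: an image path plus its embedding."""
    path: str
    embedding: List[float]


class ImageRetriever:
    """Hypothetical in-memory retriever over image embeddings.

    embed_image and embed_text are assumed to map into the same vector
    space (in practice, the image and text encoders of a CLIP-style model).
    """

    def __init__(
        self,
        embed_image: Callable[[str], List[float]],
        embed_text: Callable[[str], List[float]],
    ) -> None:
        self._embed_image = embed_image
        self._embed_text = embed_text
        self._records: List[ImageRecord] = []

    def add(self, path: str) -> None:
        """Embed an image once at indexing time and store the result."""
        self._records.append(ImageRecord(path, self._embed_image(path)))

    def query(self, text: str, top_k: int = 1) -> List[Tuple[str, float]]:
        """Return the top_k (path, score) pairs for a text query."""
        query_vec = self._embed_text(text)
        scored = [
            (rec.path, cosine_similarity(query_vec, rec.embedding))
            for rec in self._records
        ]
        scored.sort(key=lambda item: item[1], reverse=True)
        return scored[:top_k]


# Stand-in encoder (assumption, NOT CLIP): counts keyword occurrences in a
# string. For "images" it reads the file name instead of the pixels.
TOY_VOCAB = ["cat", "dog", "car"]


def toy_embed(text: str) -> List[float]:
    lowered = text.lower()
    return [float(lowered.count(word)) for word in TOY_VOCAB]


if __name__ == "__main__":
    retriever = ImageRetriever(embed_image=toy_embed, embed_text=toy_embed)
    for image_path in ["photos/cat_01.png", "photos/dog_02.png", "photos/car_03.png"]:
        retriever.add(image_path)
    print(retriever.query("a photo of a dog", top_k=1))
```

Passing the encoders in as callables keeps the retrieval logic independent of the embedding model, so the CLIP integration and the retrieval function could be developed and tested separately. A production version would likely replace the linear scan with a vector store.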