fiftyone-multimodal-rag-plugin
                                
                        Testbed for multimodal retrieval augmented generation techniques with FiftyOne, LlamaIndex, and Milvus
Multimodal RAG with FiftyOne, LlamaIndex, and Milvus
Introduction
Retrieval augmented generation (RAG) has grown increasingly popular as a way to improve the quality of text generated by large language models. Now that multimodal LLMs are in vogue, it's time to extend RAG to multimodal data.
When we add the ability to search and retrieve data across multiple modalities, we get a powerful tool for interacting with today's most capable AI models. However, we also add new layers of complexity to the process.
Some of the considerations we need to take into account include:
- How do we chunk and index multimodal data? Do we split it into separate modalities or keep it together?
- How do we search multimodal data? Do we search each modality separately and then combine the results, or do we search them together?
- What new strategies can we use to improve the quality of the data we generate?
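To make the "search each modality separately and then combine the results" option concrete, here is a minimal sketch of one common way to merge ranked lists, reciprocal rank fusion. This is an illustration only: nothing here is taken from the plugin, and the function name, the constant `k=60`, and the file names are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1. The name and the default k are illustrative.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by ID so the output is deterministic
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a text search and an image search
text_hits = ["report.txt", "chart.png", "notes.txt"]
image_hits = ["chart.png", "photo.jpg"]

fused = reciprocal_rank_fusion([text_hits, image_hits])
print(fused)  # chart.png comes first because it appears in both lists
```

Documents that show up in both rankings rise to the top, which is often the behavior you want when the two searches agree. Searching the modalities together in one shared embedding space avoids the fusion step entirely, at the cost of needing a model that embeds both.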
On a more practical level, here are some of the basic knobs we can turn:
- Text embedding model: Which model do we use to embed the text?
- Image representation: Do we embed the image with a multimodal model (like CLIP) or use captions?
- How many image and text results do we want to retrieve?
- Which multimodal model do we use to generate our retrieval-augmented results?
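The knobs listed above can be captured in a small configuration object, which makes it easier to sweep over combinations. The sketch below is hypothetical: the class, its field names, and the placeholder model names are assumptions made for illustration and are not part of the plugin.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class RagExperimentConfig:
    """Hypothetical container for one multimodal RAG experiment."""

    text_embedding_model: str
    image_representation: str  # "clip" or "captions"
    num_text_results: int = 3
    num_image_results: int = 3
    generation_model: str = "placeholder-multimodal-llm"

    def __post_init__(self):
        # Reject settings that no retrieval run could use
        if self.image_representation not in ("clip", "captions"):
            raise ValueError("image_representation must be 'clip' or 'captions'")
        if self.num_text_results < 0 or self.num_image_results < 0:
            raise ValueError("result counts must be non-negative")


# Build a small grid: two image representations x two retrieval budgets
grid = [
    RagExperimentConfig(
        text_embedding_model="placeholder-text-embedder",
        image_representation=representation,
        num_text_results=num_text,
        num_image_results=num_image,
    )
    for representation, (num_text, num_image) in product(
        ["clip", "captions"], [(2, 2), (5, 3)]
    )
]
print(len(grid), "configurations to try")
```

Freezing the dataclass keeps each configuration hashable, so it can be used as a dictionary key when recording results per experiment.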
This project is a testbed for exploring these questions and more. It uses three open source libraries, FiftyOne, LlamaIndex, and Milvus, to make the process of working with multimodal data, experimenting with different multimodal RAG techniques, and finding what works best for your use-case as easy as possible.
⚠️ This project is a work in progress. It may be rough around the edges, and some features may not work as expected. If you run into any issues, please open an issue on this repository or, better yet, submit a pull request!
Also note that LlamaIndex frequently updates its API. This is why the versions of LlamaIndex and its associated packages are all pinned.
Installation
First, install FiftyOne:
pip install fiftyone
Next, using FiftyOne's CLI syntax, download and install the FiftyOne Multimodal RAG plugin:
fiftyone plugins download https://github.com/jacobmarks/fiftyone-multimodal-rag-plugin
LlamaIndex has a verbose installation process (if you want to build anything multimodal at least). Fortunately for you, this (and all other install dependencies) will be taken care of with the following command:
fiftyone plugins requirements @jacobmarks/multimodal_rag --install
Usage
Setup
To get started, launch the FiftyOne App. You can do so from the terminal by running:
fiftyone app launch
Or you can run the following Python code:
import fiftyone as fo
session = fo.launch_app()
Creating a Multimodal Dataset
Now press the backtick key (`) and type create_dataset_from_llama_documents.
Press Enter to open the operator's modal. This operator gives you a UI to choose
a directory containing your multimodal data (images, text files, PDFs, etc.) and create a FiftyOne dataset from it.
Once you've selected a directory, execute the operator. It will create a new dataset in your FiftyOne session. For text files, you will see an image rendering of the truncated text. For images, you will see the image itself.
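Before pointing the operator at a directory, it can help to check what the directory actually contains. The following is a rough, standard-library sketch that groups files by modality based on their extension. It is not the plugin's implementation, and the extension sets and function name are assumptions.

```python
from pathlib import Path

# Assumed extension sets, for illustration only
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}
TEXT_EXTENSIONS = {".txt", ".md", ".pdf"}


def group_by_modality(directory):
    """Recursively group the files under `directory` by assumed modality."""
    groups = {"image": [], "text": [], "other": []}
    for path in sorted(Path(directory).rglob("*")):
        if not path.is_file():
            continue
        extension = path.suffix.lower()
        if extension in IMAGE_EXTENSIONS:
            groups["image"].append(path)
        elif extension in TEXT_EXTENSIONS:
            groups["text"].append(path)
        else:
            groups["other"].append(path)
    return groups
```

Calling `group_by_modality("/path/to/data")` returns a dictionary of path lists, so `len(groups["image"])` and `len(groups["text"])` give a quick preview of how many samples of each kind to expect.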
💡 You can add additional directories of multimodal data with the add_llama_documents_to_dataset operator.
Indexing the Multimodal Dataset
Now that you have a multimodal dataset, you can index it with LlamaIndex and Milvus.
Use the create_multimodal_rag_index operator to start this process. This operator
will prompt you to name the index and will give you the option to index the images
via CLIP embeddings or captions. If you choose captions, you will be prompted to select
the text field to use as the caption.
💡 If you do not have captions on your dataset, you might be interested in the FiftyOne Image Captioning Plugin.
fiftyone plugins download https://github.com/jacobmarks/fiftyone-image-captioning-plugin
Inspecting an Index
Once you have created an index, you can inspect it by running the get_multimodal_rag_index_info operator
and selecting the index you want to inspect from the dropdown.
Querying the Index
Finally, you can query the index with the query_multimodal_rag_index operator.
This operator will prompt you to enter a query string and select an index to query.
You can also specify the multimodal model to use for generating the retrieval-augmented results, as well as the number of image results and text results to retrieve.
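To illustrate what the two result counts control, here is a toy sketch of retrieving a separate top-k for text and for images by cosine similarity. It assumes the query, text, and image vectors live in one shared embedding space (as they would with a CLIP-style model). The vectors, file names, and function names are made up, and the code does not use LlamaIndex or Milvus.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, items, k):
    """Return the IDs of the k items most similar to the query.

    `items` is a list of (item_id, vector) pairs.
    """
    ranked = sorted(
        items,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [item_id for item_id, _ in ranked[:k]]


def retrieve(query_vector, text_items, image_items, num_text, num_image):
    """Retrieve a separate top-k for each modality."""
    return {
        "text": top_k(query_vector, text_items, num_text),
        "image": top_k(query_vector, image_items, num_image),
    }


# Toy 2-D embeddings, purely illustrative
text_items = [
    ("intro.txt", [0.9, 0.1]),
    ("appendix.txt", [0.1, 0.9]),
    ("summary.txt", [0.7, 0.3]),
]
image_items = [("cat.png", [0.2, 0.8]), ("diagram.png", [0.8, 0.2])]

results = retrieve([1.0, 0.0], text_items, image_items, num_text=2, num_image=1)
print(results)
```

Raising either count gives the generating model more context to work with, but it also increases prompt size and the chance of including irrelevant results, which is why the two counts are worth tuning independently.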