Added 'Multimodal_RAG_for_youtube_videos' to the examples folder
This PR adds a new notebook demonstrating a Multimodal Retrieval-Augmented Generation (RAG) pipeline that integrates:
- Gemini 2.0 Flash for multimodal reasoning over text and images.
- LlamaIndex for indexing and retrieving relevant documents.
- Qdrant as the local vector store for both text and image embeddings.
It extracts the transcript and frames (as images) from YouTube videos and performs multimodal RAG based on a user query: it retrieves the most relevant text and images, then uses Gemini 2.0 Flash to generate responses grounded in both modalities.
I have included two example queries from a video that show the importance of capturing the visual aspect of a video for an accurate response.
This highlights the value of multimodal context, especially when visual elements (such as slides, charts, or diagrams) are required to answer the query, something purely text-based RAG may miss.
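For reference, the core indexing and retrieval step looks roughly like this (a minimal sketch rather than the exact notebook code: the `mixed_data/` folder, collection names, and query are placeholders, and LlamaIndex's default image embedding model is assumed):

```python
import qdrant_client
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Local (on-disk) Qdrant instance with separate collections for text and images.
client = qdrant_client.QdrantClient(path="qdrant_db")
text_store = QdrantVectorStore(client=client, collection_name="text_collection")
image_store = QdrantVectorStore(client=client, collection_name="image_collection")
storage_context = StorageContext.from_defaults(
    vector_store=text_store, image_store=image_store
)

# "mixed_data/" is assumed to hold the transcript chunks (.txt) and frames (.jpg).
documents = SimpleDirectoryReader("./mixed_data/").load_data()
index = MultiModalVectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

# Retrieve the most relevant text chunks and frames for a query.
retriever = index.as_retriever(similarity_top_k=3, image_similarity_top_k=3)
nodes = retriever.retrieve("What architecture does the speaker describe?")
```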
Going through the notebook, I noticed it's relying on a bunch of external libraries, some of which seem to overlap with what's already available in the Gemini module. Might be worth tidying that up a bit? Right now it feels a bit all over the place, with random Hugging Face bits and some extra packages that don't seem needed.
Also, I noticed a few deprecated modules in there. We’re trying to move away from those, just to keep things future-proof.
One other thing (not a big deal, just mentioning it): some parts of the notebook give off a bit of an AI-generated vibe. Not saying that's bad, but it adds friction when you're trying to follow along. If we could clean it up and make it more straightforward, I think it'd be much easier to use and more in line with what we're aiming for in the project.
Thank you for the insights, I understand your point. Locally I have migrated to the new SDK, and instead of downloading the video I am using yt_dlp to simply extract frames.
For the RAG pipeline, what could be done is either:
- summarize the images with a Gemini model and then embed the summaries with the Gemini text embedding model, or
- directly use Gemini's multimodal embedding model.

We then store those embeddings in a local vector store like Qdrant or FAISS. A rough sketch of the first option is below.
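Sketch of option (a) with the new SDK (the model names and frame path are just examples, and the client is assumed to read the API key from the environment):

```python
from google import genai
from PIL import Image

client = genai.Client()  # expects GOOGLE_API_KEY in the environment

# (a) Describe a frame with a Gemini model, then embed the description as text.
frame = Image.open("frames/frame_0042.jpg")  # hypothetical extracted frame
caption = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[frame, "Describe this video frame in one detailed paragraph."],
).text

embedding = client.models.embed_content(
    model="text-embedding-004",
    contents=caption,
).embeddings[0].values  # vector to upsert into Qdrant/FAISS
```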
What are your views on this?
@YashviMehta03 Thanks for the submission.
As @andycandy pointed out, the main issue is that you are using a number of SDKs that violate the YouTube Terms and Conditions. I strongly suggest checking what can be done with the official YouTube integration, updating to the new SDK, and coming back to me for a new review.
Hello @Giom-V, thank you for your suggestions!
I have made the following changes:
- Updated to the new SDK.
- Instead of downloading the YouTube video, I am directly extracting frames.
- Replaced the YouTube Transcript API with a prompt asking the model to extract the transcript.
- Changed the link at the end; it now points to embeddings.
- Added a MODEL_ID variable to toggle between different models.
But there is an issue with using a prompt to extract the transcript: it only extracts about 7-10 minutes of the transcript, not the complete one. What could be the reason? I checked the output tokens as well, and they are well within the limit.
Could you kindly review and give feedback?
Marking this pull request as stale since it has been open for 14 days with no activity. This PR will be closed if no further activity occurs.
Thanks for the updates, @YashviMehta03.
> Instead of downloading the YouTube video, I am directly extracting frames.

That's still a problem: extracting the frames is also against the YouTube Terms and Conditions. I think you should just drop that part. To be honest, I don't think there's any way to make it work officially.
> But there is an issue with using a prompt to extract the transcript: it only extracts about 7-10 minutes of the transcript, not the complete one.

I think you are hitting the model's output limit. It's fine for a demo that way; maybe either select a short enough video, or ask Gemini to create a detailed summary with timestamps instead of the full transcript.
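If it helps, here is a minimal sketch of that suggestion using the Gemini API's official YouTube URL support in the new SDK (the video URL and model name are placeholders):

```python
from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=types.Content(
        parts=[
            # Official YouTube integration: pass the video link as file_data.
            types.Part(
                file_data=types.FileData(
                    file_uri="https://www.youtube.com/watch?v=VIDEO_ID"
                )
            ),
            types.Part(
                text="Create a detailed summary of this video with timestamps."
            ),
        ]
    ),
)
print(response.text)
```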
Marking this pull request as stale since it has been open for 14 days with no activity. This PR will be closed if no further activity occurs.
This pull request was closed because it has been inactive for 27 days. Please open a new pull request if you need further assistance. Thanks!