Feat: Add Cookbook for Content-Based Recommendations with Gemini & Qdrant
Status: Work in Progress / Seeking Feedback
This PR presents the initial implementation of the Gemini+Qdrant recommendation cookbook outlined in #690. While the core functionality is present, I am actively seeking feedback and collaborating with reviewers to refine the notebook's structure, clarity, and explanations before final merging. Please feel free to add comments and suggestions!
Description
This PR introduces a new cookbook example demonstrating how to build a scalable, content-based recommendation system using Google's Gemini API for semantic embeddings and Qdrant as an efficient vector database.
Motivation
As outlined in #690, existing examples often use toy datasets or focus on narrow use cases. This cookbook aims to provide a more practical, near-production-level demonstration showing how Large Language Model (LLM) embeddings, specifically from Gemini, can power recommendations across potentially diverse media types without relying on user interaction history (addressing cold-start scenarios).
Solution
A new Jupyter notebook, Movie_Recommendation, has been added. This notebook guides the user through the following steps:
- Setup: Installs necessary libraries (`google-genai`, `qdrant-client`, `pandas`), configures the Gemini API client, and initializes the Qdrant client.
- Data Loading & Preparation:
  - Loads a sample of the TMDB Movies Dataset. Note: The notebook uses a sample (e.g., 5000 items) for demonstration purposes due to the scale of the full dataset and API costs associated with embedding millions of items. Clear comments explain this and how to adapt for the full dataset.
  - Selects relevant features (`title`, `overview`, `genres`, `keywords`, `tagline`, `release_date`).
  - Performs data cleaning: handles missing titles, fills missing optional text fields, and extracts the release year.
  - Constructs a combined `text_for_embedding` string for each movie, leveraging available metadata.
- Embedding & Indexing with Qdrant:
  - Defines a function (`get_embeddings_batch`) to efficiently generate embeddings for batches of text using the Gemini `embedding-001` model, including retry logic.
  - Creates a Qdrant collection with appropriate vector parameters (size 768, Cosine distance).
  - Implements a batched upsert process:
    - Iterates through the data sample in batches.
    - Generates embeddings for each batch via the Gemini API.
    - Creates Qdrant `PointStruct` objects containing the vector, a unique ID (`movie_id`), and a structured `payload` (movie metadata like title, genre, year, overview).
    - Upserts these points into the Qdrant collection in batches for efficiency.
  - Verifies the number of points indexed in Qdrant.
- Querying & Recommendation:
  - Defines a `recommend_movies` function that:
    - Takes a natural language query (e.g., movie title, theme description).
    - Generates an embedding for the query using Gemini (`task_type="RETRIEVAL_QUERY"`).
    - Performs a similarity search against the indexed movie vectors in Qdrant using `client.search`.
    - Returns the top K most similar movies, including their metadata (payload) and similarity scores.
  - Includes example queries to demonstrate the recommendation functionality.
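For reviewers who want a quick mental model of the steps above without opening the notebook, here is a minimal, dependency-free sketch of the same flow: build the text, embed in batches with retries, then rank by cosine similarity. It is not the notebook's code. `toy_embed` is a made-up stand-in for the Gemini embedding call, the in-memory `points` list stands in for the Qdrant collection, and every function name, field order, and sample movie below is hypothetical.

```python
import math
import time
from typing import Callable, Sequence

Vector = list[float]
EmbedFn = Callable[[Sequence[str]], list[Vector]]


def build_text_for_embedding(movie: dict) -> str:
    """Join the non-empty metadata fields into one labelled string."""
    fields = ("title", "tagline", "overview", "genres", "keywords")
    parts = [f"{name.capitalize()}: {movie[name]}" for name in fields if movie.get(name)]
    return "\n".join(parts)


def embed_in_batches(texts: Sequence[str], embed_fn: EmbedFn, batch_size: int = 100,
                     max_retries: int = 3, base_delay: float = 1.0) -> list[Vector]:
    """Embed texts batch by batch, retrying a failed batch with exponential backoff."""
    vectors: list[Vector] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for attempt in range(max_retries):
            try:
                vectors.extend(embed_fn(batch))
                break
            except Exception:
                # Assumption: any exception is treated as transient here.
                # Real code should catch only the API client's retryable errors.
                if attempt == max_retries - 1:
                    raise
                time.sleep(base_delay * 2 ** attempt)
    return vectors


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Vector, points: list, k: int = 5) -> list:
    """Rank (id, vector, payload) tuples by similarity; return (score, id, payload)."""
    scored = [(cosine(query_vec, vec), pid, payload) for pid, vec, payload in points]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Hypothetical stand-in for the Gemini call: counts a few vocabulary words per text.
VOCAB = ("space", "robot", "love", "war")


def toy_embed(batch: Sequence[str]) -> list[Vector]:
    return [[float(text.lower().count(word)) for word in VOCAB] for text in batch]


# Made-up sample rows, not taken from the TMDB dataset.
movies = [
    {"movie_id": 1, "title": "Star Drift",
     "overview": "A robot crew lost in space.", "genres": "Science Fiction"},
    {"movie_id": 2, "title": "Paper Hearts",
     "overview": "Two strangers fall in love by letter.", "genres": "Romance"},
]

texts = [build_text_for_embedding(m) for m in movies]
vectors = embed_in_batches(texts, toy_embed, batch_size=1)
# In-memory stand-in for the indexed collection: (id, vector, payload).
points = [(m["movie_id"], v, {"title": m["title"]}) for m, v in zip(movies, vectors)]

query_vec = toy_embed(["a robot adventure in space"])[0]
results = top_k(query_vec, points, k=1)
print(results[0][2]["title"])
```

In the notebook itself, the embedding function calls Gemini and the ranking is delegated to Qdrant's similarity search; the sketch only shows how the pieces fit together and why the embedding function is worth wrapping in a retry loop.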
Disclaimer
This PR uses the TMDb Movie Dataset for non-commercial purposes only, as per its licensing terms.
- The dataset is licensed under CC BY-NC 4.0, which restricts usage to non-commercial projects.
- Proper attribution has been provided to TMDb as required by the license.
By submitting this PR, I confirm that:
- This dataset is used solely for educational and demonstration purposes.
- Any further use of this dataset must comply with its licensing terms, and users are responsible for ensuring compliance.
@markmcd @Giom-V Here is my first rough draft. Could you go through it once?
@andycandy Overall I like the example, and I think it would be a great addition to the cookbook. I just think it needs way more explanations (but that was a first draft, so that was expected). Also, don't forget to create a README in the folder, update the one in examples/, and maybe update the "What's next" section of the embedding notebooks (and maybe other related examples) to mention this one.
Thanks! I've completed all the required updates: added the README in the qdrant/ folder, updated the main examples/ README, and included this notebook in the "What's next" and related examples sections of the relevant notebooks. I also expanded the explanations throughout the notebook for clarity.