Feat: Add Cookbook for Content-Based Recommendations with Gemini & Qdrant

Open andycandy opened this issue 8 months ago • 2 comments

Status: Work in Progress / Seeking Feedback

This PR presents the initial implementation of the Gemini+Qdrant recommendation cookbook outlined in #690. While the core functionality is present, I am actively seeking feedback and collaborating with reviewers to refine the notebook's structure, clarity, and explanations before final merging. Please feel free to add comments and suggestions!

Description

This PR introduces a new cookbook example demonstrating how to build a scalable, content-based recommendation system using Google's Gemini API for semantic embeddings and Qdrant as an efficient vector database.

Motivation

As outlined in #690, existing examples often use toy datasets or focus on narrow use cases. This cookbook aims to provide a more practical, near-production-level demonstration of how Large Language Model (LLM) embeddings, specifically from Gemini, can power recommendations across potentially diverse media types without relying on user interaction history (addressing cold-start scenarios).

Solution

A new Jupyter notebook, Movie_Recommendation, has been added. This notebook guides the user through the following steps:

  1. Setup: Installs necessary libraries (google-genai, qdrant-client, pandas), configures the Gemini API client, and initializes the Qdrant client.
  2. Data Loading & Preparation:
    • Loads a sample of the TMDB Movies Dataset. Note: The notebook uses a sample (e.g., 5000 items) for demonstration purposes due to the scale of the full dataset and API costs associated with embedding millions of items. Clear comments explain this and how to adapt for the full dataset.
    • Selects relevant features (title, overview, genres, keywords, tagline, release_date).
    • Performs data cleaning: handles missing titles, fills missing optional text fields, and extracts the release year.
    • Constructs a combined text_for_embedding string for each movie, leveraging available metadata.
  3. Embedding & Indexing with Qdrant:
    • Defines a function (get_embeddings_batch) to efficiently generate embeddings for batches of text using the Gemini embedding-001 model, including retry logic.
    • Creates a Qdrant collection with appropriate vector parameters (size 768, Cosine distance).
    • Implements a batched upsert process:
      • Iterates through the data sample in batches.
      • Generates embeddings for each batch via the Gemini API.
      • Creates Qdrant PointStruct objects containing the vector, a unique ID (movie_id), and a structured payload (movie metadata like title, genre, year, overview).
      • Upserts these points into the Qdrant collection in batches for efficiency.
    • Verifies the number of points indexed in Qdrant.
  4. Querying & Recommendation:
    • Defines a recommend_movies function that:
      • Takes a natural language query (e.g., movie title, theme description).
      • Generates an embedding for the query using Gemini (task_type="RETRIEVAL_QUERY").
      • Performs a similarity search against the indexed movie vectors in Qdrant using client.search.
      • Returns the top K most similar movies, including their metadata (payload) and similarity scores.
    • Includes example queries to demonstrate the recommendation functionality.
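
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the notebook's code: `toy_embed` is a hypothetical stand-in for the Gemini embedding call, `top_k` is an in-memory stand-in for the cosine search that Qdrant performs, and the `field: value` text layout is only one possible way to build `text_for_embedding`.

```python
import math
import time
from typing import Callable, Dict, Iterator, List, Sequence, Tuple

Vector = List[float]
EmbedFn = Callable[[List[str]], List[Vector]]


def build_text_for_embedding(movie: Dict[str, object]) -> str:
    """Join the available metadata fields into one string, skipping empty ones."""
    fields = ["title", "tagline", "overview", "genres", "keywords"]
    parts = [f"{name}: {movie[name]}" for name in fields if movie.get(name)]
    return " | ".join(parts)


def batched(items: Sequence, batch_size: int) -> Iterator[list]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def embed_with_retry(embed_fn: EmbedFn, texts: List[str],
                     max_retries: int = 3, base_delay: float = 1.0) -> List[Vector]:
    """Call embed_fn, retrying with exponential backoff when it raises."""
    for attempt in range(max_retries):
        try:
            return embed_fn(texts)
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            time.sleep(base_delay * 2 ** attempt)
    raise ValueError("max_retries must be at least 1")


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Vector, index: List[Tuple[int, Vector]],
          k: int = 5) -> List[Tuple[int, float]]:
    """Return the k (movie_id, score) pairs with the highest cosine similarity."""
    scored = [(movie_id, cosine_similarity(query_vec, vec)) for movie_id, vec in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def toy_embed(texts: List[str]) -> List[Vector]:
    """Hypothetical embedder: counts three keywords. A real run would call Gemini."""
    vocab = ["space", "love", "war"]
    return [[float(text.lower().count(word)) for word in vocab] for text in texts]


# Made-up sample rows, only for illustration.
movies = [
    {"movie_id": 1, "title": "Star Voyage", "overview": "A crew lost in space.",
     "genres": "Science Fiction"},
    {"movie_id": 2, "title": "Paris Nights", "tagline": "",
     "overview": "Two strangers fall in love."},
    {"movie_id": 3, "title": "Last Front", "overview": "A soldier survives the war."},
]

pairs = [(m["movie_id"], build_text_for_embedding(m)) for m in movies]

# Embed in batches and collect (id, vector) pairs, mirroring the batched upsert.
index: List[Tuple[int, Vector]] = []
for batch in batched(pairs, 2):
    ids = [movie_id for movie_id, _ in batch]
    vectors = embed_with_retry(toy_embed, [text for _, text in batch], base_delay=0.0)
    index.extend(zip(ids, vectors))

# Embed the query with the same function and rank by cosine similarity.
results = top_k(toy_embed(["a love story"])[0], index, k=2)
```

Passing the embedding function in as a parameter keeps the batching and retry logic independent of any particular API client, so the same loop can be exercised offline before spending API quota on the real dataset sample.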

Disclaimer

This PR uses the TMDb Movie Dataset for non-commercial purposes only, as per its licensing terms.

  • The dataset is licensed under CC BY-NC 4.0, which restricts usage to non-commercial projects.
  • Proper attribution has been provided to TMDb as required by the license.

By submitting this PR, I confirm that:

  1. This dataset is used solely for educational and demonstration purposes.
  2. Any further use of this dataset must comply with its licensing terms, and users are responsible for ensuring compliance.

andycandy avatar Apr 11 '25 17:04 andycandy

@markmcd @Giom-V Here is my first rough draft. Could you go through it once?

andycandy avatar Apr 11 '25 17:04 andycandy

@andycandy Overall I like the example, and I think it would be a great addition to the cookbook. I just think it needs way more explanations (but that was a first draft, so that was expected). Also, don't forget to create a README in the folder, update the one in examples/, and maybe update the "What's next" section of the embedding notebooks (and maybe other related examples) to mention this one.

Giom-V avatar Jun 04 '25 09:06 Giom-V

Thanks! I've completed all the required updates: added the README in the qdrant/ folder, updated the main examples/ README, and included this notebook in the "What's next" and related examples sections of the relevant notebooks. I also expanded the explanations throughout the notebook for clarity.

andycandy avatar Jun 04 '25 22:06 andycandy