Batch Prediction with Long Context and Context Caching using Google Gemini API in Colab - #550

Open william-Dic opened this issue 9 months ago • 4 comments

Description of the feature request:

This feature request aims to develop a robust, production-ready code sample that demonstrates how to perform batch prediction with the Google Gemini API using long context and context caching. The primary use case is to extract information from large video content—such as lectures or documentaries—by asking multiple, potentially interconnected, questions.

Key aspects of the feature include (a rough code sketch follows the list):

  • Batch Prediction: Efficiently submitting a batch of questions in a way that minimizes API calls and handles rate limits, possibly by dividing the questions into smaller batches.
  • Long Context Handling: Leveraging Gemini’s long context capabilities to provide the entire video transcript or segmented summaries as context. This includes strategies to segment and summarize transcripts that exceed maximum context limits.
  • Context Caching: Implementing persistent context caching (using, for example, a JSON file) to store and reuse previous summarizations and conversation history, thereby reducing redundant API calls and improving response times.
  • Interconnected Questions: Supporting conversational history so that each question can build upon previous answers, leading to more accurate and relevant responses.
  • Output Formatting: Delivering clear, structured, and user-friendly outputs, with potential enhancements like clickable links to relevant video timestamps.
  • Robust Error Handling: Ensuring the solution gracefully handles network errors, API failures, and invalid inputs through retries and exponential backoff.
  • Multi-Language Support: Allowing the user to specify the transcript language, accommodating videos in different languages.
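
As an illustration of how the caching and batched-question pieces could fit together, here is a rough sketch using the google-genai Python SDK. The model version, file path, system instruction, TTL, and questions below are placeholders, and the exact upload/polling details may differ by SDK version:

```python
import time
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Upload the source video once and wait until it has been processed.
video = client.files.upload(file="lecture.mp4")   # placeholder path
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = client.files.get(name=video.name)

# Create an explicit cache so the long context is tokenized once and
# reused for every follow-up question.
cache = client.caches.create(
    model="gemini-2.0-flash-001",                 # placeholder model version
    config=types.CreateCachedContentConfig(
        contents=[video],
        system_instruction="Answer questions about this lecture.",
        ttl="3600s",                              # keep the cache for one hour
    ),
)

# Ask a batch of interconnected questions; the chat object carries the
# conversation history so later questions can build on earlier answers.
chat = client.chats.create(
    model="gemini-2.0-flash-001",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
questions = [
    "What are the three main topics covered?",
    "Summarize the second topic in two sentences.",
]
for question in questions:
    print(chat.send_message(question).text)
```

The JSON-file persistence, transcript segmentation, and retry/backoff logic described in the list would wrap around these calls.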

What problem are you trying to solve with this feature?

The feature addresses the challenge of extracting meaningful insights from lengthy video transcripts. When dealing with large amounts of text, it's difficult to efficiently process and query the information without running into API context limits or making redundant calls. This solution tackles that problem by segmenting and summarizing the transcript, caching context to reduce unnecessary API usage, and maintaining conversation history to answer interconnected questions accurately.

Demonstration of the Current Gemini Video Analysis Solution

In this demonstration, I use Gemini to analyze an almost two-hour-long video and then ask it questions. The system returns responses asynchronously in under one second.

https://github.com/user-attachments/assets/2e8836b8-535f-4c88-8539-c551266ccabe
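
For reference, the asynchronous querying shown in the video could take roughly this shape with the SDK's async client. This continues from the cache created in the sketch above, and the question text is made up:

```python
import asyncio
from google import genai
from google.genai import types

client = genai.Client()

async def ask_all(cache_name: str, questions: list[str]) -> list[str]:
    # Send every question concurrently against the same cached video context.
    tasks = [
        client.aio.models.generate_content(
            model="gemini-2.0-flash-001",
            contents=question,
            config=types.GenerateContentConfig(cached_content=cache_name),
        )
        for question in questions
    ]
    responses = await asyncio.gather(*tasks)
    return [response.text for response in responses]

# cache.name comes from the caching sketch above.
answers = asyncio.run(ask_all(cache.name, ["At what timestamp is topic X introduced?"]))
```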

william-Dic avatar Mar 24 '25 16:03 william-Dic

Check out this pull request on ReviewNB: see visual diffs and provide feedback on Jupyter Notebooks.

Thanks a lot @william-Dic. I'll check internally that everything is fine on our side and will get back to you.

Giom-V avatar Mar 24 '25 17:03 Giom-V

Hello @william-Dic, I think the example is interesting, but now that we have the YT integration, wouldn't it be easier to just use chat mode, load the video, and ask questions? What's the added value of your workflow?

Giom-V avatar Mar 31 '25 13:03 Giom-V

How does the billing on this play out? Say I have a 1M-token db and I want to make 1k tiny requests against it, each of which generates a tiny answer. The naive way is to include the db in the input prompt, for a total of 1M * 1k * input_token_cost. If instead I cache it, I would expect to get charged 1M * 1k * cached_token_cost, which is 1/4 of the naive cost (plus a few dollars an hour for them to hang onto the cache). If I cache the db and batch the 1k prompts, does that halve the cost again, so the whole thing is 1/8th as expensive as the naive implementation?
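
Purely as a back-of-the-envelope sketch of the arithmetic above (the 1/4 cached-token rate and 50% batch discount are the ratios assumed in the question; actual pricing and cache-storage fees depend on the current rate card):

```python
db_tokens = 1_000_000
requests = 1_000
input_cost_per_token = 1.0   # arbitrary unit; only the ratios matter

naive   = db_tokens * requests * input_cost_per_token
cached  = naive * 0.25       # cached tokens assumed billed at 1/4 of the input rate
batched = cached * 0.50      # batch mode assumed to add a further 50% discount

print(batched / naive)       # 0.125, i.e. 1/8 of the naive cost
```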

JonathanMShaw avatar Apr 03 '25 19:04 JonathanMShaw