embed-vtt
generate & query embeddings from VTT files using OpenAI & Pinecone on Andrej Karpathy's latest GPT tutorial
Embed-VTT ✨
This repo uses OpenAI embeddings and the Pinecone vector database to generate and query embeddings from a VTT file.
The purpose of this repo is to implement semantic search as an extra resource for understanding Andrej Karpathy's latest video, Let's build GPT; however, it is general enough to use with any transcript.
Shoutout to Miguel's yt-whisper library for helping with the YouTube transcription. The data/ in this repo was generated using the small model.
Setup
Install
pip install -r requirements.txt
Environment
cp .env.sample .env
You'll need API keys from OpenAI and Pinecone:
OPENAI_KEY=***
PINECONE_KEY=***
PINECONE_ENVIRONMENT=*** (Optional)
Pinecone
Head over to Pinecone and create an index with dimension 1536, which matches the size of OpenAI's text embedding vectors.
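If you prefer to create the index from code rather than the Pinecone console, something like the sketch below should work with the pinecone-client library. The index name "embed-vtt" and the cosine metric are assumptions, not taken from this repo:

```python
# Hypothetical sketch: create a Pinecone index for the embeddings.
# The index name and metric below are assumptions, not from this repo.
import os

INDEX_NAME = "embed-vtt"   # hypothetical index name
EMBEDDING_DIM = 1536       # dimension of OpenAI's text embedding vectors

def index_config(name=INDEX_NAME, dimension=EMBEDDING_DIM, metric="cosine"):
    """Parameters to pass to pinecone.create_index."""
    return {"name": name, "dimension": dimension, "metric": metric}

# Only attempt the network call when credentials are configured.
if os.environ.get("PINECONE_KEY"):
    import pinecone  # pip install pinecone-client
    pinecone.init(
        api_key=os.environ["PINECONE_KEY"],
        environment=os.environ.get("PINECONE_ENVIRONMENT", "us-east1-gcp"),
    )
    pinecone.create_index(**index_config())
```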
Data
The data in this repo was generated from Let's build GPT using yt-whisper:
- data/karpathy.vtt - contains the raw VTT file
- data/karpathy_embeddings.csv - contains the dataframe with the embeddings; you can use this file to directly seed your Pinecone index
Usage
Generate Embeddings from VTT file
This will save an embeddings CSV file as {file_name}_embeddings.csv:
python embed_vtt.py generate --vtt-file="data/karpathy.vtt"
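Under the hood, a generate step like this has to split the VTT file into timestamped cues before embedding each chunk of text (e.g. with openai's embedding endpoint). The sketch below illustrates the parsing step only; it is a minimal stand-in, not the repo's actual parser:

```python
# Minimal sketch: parse a WEBVTT document into (start, end, text) cues.
# This illustrates the general approach, not this repo's actual code.
import re

CUE_TIME = re.compile(
    r"(\d{2}:\d{2}:\d{2}\.\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2}\.\d{3})"
)

def parse_vtt(vtt_text):
    """Return a list of (start, end, text) tuples from a WEBVTT string."""
    cues = []
    lines = vtt_text.splitlines()
    i = 0
    while i < len(lines):
        m = CUE_TIME.search(lines[i])
        if m:
            # Collect the cue's text lines until the next blank line.
            text_lines = []
            i += 1
            while i < len(lines) and lines[i].strip():
                text_lines.append(lines[i].strip())
                i += 1
            cues.append((m.group(1), m.group(2), " ".join(text_lines)))
        else:
            i += 1
    return cues

sample = """WEBVTT

00:54:52.240 --> 00:55:01.440
But let me talk through it. It uses softmax.
"""
```

Each parsed chunk would then be sent to the embeddings API and written out alongside its timestamps.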
Upload Embeddings from a CSV Embedding file
python embed_vtt.py upload --csv-embedding-file="data/karpathy_embeddings.csv"
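Uploading comes down to turning each CSV row into the (id, vector, metadata) tuples that Pinecone's Index.upsert accepts. The column names below ("text", "embedding", "start", "end") are assumptions about the CSV layout, not confirmed from the repo:

```python
# Hypothetical sketch: convert embedding rows into Pinecone upsert tuples.
# The row schema (text/embedding/start/end keys) is an assumption.
def to_upsert_tuples(rows):
    """rows: iterable of dicts with 'text', 'embedding' (list of floats),
    'start', and 'end'. Returns (id, vector, metadata) tuples."""
    return [
        (str(i), row["embedding"],
         {"text": row["text"], "start": row["start"], "end": row["end"]})
        for i, row in enumerate(rows)
    ]

rows = [{"text": "It uses softmax.", "embedding": [0.1, 0.2],
         "start": "00:54:52.240", "end": "00:55:01.440"}]
vectors = to_upsert_tuples(rows)
# With a live index: pinecone.Index("embed-vtt").upsert(vectors=vectors)
```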
Query Embeddings from text
python embed_vtt.py query --text="the usefulness of trill tensors"
Sample output:
0.81: But let me talk through it. It uses softmax. So trill here is this matrix, lower triangular ones. 00:54:52.240-00:55:01.440
0.81: but torches this function called trill, which is short for a triangular, something like that. 00:48:48.960-00:48:55.920
0.80: which is a very thin wrapper around basically a tensor of shape vocab size by vocab size. 00:23:17.920-00:23:23.280
0.79: I'm creating this trill variable. Trill is not a parameter of the module. So in sort of pytorch 01:19:36.880-01:19:42.160
0.79: does that. And I'm going to start to use the PyTorch library, and specifically the Torch.tensor 00:12:54.320-00:12:59.200
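The leading numbers in the output (0.81, 0.80, ...) are the match scores Pinecone returns for each chunk; assuming the index uses the cosine metric, each score is the cosine similarity between the query embedding and the chunk embedding. A minimal sketch of that computation:

```python
# Sketch: cosine similarity, the likely source of the 0.81 / 0.80 scores
# above (assuming the Pinecone index was created with metric="cosine").
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Parallel vectors score 1.0; orthogonal vectors score 0.0.
print(round(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]), 2))
```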
License
This script is open-source and licensed under the MIT License.