
MSc placeholder: exploring LLM as a database

Open synctext opened this issue 2 years ago • 45 comments

Placeholder for brainstorming. Finished all master courses (with a part-time side job). Exploring for 1 month what a good master thesis direction around LLMs is.

Draft master thesis (again placeholder): Adding memory to LLM and large-scale ingestion of facts

Recommended paper to understand your thesis context and goal further: with resources donated by volunteers it is possible to build a giant foundational model. See Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts.

Candidate codebases to start from:

  • With 22k stars this is more popular: https://github.com/imartinez/privateGPT. LLM: defaults to [ggml-gpt4all-j-v1.3-groovy.bin](https://gpt4all.io/models/ggml-gpt4all-j-v1.3-groovy.bin); if you prefer a different GPT4All-J compatible model, just download it and reference it in your .env file.
  • A possible starting point is the Vicuna enhancement as a database: https://github.com/csunny/DB-GPT. "In addition, we provide private domain knowledge base question-answering capability through LangChain. Furthermore, we also provide support for additional plugins, and our design natively supports the Auto-GPT plugin."
  • Third option: NanoGPT. "The simplest, fastest repository for training/finetuning medium-sized GPTs. It is a rewrite of [minGPT](https://github.com/karpathy/minGPT) that prioritizes teeth over education. Still under active development, but currently the file train.py reproduces GPT-2 (124M) on OpenWebText."
  • Fourth, smaller than medium{nano}: https://github.com/Lightning-AI/Lit-Parrot. "Hackable implementation of state-of-the-art open-source large language models."

Concrete ToDo:

  • hardware, DAS6 with RTX A4000 GPU account in future
  • software and model setup
  • first changes and enhancements (trivial changes are OK at this early stage of the master thesis)
  • Understand adding memory state through LangChain (see the sketch after this list)
  • Get this running ??? https://python.langchain.com/en/latest/modules/memory/getting_started.html#conversationbuffermemory
  • after 1 month we write down the draft master thesis goal (modify as needed)
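
As a first hands-on step for the "memory state through LangChain" item above, here is a minimal sketch of the ConversationBufferMemory idea from the linked getting-started page (assuming a 2023-era langchain install; the exact import path may differ between releases):

```python
# Minimal sketch: LangChain's ConversationBufferMemory keeps the raw chat history in a buffer.
# Assumes `pip install langchain`; import paths can shift between langchain versions.
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory()
memory.save_context({"input": "Hi, I am looking for a movie trailer"},
                    {"output": "Sure, which movie are you interested in?"})

# The buffered history can then be injected into the next LLM prompt.
print(memory.load_memory_variables({}))
# e.g. {'history': 'Human: Hi, I am looking for a movie trailer\nAI: Sure, which movie are you interested in?'}
```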


Please register here: https://mare.ewi.tudelft.nl/

synctext avatar May 24 '23 07:05 synctext

I explored DB-GPT a bit with Vicuna-7B. But it didn't work well on my local laptop due to the RAM limit (30GB required), and the model was running on my CPU (this model somehow could not run on CUDA due to configuration). A further investigation could be:

  • use smaller models like ChatGLM
  • moving the llmserver to the cloud and connecting to that service from the local GUI

The computing resource I have access to:

  • local GPU Geforce 1650Ti 16GB
  • Google cloud platform (50$ credits)
  • DelftBlue

keonchennl avatar Jun 01 '23 12:06 keonchennl

For now the simplest option around seems to be nanoGPT. Simplicity is always the superior starting point for extreme decentralisation. Thus this seems like a good start for fully using an LLM as a database, combined with decentralisation or local-only operation.

An alternative to a huge SQL database with BM25 search: the data is tokenised and transformed into an LLM. The idea is that it might have some superior properties compared to the old SQL approach. For instance, decentralised learning with a network of 1+ million Android phones. Think TikTok scale and popularity.

Concrete proposed ToDos:

  • get NanoGPT working on both CPU and your GPU
  • this seems nice to explore: A crude RLHF (Reinforcement Learning from Human Feedback) layer on top of nanoGPT
    • Gumbel-Softmax trick
  • find some blogs and try to reproduce their results
    • https://www.linkedin.com/pulse/reviving-micheal-jackson-nanogpt-1st-scrappy-attempt-malick/
    • https://www.dolthub.com/blog/2023-02-20-exploring-nanogpt/
    • https://iterative.ai/blog/mlem-nanogpt-modal-flyio
    • NanoGPT in the news with many comments
  • Difficult step for the future: how to add this data as a database: https://www.kaggle.com/datasets/jkkphys/english-wikipedia-articles-20170820-sqlite/code with "loss-free compression"
    • prompt: "show wikipedia article about Pyramids". Answer: complete wikipedia article
    • overfitting as a feature?
    • somehow adding langchain ?
  • Safe master thesis backup direction: useful overfitting for image compression: https://towardsdatascience.com/how-to-create-a-concise-image-representation-using-machine-learning-20156c1e0c19

EDIT: for decentralised learning it is required that we update (e.g. instruction fine-tuning) the model on laptops or even smartphones. Qualcomm is aiming to support this. (Another backup direction: take an open-source LLM which supports inference on Android and provide first-class support for adding a single new training item. The use-case is content discovery, a decentralised search engine, or (TikTok-like) content recommendation; the newly added item takes the form of a tuple: (content item, URL).)

synctext avatar Jun 01 '23 12:06 synctext

Some inspiration: https://arxiv.org/pdf/2210.06280.pdf

bacox avatar Jun 01 '23 13:06 bacox

Thesis introduction: we know that 1 billion SQL servers are a problem. Technology like BitTorrent and Bitcoin scales without effort to 1 billion peers. LLMs mostly run on servers, with only minor on-device or decentralised approaches. This thesis investigates scaling LLMs to a billion devices.

An instruction-tuned PaLM model (1.5 billion parameters) can be converted to TFLite and executed through the TFLite runtime {PaLM model}.

Example of a manual dataset for a video search engine alternative to Google, Youtube, and TikTok

| URL | Description |
| --- | --- |
| https://www.tiktok.com/music/Say-It-Right-Sped-Up-Remix-7041921629911304962 | Sorrel Horse Dancing to "Say It Right" |
| https://youtu.be/eogpIG53Cis | Blade Runner (1982) Official Trailer - Ridley Scott, Harrison Ford Movie |
| https://youtu.be/vKQi3bBA1y8 | The Matrix (1999) Official Trailer #1 - Sci-Fi Action Movie |
| https://youtu.be/k64P4l2Wmeg | The Terminator (1984) Official Trailer - Arnold Schwarzenegge Movie |
| https://youtu.be/bwcADuJZDNA | Mad Max: The Road Warrior |
| https://www.decayfilm.com/static/files/Decay_2012_1080p.torrent | DECAY is a zombie film made and set at the LHC |
| https://webtorrent.io/free-torrents | public domain and Creative Commons torrents |
| magnet:?xt=urn:btih:08ada5a7a6183aae1e09d831df6748d566095a10 | Sintel |
| (NON_CLICKABLE_magnet_URL) | Big Buck Bunny |
| (NON_CLICKABLE_magnet_URL) | Cosmos Laundromat |
| (NON_CLICKABLE_magnet_URL) | Tears of Steel |

Brainstorm on thesis direction:

  • PrivateGPT: a full 9-month master thesis of performance evaluation: time to add facts, time to train, time to fine-tune, time to ingest bulk facts, insert time per GByte, inference speed, insert time with 4, 8 or 16 cores, etc. {low-risk direction of thesis}
  • Build a search engine using an LLM. Always present a URL for a given query. Optimize for this use-case. Only output data that is included inside the training dataset of URLs!?! {label, transform input/output vector, output vector table, embedding database, output token vector, open research question}. LangChain, NanoGPT fact ingestion
  • Mobile search engine. Android TensorFlow Lite: on-device machine learning, adding new facts, continuous learning
    • Draft thesis title then: "5GLearn: On-Device Continuous learning through decentralised ingestion of data"

Updates:

  • Update 1: Chroma seems to do the heavy lifting inside PrivateGPT: see code and see tutorial example here. Please try to understand how things work!
  • Update 2: more TFLite example code. On-device text generation using GPT-2 or DistilGPT2 (same distillation process as DistilBERT, 2x faster and 33% smaller than GPT-2).
  • Update 3: Hivemind is a PyTorch library for decentralized deep learning across the Internet. Its intended usage is training one large model on hundreds of computers from different universities, companies, and volunteers.
  • Update 4: tokens for embedding and unembedding; can we hack an entire URL as a token (see the sketch below)? The unembedding matrix, which in our case computes the left inverse of the embedding matrix, $(W_E)^{-1}$, is of size $768 \times 50000$. Related data: 20k Youtube URLs to official music videos, and also the 8M Youtube videos analysis dataset. {Personal note: easy to create a WEB3 browser using webview. With decentralised learning it should be possible to use semantic clustering to reduce the impact of the strict 50k-token limit. With personalisation each node is aware of others with similar taste and knows dissimilar peers. All these unique 50k tables create a giant (unbounded) virtual token table.}
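
To make the "hack an entire URL as a token" idea from update 4 concrete, here is a hedged sketch (my own illustration, not an agreed design) using the Hugging Face GPT-2 tokenizer; adding a token grows both the embedding and unembedding matrices by one row:

```python
# Sketch: register one URL as a single dedicated token in a GPT-2 style vocabulary.
# Assumes `pip install transformers`; the URL below is just the Terminator trailer example.
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

num_added = tokenizer.add_tokens(["https://youtu.be/k64P4l2Wmeg"])
model.resize_token_embeddings(len(tokenizer))  # embedding/unembedding grow by `num_added` rows

ids = tokenizer("The Terminator trailer can be found at https://youtu.be/k64P4l2Wmeg")["input_ids"]
print(num_added, ids[-1] == len(tokenizer) - 1)  # the URL is now encoded as one (the newest) token
```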

synctext avatar Jun 19 '23 08:06 synctext

  • I have nanoGPT running on my local env. It works both on my GPU and CPU.


The pretrained part of the GPT2 model (baseline) is from https://huggingface.co/gpt2

In PrivateGPT, the custom sources fed into the ingestion script https://github.com/imartinez/privateGPT/blob/main/ingest.py are mainly the text extracted from the input documents (e.g. pptx, pdf).

keonchennl avatar Jun 19 '23 09:06 keonchennl

Discussed the idea of "tokenize the URL" again. The embedding contains a static URL list, with one-hot encoding. Normally a generative model only hallucinates URLs.

URL2Vec: AI crisis for copyright monopolies

{Possible thesis brainstorm} Many have written about the ongoing copyright crisis in the creative industry due to generative AI. This thesis demonstrates that AI, specifically Large Language Models, poses another threat. We build upon breakthroughs in on-device machine learning and embedding to create a decentralised Google-ish search engine.

We present a tool which is able to learn online URLs for Youtube, TikTok, Bittorrent, and IPFS. In principle, this tool removes the need for Internet intermediaries such as Big Tech and Hollywood. Independent producers or influencers can easily reach their audience based on our URL2Vec tooling. This will put further pressure on the legal construct of copyright.

Our starting point is the KerasNLP library by Google. This model supports text completion with on-device machine learning. We crafted a decentralised search engine by building upon state-of-the-art pretrained models for natural language processing tasks and adding support for a custom tokenizer with URL understanding.

Related work to read: https://blog.reachsumit.com/posts/2023/05/tuning-llm-for-recsys/#instruction-finetuned-llms-for-recommendations

Naive ToDo list for starting experiments:

  • start with NanoGPT
  • Get training going for 24h on the classical Shakespeare database
  • modify the tokenizer to encode URL as 1 token
  • fine-tuning NanoGPT with 1 magic extra line The Terminator (1984) Official Trailer - Arnold Schwarzenegge Movie can be found at https://youtu.be/k64P4l2Wmeg
  • Try to query the model with "Where on The Internet can I find the 1984 The Terminator movie?" or something

synctext avatar Jul 10 '23 14:07 synctext

Working from the "Naive ToDo" list, concrete steps toward publishable results could be the following (a toy sketch follows the list):

  1. Adapt https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html to create AI that can convert token sequence -> linear (i.e., 1 magnet link)
  2. Add NanoGPT to this model for NL -> token sequence -> linear (i.e., 1 magnetlink)
  3. Train this and see what happens.
  4. Use RNN instead of a linear layer for NL -> token sequence -> generated magnetlink (20 bytes/160 bits output)
  5. Train this new model and see if it is better than the results from step 3.
  6. Publish results?
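
A toy PyTorch sketch of the shape of steps 1 and 4 (my own simplification, not the final design): a token sequence is encoded and a head emits 160 bit-probabilities, i.e. one 20-byte magnet hash.

```python
# Toy sketch: map a tokenised query to a fixed 160-bit (20-byte) output.
# Vocabulary size, dimensions, and the LSTM encoder are arbitrary placeholder choices.
import torch
import torch.nn as nn

class TokensToMagnet(nn.Module):
    def __init__(self, vocab_size=30000, embed_dim=128, hidden_dim=256, out_bits=160):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, out_bits)   # one logit per output bit

    def forward(self, token_ids):
        x = self.embed(token_ids)                     # (batch, seq, embed_dim)
        _, (h, _) = self.encoder(x)                   # h: (1, batch, hidden_dim)
        return torch.sigmoid(self.head(h[-1]))        # (batch, 160) bit probabilities

model = TokensToMagnet()
dummy_query = torch.randint(0, 30000, (1, 12))        # stand-in for a tokenised NL query
predicted_bits = (model(dummy_query) > 0.5).int()     # thresholded 160-bit "magnet hash"
print(predicted_bits.shape)                           # torch.Size([1, 160])
```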

qstokkink avatar Jul 12 '23 14:07 qstokkink

It seems my idea for a comparison (between transformers and RNNs) has been performed before: https://arxiv.org/pdf/2005.09471.pdf Instead of natural-language next-word prediction, you would be investigating next-word prediction of a fixed-size resource, but this is probably good related work to reference.

qstokkink avatar Aug 15 '23 09:08 qstokkink

Open LLM challenges. Great background read for writing introduction and citations for Problem Description: https://huyenchip.com/2023/08/16/llm-research-open-challenges.html

synctext avatar Aug 17 '23 08:08 synctext

  • The guiding query for the entire master thesis? Query: "Where on The Internet can I find the 1984 The Terminator movie trailer?"

  • assume a static list of internet URLs, no new knowledge

  • this tutorial prepares for the complexity of nanoGPT: https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html

  • NanoGPT uses positional encoding: weights are assigned to the position of the terms.

  • Selected dataset for coming months. Most simple step with Youtube URLs dataset. Only two columns: title and Video ID. https://www.kaggle.com/datasets/datasnaek/youtube-new?select=USvideos.csv

  • Upcoming sprint outline

    • Most simple step with the Youtube URLs dataset: https://www.kaggle.com/datasets/datasnaek/youtube-new?select=USvideos.csv This scraped table is translated into the most simple natural-language form for text input into NanoGPT (6351 unique input lines; 1 for each unique video_ID {11 bytes}); see the sketch after this list:
      • The Youtube video titled "WE WANT TO TALK ABOUT OUR MARRIAGE" can be found at https://www.youtube.com/watch?v=2kyS6SvSYSE
      • the Youtube video titled "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" can be found at https://www.youtube.com/watch?v=1ZAPwfrtAFY
      • the Youtube video titled "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" can be found at https://www.youtube.com/watch?v=5qpjK5DgCt4
      • the Youtube video titled "Nickelback Lyrics: Real or Fake?" can be found at https://www.youtube.com/watch?v=puqaWrEC7tY
    • Just a lookup. Modify the two lines of encoding/decoding inside NanoGPT to do embedding of Youtube URLs.
  • Only after this is operational do we take the next step: generative AI. We use the most simple approach of the token-ID-plus-token-string embedding as the baseline. Then we compare various queries and further work on improving our dataset. This looks like sufficient depth for a Delft University master thesis :clap: :confetti_ball: :clap:

  • Basic transformer and NanoGPT tutorial: required preliminaries.

  • In Sep/Oct we focus on generative AI. Generate from scratch and pick from a huge list. "Generative AI against URL hallucinations" as a master thesis title idea. Actually model the magnet link with the 20 bytes of its SHA1 hash (160 bits). Generate the 160 bits in the generative AI at the neuron level. Next step: a sequence model and next-token prediction, where the first bytes of a magnet link predict the remainder of the URL. Idea by @qstokkink. Warning: the magnet link is already difficult and sufficient for a master thesis. A general approach for any variable-sized URL (TikTok URL, Youtube, IPFS link, magnet link) is out of scope. {Note for the future: bigger dataset of 20k Youtube URLs to official music videos, also the 8M Youtube videos analysis dataset.}

  • Please do an issue update for next meeting, screenshot, progress and dataset.
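
A small sketch of the data-preparation step referenced above (my own helper, assuming the Kaggle CSV has 'title' and 'video_id' columns): it turns USvideos.csv into the natural-language lines quoted in the list.

```python
# Sketch: build the natural-language training lines from the Kaggle USvideos.csv file.
import pandas as pd

df = pd.read_csv("USvideos.csv")
df = df.drop_duplicates(subset="video_id")   # roughly the 6351 unique video IDs mentioned above

lines = [
    f'The Youtube video titled "{row.title}" can be found at '
    f"https://www.youtube.com/watch?v={row.video_id}"
    for row in df.itertuples(index=False)
]
with open("usvideos_sentences.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines))
```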

synctext avatar Aug 21 '23 07:08 synctext

Some progress has been made:

  • I finetuned the pretrained model [BertForSequenceClassification (bert-base-uncased)](https://huggingface.co/docs/transformers/v4.32.1/en/model_doc/bert#transformers.BertForSequenceClassification) on the USVideos dataset (a minimal sketch of this setup follows the list).
    • The notebook including results can be found here
    • An index was used as the label instead of one-hot encoding, since the model expects one value instead of a vector as the label
    • The training was initially on my local env. After 2 epochs the model was able to predict 63.5% of video ids correctly, including the 4 given video titles.
    • Later the training was moved to Colab for better training. A weird thing happened: the performance dropped after more steps.
  • GPT2 was also tried for this task but couldn't be trained due to the memory limit of my local environment.
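
A minimal sketch of the setup described in this list (my own reconstruction, not the notebook's exact code): every unique video_id becomes a class index and BertForSequenceClassification predicts that index from the title.

```python
# Sketch: fine-tuning setup with integer class labels (one class per video_id).
# Assumes `pip install transformers torch`; the two example rows are taken from this thread.
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

titles = ["WE WANT TO TALK ABOUT OUR MARRIAGE", "Nickelback Lyrics: Real or Fake?"]
video_ids = ["2kyS6SvSYSE", "puqaWrEC7tY"]

id2label = {i: vid for i, vid in enumerate(sorted(set(video_ids)))}
label2id = {vid: i for i, vid in id2label.items()}

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(id2label))

batch = tokenizer(titles, padding=True, return_tensors="pt")
labels = torch.tensor([label2id[vid] for vid in video_ids])  # index labels, not one-hot vectors
out = model(**batch, labels=labels)
print(out.loss, out.logits.argmax(dim=-1))  # loss to backprop, and the predicted class indices
```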

Some reflections:

  • We now only care about the 'look-up' result. In this way, we are basically using the training data itself to test the result. The training error should keep dropping with more data; however, this is not the case in the current experiment.
  • The model checkpoint is about 435MB, which is much larger than the dataset (~66MB including all other columns). If we use a larger training dataset with more labels, the model checkpoint will get slightly bigger due to the larger dense layer for classification, but the increase won't be proportional to the size of the data or the number of labels. So if we use a bigger dataset in the future, the model might end up smaller than the dataset, which could be the compression we want?

keonchennl avatar Sep 03 '23 22:09 keonchennl

  • Using LLM as a database seems to work!
  • After 2 epochs the model was able to predict 63.5% of video ids correctly, including the 4 given video titles. :clap: :confetti_ball:
    • amazing success after only a few weeks of exploring with the magic of AI
    • Congrats with 63.5% recall !!
    • Few hours of training, local PC
    • 6351 unique values in USvideos.csv. You have 40949 items in youtube_video_id_predictor.ipynb?
  • Solid thesis outcome: "abusing" LLM as a database!! Acceptable, even if there is data expansion and only 63% recall.
  • Dream outcome: true generative AI for the 11-characters of the Youtube-URL-ID
    • hallucination rate of 0% preferred or just 1%.
  • Next sprint: try to improve the 63.5% for 2-3 weeks.
    • understand what works and what tricks fails.
    • Is the data sufficiently clean?
    • Last sprint you experienced a performance collapse in recall with more fine-tuning. Put in graphs. Can you explain this?
    • Possibly have a graph next meeting, issue update.

Update with refs (no need to alter your thesis direction, just a note on related work): Recent advances in retrieval-augmented text generation, plus an intro for that: https://blog.lancedb.com/llms-rag-the-missing-storage-layer-for-ai-28ded35fa984

synctext avatar Sep 04 '23 09:09 synctext

  • Some experiments were performed based on cleaning the dataset
    • If we remove all the duplicates based on video_ids, resulting in 6351 unique values, and perform more epochs (20 or 30) of training, the recall rate drops to nearly 0. The training error almost did not drop.
    • Similar results were also seen when removing duplicates based on the 'title' column
  • However, when I used the original data containing duplicates for training, I was able to achieve a recall of 96.19%
    • The training error dropped drastically after 8 epochs of training
    • This suggests overfitting, since the duplicates in the original data may contribute to faster convergence.
  • Some findings on the related work:

keonchennl avatar Sep 15 '23 15:09 keonchennl

  • Great milestone! Thesis has completed the risky exploratory phase. The idea seems to be working. Operational unembedding matrix, convergence, and running code with first initial results. Still lots of hard work left obviously.
  • Spend 1 week on why the 6351-sample set fails to converge while the 40949-sample set already converges from 1k to 2k steps.
  • Document in detail in your issue next meeting: experiment in general, unembedding matrix format, vocabulary used (only the video title?), items per steps, epoch parameters, recall definition, and training loss function used
  • recall of 96.19% huge improvement from 63.5%. Great progress :clap:
    • input: a title from the dataset. It produces a random or the valid Youtube URL 96.19% of the time. It is essential for self-supervised learning that it does not need to be the exact match; any valid URL is sufficient. A fuzzy-matching feature from query words to a Youtube URL.
    • The goal is not exact Youtube-URL-to-title matching. Please train and test the recall also on random dictionary word inputs. For instance, make a Youtube_Dictionary.txt file for next meeting (see the sketch after this list) and train on the recall of one or several words. It should produce any valid Youtube URL.
  • Sprint focus: understand, explain, and document. Architecture picture v1, for master thesis. No new improvements please. Cleanup existing colab code
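
A possible sketch for the Youtube_Dictionary.txt request above (hypothetical helper; the column name and the crude tokenisation are assumptions): collect every word that occurs in the dataset's titles.

```python
# Sketch: extract all words used in the video titles into Youtube_Dictionary.txt.
import re
import pandas as pd

df = pd.read_csv("USvideos.csv")

words = set()
for title in df["title"].astype(str):
    words.update(re.findall(r"[a-z']+", title.lower()))   # crude word tokenisation

with open("Youtube_Dictionary.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sorted(words)))                      # optionally drop stop words later
```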

synctext avatar Sep 25 '23 10:09 synctext

I made little progress this sprint, unfortunately. I reformatted the notebook [Notebook] and will try to see how the following issues may influence the result:

  • Video-id duplicates: removing all video-id duplicates may still leave many title duplicates, which may have caused the previous non-converging training curve
  • Title duplicates: not checked yet
  • 2 different titles might generate the same embeddings; if so, this will affect the results
  • Look into the embeddings and see the differences there directly
  • Use a simpler model instead of BERT for the embeddings

keonchennl avatar Oct 11 '23 09:10 keonchennl

  • ~~Spend 1 week why the 6351 fails to convergence and the 40949 with convergence already from 1k to 2k steps.~~
    • ignore this issue for coming sprint
    • always use the duplicates
    • just work with the latest code that runs and converges
    • focus on moving forward
  • Related work update: Why AutoGPT engineers ditched vector databases
    • numerous startups focus on Vector Databases
    • the first negative story: in reality this does not work, because of memory
      • by these people https://github.com/Significant-Gravitas/AutoGPT
  • {repeating} The goal is not exact youtube URLs to title matching. Please train and test the recall also on random dictionary word inputs. For instance, make a Youtube_Dictionary.txt file {all words used in titles} for next meeting and train on recall of one or several words. Should produce any valid Youtube URL.
    • 0% hallucination? (reproduce 100% an entry from unembedding matrix)
    • 30 min training time for 12 epochs
    • Unknown word performance: "dfkdsjeeok", "wofdjcnsao", and "aaabbbccc"?
  • investigate an alternative to the BERT embedding
    • compare the training loss of Word2Vec and BERT in 1 graph?
  • Future sprint ideas: visualise the vector space ?

synctext avatar Oct 11 '23 09:10 synctext

  • Sick for a week
  • Code clean up
  • Made the training work in the new notebook Notebook
  • TensorBoard seems not to work yet in Colab. The training graph is drawn manually by reading out the results after training
  • Added an interactive cell for executing predictions easily

keonchennl avatar Nov 01 '23 12:11 keonchennl

  • Please write a progress update before the meeting, this did not happen multiple times.
  • Keep laser-sharp focus on progress. Why did you revisit the non-duplicates non-converging approach?
  • Tutorial: BERT used for the Youtube title; LabelBinarizer() for the Youtube video IDs using one-hot encoding (see the sketch after this list).
  • 30min-1h to check if your code is not broken.
    • Change your work approach: only make small changes
    • limited to a few hours of changes
    • ensure every day your notebook still works
    • Use integration testing. Add "cat", "funny", etc. video tests to produce a valid URL and title. :heavy_check_mark: or :x:
    • Automated recall-rate test: input the "title of a Youtube video", output the exact Youtube Video-ID. Measure the error.
    • {repeating} make a Youtube_Dictionary.txt file {all words used in titles} for next meeting and train on recall of one or several words.
  • Size of the tokenizer used: BERT-base-uncased
    • https://huggingface.co/bert-base-uncased#preprocessing
    • The texts are lowercased and tokenized using WordPiece and a vocabulary size of 30,000.
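
For reference, a tiny sketch of the LabelBinarizer() step mentioned above (scikit-learn assumed; the three IDs are just examples from this thread):

```python
# Sketch: one-hot encoding of Youtube video IDs with scikit-learn's LabelBinarizer.
from sklearn.preprocessing import LabelBinarizer

video_ids = ["2kyS6SvSYSE", "1ZAPwfrtAFY", "5qpjK5DgCt4"]

lb = LabelBinarizer()
onehot = lb.fit_transform(video_ids)        # shape (3, 3): one row per sample, one column per class
print(lb.classes_)                          # sorted class order used for the columns
print(lb.inverse_transform(onehot[:1]))     # maps the first row back to '2kyS6SvSYSE'
```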

synctext avatar Nov 01 '23 12:11 synctext

Amazing related work by Google Research, found by our PhD student Petru: https://github.com/Tribler/tribler/issues/7586#issuecomment-1790956120 Transformer Memory as a Differentiable Search Index. The paper argues that instead of using a dual-encoder method (where we encode the query and the document in the same space and then find the document which is the nearest neighbour to the query) we can use the differentiable search index (DSI), where a neural network maps the query directly to the document. The paper presents a number of methods to achieve this, but the easiest one to implement for me at this time was to simply assign each document one number, have the output layer of the network be composed of the same number of neurons as the number of documents, and make the network essentially assign probabilities to each document, given a query. Additionally, the paper performs this work with a Transformer architecture, raising the possibility of integrating NanoGPT into the future architecture.

Even more related work for intro + problem description: https://github.com/vectara/hallucination-leaderboard

synctext avatar Nov 08 '23 13:11 synctext

Dictionary extracted from the titles of the US videos dataset

dictionary_title_with_stop_words.txt dictionary_title_without_stop_words.txt

Investigation of the broken code (Notebook)

  1. Fixed a bug in the dataset class
  2. Things were tried to check why the result of the best model (with 96% recall) could not be reproduced.
    - It turns out that the training still works, but the data used to calculate the evaluation score was not the training data.
    - A subset of the dataset (32759 samples) was used for training the model, while the whole dataset (40949 samples) was used for evaluation.
    - This happened due to the exploration of dataset splitting and de-duplication.
  3. I was able to reproduce the 96% recall using the same subset of the data (32759 samples).
    - The best model can be found here.
    - The data for reproduction can be found here, or can be retrieved via an 80/20 split with a random state of 42 (see the notebook).

Findings

  1. With the same training data, the model has a high chance of not converging because of the randomness in the training process. 5 experiments were performed, but only 1 had its loss drop below 7.5.
  2. With the best model, the performance given the whole title is good, but the fewer words we give, the worse the performance gets. For example, if we give 'cat', it can hardly predict a title that contains 'cat'.
  3. I checked the Differentiable Search Index (DSI) approach. Fine-tuning a BertForSequenceClassification (encoder + a classification layer) looks a bit similar to the DSI approach that paper proposed. Perhaps it is worth looking into applying an (encoder + decoder) seq-to-seq model.
  4. The metric now compares against the exact title. Maybe I should involve other relevance metrics, such that the 'cat' example works well.

Experiments with Word2Vec

  1. I explored starting with word2vec trained from scratch: word2vec => vectors => nearest neighbour => pick the closest (a minimal sketch of this pipeline follows).
    - The notebook can be found here.
    - To represent the video better, words from the description and tags are also included for training.
    - Different hyperparameters were tried.
    - The best recall we get so far is 18.67%.
    - The exact-title prediction gives bad performance, but the one-word prediction looks better than with the BERT model.
  2. Then, rather than training from scratch, the Google News negative-300 model was also tried (Notebook): bad performance as well (<1% recall); an index issue; a rare-word issue, e.g. 'aquarius' is not in the vocabulary.
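
A minimal sketch of the pipeline in item 1 (my own toy version, not the notebook's code; gensim assumed): average the word vectors of a title, then return the nearest stored title's video_id.

```python
# Toy sketch: word2vec => vectors => nearest neighbour => closest video_id.
import numpy as np
from gensim.models import Word2Vec

titles = {
    "2kyS6SvSYSE": "we want to talk about our marriage",
    "puqaWrEC7tY": "nickelback lyrics real or fake",
}
corpus = [t.split() for t in titles.values()]
w2v = Word2Vec(corpus, vector_size=50, min_count=1, epochs=50)

def embed(text):
    vecs = [w2v.wv[w] for w in text.split() if w in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

title_vecs = {vid: embed(t) for vid, t in titles.items()}

def search(query):
    q = embed(query)
    # crude dot-product similarity; cosine similarity would be the usual refinement
    return max(title_vecs, key=lambda vid: float(np.dot(q, title_vecs[vid])))

print(search("marriage"))   # expected: '2kyS6SvSYSE'
```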

keonchennl avatar Nov 22 '23 11:11 keonchennl

  • making progress!
    • [X] Got 96% experiment, thesis is out of the risky zone
    • [X] Explored various options, BERT, Word2Vec, pre-trained models, etc.
    • [ ] Turn the best experiment into master thesis .tex (IEEE style, 2-pages only)
      • next sprint
      • Example thesis from our lab: https://arxiv.org/pdf/2306.15044.pdf and also https://arxiv.org/pdf/2307.01411.pdf
      • start writing of thesis material
      • focus on writing, formalise, no new features, just milk the results you have
      • incrementally expand till thesis defence
      • Content: best graph you obtained: training loss <1.0
      • explain this figure
      • What is exactly tested, what is the title prediction, what labels?
      • Describe loss function!
      • Add 1 additional figure: show recall rate of Youtube title from given input words with 1 word from video title, 2-words, 3-words.... 10-words.
        • create a term-frequency table and only use unique words?
    • [ ] Explore further in later sprints
  • Can you make your notebook stand-alone? (us_videos_data = pd.read_csv(workdir_path / 'USvideos.csv'))
    • URL: https://www.kaggle.com/datasets/datasnaek/youtube-new?resource=download&select=USvideos.csv
    • Download fresh every time script runs?
    • No need for magic data on your Google Drive!
    • Simply works without 'install'
  • (possible future sprint) Follow the DSI Google paper for most thesis work?
    • Heavy GPU cluster with 8 Tesla V100 gpus
    • Another paper on DSI: https://github.com/ArvinZhuang/DSI-QG (running code)
    • Follow-up paper of the follow-up paper: https://arxiv.org/pdf/2305.02073.pdf
    • DSI paper uses semantic indexing, but we have a non-semantic 11 Byte Video-ID
    • Different technique?
  • 1 for each unique video_ID {11 bytes}:
    • The Youtube video titled "WE WANT TO TALK ABOUT OUR MARRIAGE" can be found at https://www.youtube.com/watch?v=2kyS6SvSYSE
    • the Youtube video titled "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" can be found at https://www.youtube.com/watch?v=1ZAPwfrtAFY
    • the Youtube video titled "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" can be found at https://www.youtube.com/watch?v=5qpjK5DgCt4
    • the Youtube video titled "Nickelback Lyrics: Real or Fake?" can be found at https://www.youtube.com/watch?v=puqaWrEC7tY

synctext avatar Nov 22 '23 13:11 synctext

  • Draft https://www.overleaf.com/read/jnbcnktyfrgq#719f90
  • Refactored according to the feedback from Quinten
  • The experiment giving input words with different numbers of words is still a TODO.
  • Updating the BERT notebook in Kaggle. Since the model in Kaggle is in TensorFlow, it still takes some time to adjust the code to get it working, both for loading the existing model and for training.

keonchennl avatar Dec 13 '23 11:12 keonchennl

  • First master thesis text :tada:
  • Add an architecture Figure and section with "Architecture and Design"
  • Realised today the URL is not embedded; the model output is an embedding of a certain title, usable for a table lookup of the Youtube ID.
  • "We perform training on an NVIDIA T4 GPU for 8 epochs.": add that you simply use the free Google GPU cloud offering
  • ToDo: mention the DSI work in your thesis.
  • Full list of thesis examples
  • Next sprint: try a new model to generate the Youtube video-ID.
    • Extend the vocabulary, then can we re-use existing weights?? Nope. :cry:
    • What model and weights are we using?
      • The whatever-works scientific methodology :fearful:
      • https://github.com/PiotrNawrot/nanoT5
      • NanoGPT
  • DAS6 account for Delft cluster A5000 access

synctext avatar Dec 13 '23 12:12 synctext

  • Got hands-on and learned how to work with the DAS system
    • DAS6 is pretty bare-metal, so it is a bit difficult to set up the environment (compiling Python, installing dependencies, etc.)
    • Waiting for an internal update of the C compiler on the DAS6-Delft side for compiling Python, which requires admin privileges
  • Insufficient time was spent on experimenting due to work and personal reasons
  • Checked the T5 model and the nanoT5 repo
    • Since it is a text-to-text model and it suits very general tasks, modifying the model (such as by adding a layer) does not seem the proper way.
    • Instead, I could maybe try:
      1. Fine-tuning (or pretraining?) nanoT5 with the US-videos dataset such that it 'remembers' the data
      2. Use prompt engineering for evaluation, e.g. prompt: 'Retrieve a video ID to your knowledge given the following text: "" and return the video ID (an 11-character string) directly'; the output is then expected to be the video id
      3. Use the output for performance evaluation

Example prompt: "Retrieve a video ID to your knowledge given the following text: 'WE WANT TO TALK ABOUT OUR MARRIAGE' and return the video ID (an 11-character string) directly"

And the expected output should be: "2kyS6SvSYSE" (from url https://www.youtube.com/watch?v=2kyS6SvSYSE)

The training examples could be:

  • positive sample: "The Youtube video titled "WE WANT TO TALK ABOUT OUR MARRIAGE" has video id: '2kyS6SvSYSE'"
  • negative sample: "The Youtube video titled "WE WANT TO TALK ABOUT OUR MARRIAGE" has video id: '1ZAPwfrtAFY'" (where 1ZAPwfrtAFY is from another video)
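
A hedged sketch of this idea (my own illustration; "t5-small" from Hugging Face and the shortened prompt are assumptions, not decisions from this thread): fine-tune a T5 on title-to-video_id pairs and generate the 11-character ID.

```python
# Sketch: one training step and one generation step for the title -> video_id task with T5.
# Assumes `pip install transformers sentencepiece`.
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

source = 'Retrieve the video id of the Youtube video titled "WE WANT TO TALK ABOUT OUR MARRIAGE"'
target = "2kyS6SvSYSE"

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

loss = model(**inputs, labels=labels).loss     # a real run would backprop this over many epochs
generated = model.generate(**inputs, max_length=15)
print(float(loss), tokenizer.decode(generated[0], skip_special_tokens=True))
```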

  • The other idea: BERT + a last layer as direct video-id output:
    • Since BERT only uses an 'encoder', this might work out.
    • I haven't tried it out yet

keonchennl avatar Jan 09 '24 10:01 keonchennl

synctext avatar Jan 09 '24 11:01 synctext

  • Experiment with T5 (the naive approach); see the t5-experiment drawio diagram. The model training logs can be found here.

  • [One of the notebooks]

  • The main doubt now is how the model sees (encodes/decodes) the video_ids. Further exploration of the new ideas is ongoing.

keonchennl avatar Jan 22 '24 13:01 keonchennl

  • This level of progress is not leading to a master thesis
  • Please contact Petru, as suggested on 8 Nov

synctext avatar Jan 31 '24 08:01 synctext

  • Thanks to the 'debug' session with Petru, things got clarified and a defect in the code was discovered and fixed. Some findings during exploration after the session:

    • The plateau of the learning graph: the learning was still going on but might have been stuck around some local optimum. By continuing training for enough extra epochs (20 more), the loss starts to drop again. Each line in this graph belongs to one run of the training. The purple line belongs to a 200-sample run, and the green line belongs to the original dataset of 30k samples without deduplication.
    • Learning rate: the learning rate was suspected to be the reason, but it turns out the setup is OK. I used the default initial learning rate of 0.001, the default AdamW optimizer, and the linear scheduler, which can get the model to converge well.
    • About the doubt that the model cannot see a whole video_id as one token: it turned out that the small T5 model can encode the video_ids using its existing vocabulary. I tried adding each video_id manually as one token, but then the model does not work anymore. One explanation is that the pre-trained model does not know the new tokens at all and thus needs to learn them from zero, whereas the input words are mostly already in its vocabulary. This could make a pre-trained model hard to train with our small training set.
    • The max_length for model.generate() can affect the performance. I used to set it to 11 (the exact length of a video_id), but that lowers the performance by generating partial IDs. I think this is because special tokens affect the generation even if I skip them. Later I found that setting it to 15 gives the best results.
  • As the down-scaling experiment works, I picked out 50 samples and trained for more epochs until the model overfits (<0.0001 loss). The recall rate nicely reaches at most 100% (though not stably: it varies from 76% to 100%). However, since it overfits so much, only the exact title gives a valid and correct ID. If I input a partial title or one or a few words from the title, the model starts to hallucinate a lot.

    • I then scaled up to 200 samples, which also got 99% recall (99% valid video IDs and 99% mapped to the correct video title), but the hallucinations are the same.
    • Then I also tried the full unique dataset (6455 unique titles). The training time starts to explode. The plateau in the learning graph still appeared, but the loss continued going down after some more epochs. With 3 hours of training, the loss only drops to 0.02, resulting in a 20% recall.
  • I realized that 'overfit as much as possible' might be the wrong direction, because for searching we actually want the model to generalize to handle fuzzy searches. We want it to also perform well when we input part of the title or some keywords. In the exploration with BERT, the final mapping from the output index embedding to the video_id somehow hid this issue. Now that the model directly outputs the video_id, it is time to avoid overfitting.

  • I then came back to the 50-sample exploration. I tried data augmentation: I sampled phrases and words from each title and included the lower-cased words in these corpora. The augmented dataset size goes up to ~650, about 15 times the original dataset.

But this seems to work well. The recall rate reaches 100% after 100 epochs of training (3 hours).

A demo notebook can be found [here]

keonchennl avatar Feb 16 '24 20:02 keonchennl

  • As the 50-sample dataset gives good results, I tried scaling up directly to training with 6455 samples with augmented data again. I set the number of epochs lower than in the 50-sample run; the required training time was expected to be 17 hours, but it still crashed at the 13th hour due to the Colab environment: the Colab free tier allows at most a 12-hour connection, even when using a custom GCP compute engine.

  • I retried using 2030 samples (augmented to 15108 samples) with 2006 video ids and trained for 13 hours. The training finished successfully, but the resulting recall rate was low.

  • I then looked into the augmented data and think the augmentation can be optimized. I switched to using spaCy to sub-sample keywords from the titles, and I optimized preprocessing of the data by lower-casing both the original title and the augmented part (a sketch of this augmentation follows the list).

  • A rerun on 2030 samples (augmented to 10605 samples) with 2007 video_ids gives good results!
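
A sketch of the spaCy-based augmentation described above (my own guess at the procedure: keep lower-cased nouns and proper nouns from each title as extra queries for the same video_id; assumes the en_core_web_sm model is available):

```python
# Sketch: sub-sample keywords from a title with spaCy and pair them with the same video_id.
import spacy

nlp = spacy.load("en_core_web_sm")   # requires: python -m spacy download en_core_web_sm

def augment(title, video_id):
    doc = nlp(title)
    keywords = [t.text.lower() for t in doc
                if t.pos_ in ("NOUN", "PROPN") and not t.is_stop]
    samples = [(title.lower(), video_id)]            # the lower-cased original title
    samples += [(kw, video_id) for kw in keywords]   # one extra sample per keyword
    return samples

print(augment("The Terminator (1984) Official Trailer", "k64P4l2Wmeg"))
```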

keonchennl avatar Feb 19 '24 09:02 keonchennl

  • Bug FIXED by @pneague (special token skip, too low epochs)! Great step forward with thesis!
  • Master thesis level! :tada:
  • Please label your lines within your figures
  • "timeout or something and crashed", one of the 6 figure lines
  • Lots of experiments without documentation (so real machine-learning black magic :exclamation:)
    • "I tried to add each video_id manually as one token, but the model does not work anymore."
    • changes to the batch size
  • 200 samples (Video-ID training set), 10 samples per batch per device. Results 1 step == 20 samples. With 1000 epoch setting: 1000*20 = 20k training steps
  • @qstokkink first step towards calculating storage limit and compression level of the 240 MByte model
  • Your dataset is extremely limited with "Video Title"
  • No semantic data to train on for an LLM the size of your "small T5". Use tags and see the influence?
  • Original dataset contains tags: https://www.kaggle.com/datasets/rsrishav/youtube-trending-video-dataset?select=US_youtube_trending_data.csv
| URL | Description | Tags |
| --- | --- | --- |
| https://youtu.be/eogpIG53Cis | Blade Runner (1982) Official Trailer - Ridley Scott, Harrison Ford Movie | trailers HD, hd, trailers, trailer, 2013, official, HD, classic trailers, oldhollywoodtrailers, Harrison Ford, sci-fi, thriller, classic, blade runner, blade runner official trailer, blade runner trailer |
| https://youtu.be/vKQi3bBA1y8 | The Matrix (1999) Official Trailer #1 - Sci-Fi Action Movie | classic movie, movieclips, movieclipstrailers, movie clips, movieclipsDOTcom, movieclipscomingsoon, zefr, jslewis, Matrix, The Matrix movie, The Matrix trailer, The Matrix film, Lana Wachowski, Andy Wachowski, wachowkis, Keanu Reeves, Laurence Fishburne, Carrie-Anne Moss, Hugo Weaving, matrix, sci-fi, action, bullet time |
| https://youtu.be/k64P4l2Wmeg | The Terminator (1984) Official Trailer - Arnold Schwarzenegge Movie | The Terminator, The Terminator movie, The Terminator trailer, 1984, James Cameron, Arnold Schwarzenegger, Linda Hamilton, Michael Biehn, Lance Henriksen, Earl Boen, Bill Paxton, Dick Miller, cyborg, indestructible, assassinate, war against the machines, soldier, i'll be back, Come with me if you want to live., Kyle Reese, Sarah Connor, Terminator, action, sci-fi, fandango, movieclips, trailer, classic trailer, trailer vault, mgm, hd |
| https://youtu.be/bwcADuJZDNA | Mad Max: The Road Warrior 4K Trailer Warner Bros. Entertainment | Warner brothers movies, warner bros movies 2019, warner bros movies trailers, warner bros movies 2020, warner brothers home entertainment, warnermedia, buy movies on youtube, stream movies online, rent movies online, Buy Mad Max: The Road Warrior online, Watch Mad Max: The Road Warrior online, Rent Mad Max: The Road Warrior, Stream Mad Max: The Road Warrior online, Stream Mad Max: The Road Warrior full movie online, watch Mad Max: The Road Warrior full movie online, 4K Trailer |

ToDo next sprint: document your first 2 (additional) master thesis pages. 1 figure with, for example, 20, 50, 200, 2030, and 6455 samples. Both a learning-rate figure and a precision figure? All lower-case and using your spaCy sub-sampling idea? Please be sure to explain everything you are doing; another master student should be able to reproduce your results somewhat. (https://www.overleaf.com/read/jnbcnktyfrgq#719f90)

synctext avatar Feb 19 '24 09:02 synctext