YouTube-Like-predictor icon indicating copy to clipboard operation
YouTube-Like-predictor copied to clipboard

YouTube Like Count Predictions using Machine Learning

YouTube Like Count Predictor

This a tool for getting youtube video like count prediction.A Random Forest model was used for training on a large dataset of ~3,50,000 videos.Feature engineering,Data cleaning, Data selection and many other techniques were used for this task.

Report

Report.pdf contains a detailed explanation of different steps and techniques that were used for this task.

Tools Used

How to run :

  1. Clone this repo

    $ git clone https://github.com/ayush1997/YouTube-Like-predictor.git
    $ cd PS17_Ayush_Singh
    
  2. Create new virtual environment

    $ sudo pip install virtualenv
    $ virtualenv venv
    $ source venv/bin/activate
    $ pip install -r requirements.txt
     ```
    
  3. Predictions

    There are two ways for getting the prediction results.

    3.1. Training the model and run prediction

    $ cd model
    $ python train_model.py
    

    This will save a model-final file in the same folder,Training takes ~18 Mins.Then run

    $ python predict.py <list of video ids>
    

    for ex: $ python predict.py dOyJqGtP-wU ASO_zypdnsQ wEduiMyl0ko

    3.2 From pretrained model

    A pretrained model has been uploaded on dropbox.Download model(~500MB) from the link.

    Unzip the model-final file in the model folder.

    $ cd model
    $ python predict.py <list of video ids>
    

    for ex: $ python predict.py vid1 vid2 vid3]

Note: List can contain a maximum of 40 Video IDs at the time of run.

Code Details

Below is a brief description for the Code files/folder in repo.

data/

This folder contains scripts which were used to fetch data using Youtube API and populatin the base.

$ cd data

get_IDS.py

The script uses Youtube Search API for extracting Video IDs for the last 7 years(2010-2016).It gives Approx. 22,000-24,000 Video IDs for every category and stores them in a Pickle files for different categories.

$ python predict.py <list of video ids>

scrape_video.py

The script use the Video IDs saved by get_IDS.py and further extract different video related attributes using Youtube API and saves the data Dictionary in pickle format.

$ python scrape_video.py

scrape_channel.py

The script is used to further collect data for all channels present in the video dataset.It makes use of the data stored for videos to extract channelIds.

$ python scrape_channel.py

scrape_social.py

The script is used to scrape social links

$ python scrape_social.py

Note : Due to large amount of data to be extracted for different attributes,the extraction was done at different levels therefore it was not viable to make a single script for data collection which could make debugging a little messy.

notebook/

This folder contains ipython notebooks which contain implementation for merging different data extracted and tasks like Data cleaning and processing.

$ jupyter notebook

FeatureEngineering.ipynb

The notebook has the implementation for making new derived features.

DataProcessing.ipynb

This notebook contains data processing implementation for data cleaning and encoding processes.

Note : The final data generated after all processing has been uploaded in dataset/data.csv. dataset/data_final.csv has the data which is used for training the model.

model/

This folders contains scripts used for training,tuning model and getting the prediction results.

model_grid.py

This script generates the tuned parameters for estimator using Grid Search and Cross Validation.

$ python model_grid.py

train_model.py

This script is used for training the model over training data ( dataset/data_final.csv ) Because of Bootstrap Sampling in random forest the results migght vary after every trainig process.

$ python train_model.py

predict.py

This script returns the Like count prediction along with the difference and the Error rate

$ cd model
$ python predict.py <list of video ids>

for ex: $ python predict.py [vid1,vid2,vid3]

Issues

A very common issue comes with the pickling process which sometime leads to loss of information and different results every time.

Report

1 2 3 4 5 6 7 8