Movies-Analytics-in-Spark-and-Scala

Data cleaning, pre-processing, and Analytics on a million movies using Spark and Scala.


Overview

Solving analytical questions on the semi-structured MovieLens dataset, which contains a million records, using Spark and Scala. This project features the use of Spark RDDs, Spark SQL, and Spark DataFrames executed on the Spark shell (REPL) using the Scala API. We aim to draw useful insights about users and movies by leveraging the different forms of the Spark APIs.

Major Components

Environment

  • Linux (Ubuntu 15.04)
  • Hadoop 2.7.2
  • Spark 2.0.2
  • Scala 2.11

Installation steps

  1. Clone the repository:

    git clone https://github.com/Thomas-George-T/Movies-Analytics-in-Spark-and-Scala.git
    
  2. In the repo, navigate to the Spark RDD, Spark SQL, or Spark DataFrame directories as needed.

  3. Run the execute script to view the results:

    sh execute.sh
    
  4. The execute.sh script passes the Scala code through spark-shell and then displays the findings from the results folder in the terminal.

Analytical Queries

Spark RDD

  • What are the top 10 most viewed movies?
  • What is the distinct list of genres available?
  • How many movies are there for each genre?
  • How many movies start with a number or a letter (for example, starting with 1/2/3... or A/B/C...Z)?
  • List the latest released movies
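Queries like the first one above can be sketched with the RDD API. This is a minimal, self-contained example (not the repo's exact code) that counts ratings per movie as a proxy for views, using a few inline sample lines in the MovieLens ratings.dat layout (`UserID::MovieID::Rating::Timestamp`); outside spark-shell you would create the SparkSession yourself as shown:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("TopViewedSketch")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Sample lines in the ratings.dat layout: UserID::MovieID::Rating::Timestamp
val ratings = sc.parallelize(Seq(
  "1::1193::5::978300760",
  "2::1193::4::978298413",
  "3::2355::3::978824291"
))

// One rating = one view; count per movie, sort descending, take the top 10
val topViewed = ratings
  .map(line => (line.split("::")(1), 1))
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)
  .take(10)

topViewed.foreach { case (movieId, views) => println(s"$movieId\t$views") }
spark.stop()
```

In spark-shell the `spark` and `sc` values are predefined, so the builder lines can be dropped there.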

Spark SQL

  • Create tables for movies.dat, users.dat and ratings.dat: Saving Tables from Spark SQL
  • Find the list of the oldest released movies.
  • How many movies are released each year?
  • How many movies are there for each rating?
  • How many users have rated each movie?
  • What is the total rating for each movie?
  • What is the average rating for each movie?
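The counting, total, and average queries above can be combined into one hedged Spark SQL sketch. The rows, column names (`user_id`, `movie_id`, `rating`), and temp-view name are illustrative assumptions, not the repo's saved tables:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("RatingsSqlSketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Illustrative sample rows; the repo loads these from ratings.dat instead
val ratings = Seq((1, 1193, 5), (2, 1193, 4), (3, 2355, 3))
  .toDF("user_id", "movie_id", "rating")
ratings.createOrReplaceTempView("ratings")

// Users per movie, total rating, and average rating in one grouped query
val summary = spark.sql(
  """SELECT movie_id,
    |       COUNT(*)    AS num_users,
    |       SUM(rating) AS total_rating,
    |       AVG(rating) AS avg_rating
    |FROM ratings
    |GROUP BY movie_id""".stripMargin)

summary.show()
spark.stop()
```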

Spark DataFrames

  • Prepare Movies data: Extracting the Year and Genre from the Text
  • Prepare Users data: Loading a double delimited csv file
  • Prepare Ratings data: Programmatically specifying a schema for the dataframe
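The last step, programmatically specifying a schema, can be sketched as follows. This assumes the MovieLens ratings.dat layout (`UserID::MovieID::Rating::Timestamp`); the field names are illustrative choices, not necessarily the ones used in the repo:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .appName("SchemaSketch")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Build the schema at runtime instead of relying on a case class
val ratingsSchema = StructType(Seq(
  StructField("user_id",   IntegerType, nullable = false),
  StructField("movie_id",  IntegerType, nullable = false),
  StructField("rating",    IntegerType, nullable = false),
  StructField("timestamp", LongType,    nullable = false)
))

// Parse each "::"-delimited line into a Row matching the schema
val rows = sc.parallelize(Seq("1::1193::5::978300760", "2::661::3::978302109"))
  .map(_.split("::"))
  .map(f => Row(f(0).toInt, f(1).toInt, f(2).toInt, f(3).toLong))

val ratingsDF = spark.createDataFrame(rows, ratingsSchema)
ratingsDF.printSchema()
spark.stop()
```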

Miscellaneous

  • Import Data from URL: Scala
  • Save table without defining DDL in Hive
  • Broadcast Variable example
  • Accumulator example
  • Databricks Community Edition
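A broadcast variable like the one listed above can be sketched in a few lines: a small movie-id-to-title lookup is shipped once to every executor so each task can resolve titles locally instead of joining against a large table. The lookup map here is made-up sample data:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("BroadcastSketch")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Small lookup table, broadcast once to all executors
val titles = sc.broadcast(Map(
  1193 -> "One Flew Over the Cuckoo's Nest (1975)",
  2355 -> "Bug's Life, A (1998)"
))

// Each task reads the broadcast value locally; no shuffle or join needed
val named = sc.parallelize(Seq(1193, 2355, 9999))
  .map(id => titles.value.getOrElse(id, "unknown"))
  .collect()

named.foreach(println)
spark.stop()
```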

Note: The results were collected and repartitioned into a single text file. This is not a recommended practice, since it significantly impacts performance, but it is done here for the sake of readability.

Mentions

This project was featured on Data Machina Issue #130, listed at number 3 under ScalaTOR. Thank you for the listing.

License

This repository is licensed under the Apache License 2.0 - see License for more details.