Movies-Analytics-in-Spark-and-Scala

Data cleaning, pre-processing, and Analytics on a million movies using Spark and Scala.


Overview

Solving analytical questions on the semi-structured MovieLens dataset, which contains a million records, using Spark and Scala. This project features the use of Spark RDDs, Spark SQL, and Spark DataFrames executed on the Spark shell (REPL) using the Scala API. We aim to draw useful insights about users and movies by leveraging the different forms of the Spark APIs.

Major Components

Environment

  • Linux (Ubuntu 15.04)
  • Hadoop 2.7.2
  • Spark 2.0.2
  • Scala 2.11

Installation steps

  1. Clone the repository:

    git clone https://github.com/Thomas-George-T/Movies-Analytics-in-Spark-and-Scala.git
    
  2. In the repo, navigate to the Spark RDD, Spark SQL, or Spark DataFrame directories as needed.

  3. Run the execute script to view the results:

    sh execute.sh
    
  4. The execute.sh script passes the Scala code through spark-shell and then displays the findings from the results folder in the terminal.

Analytical Queries

Spark RDD

  • What are the top 10 most viewed movies?
  • What is the distinct list of genres available?
  • How many movies are there for each genre?
  • How many movies start with a number or a letter (for example, starting with 1/2/3... or A/B/C...Z)?
  • List the latest released movies
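Queries like the first one above can be sketched with the RDD API. This is a minimal, self-contained example (not the repo's exact code) that counts ratings per movie as a proxy for views, using a few inline sample lines in the MovieLens ratings.dat layout (`UserID::MovieID::Rating::Timestamp`); outside spark-shell you would create the SparkSession yourself as shown:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("TopViewedSketch")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Sample lines in the ratings.dat layout: UserID::MovieID::Rating::Timestamp
val ratings = sc.parallelize(Seq(
  "1::1193::5::978300760",
  "2::1193::4::978298413",
  "3::2355::3::978824291"
))

// One rating = one view; count per movie, sort descending, take the top 10
val topViewed = ratings
  .map(line => (line.split("::")(1), 1))
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)
  .take(10)

topViewed.foreach { case (movieId, views) => println(s"$movieId\t$views") }
spark.stop()
```

In spark-shell the `spark` and `sc` values are predefined, so the builder lines can be dropped there.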

Spark SQL

  • Create tables for movies.dat, users.dat and ratings.dat: Saving Tables from Spark SQL
  • Find the list of the oldest released movies.
  • How many movies are released each year?
  • How many movies are there for each rating?
  • How many users have rated each movie?
  • What is the total rating for each movie?
  • What is the average rating for each movie?
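The counting, total, and average queries above can be combined into one hedged Spark SQL sketch. The rows, column names (`user_id`, `movie_id`, `rating`), and temp-view name are illustrative assumptions, not the repo's saved tables:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("RatingsSqlSketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Illustrative sample rows; the repo loads these from ratings.dat instead
val ratings = Seq((1, 1193, 5), (2, 1193, 4), (3, 2355, 3))
  .toDF("user_id", "movie_id", "rating")
ratings.createOrReplaceTempView("ratings")

// Users per movie, total rating, and average rating in one grouped query
val summary = spark.sql(
  """SELECT movie_id,
    |       COUNT(*)    AS num_users,
    |       SUM(rating) AS total_rating,
    |       AVG(rating) AS avg_rating
    |FROM ratings
    |GROUP BY movie_id""".stripMargin)

summary.show()
spark.stop()
```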

Spark DataFrames

  • Prepare Movies data: Extracting the Year and Genre from the Text
  • Prepare Users data: Loading a double delimited csv file
  • Prepare Ratings data: Programmatically specifying a schema for the dataframe
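The last step, programmatically specifying a schema, can be sketched as follows. This assumes the MovieLens ratings.dat layout (`UserID::MovieID::Rating::Timestamp`); the field names are illustrative choices, not necessarily the ones used in the repo:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .appName("SchemaSketch")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Build the schema at runtime instead of relying on a case class
val ratingsSchema = StructType(Seq(
  StructField("user_id",   IntegerType, nullable = false),
  StructField("movie_id",  IntegerType, nullable = false),
  StructField("rating",    IntegerType, nullable = false),
  StructField("timestamp", LongType,    nullable = false)
))

// Parse each "::"-delimited line into a Row matching the schema
val rows = sc.parallelize(Seq("1::1193::5::978300760", "2::661::3::978302109"))
  .map(_.split("::"))
  .map(f => Row(f(0).toInt, f(1).toInt, f(2).toInt, f(3).toLong))

val ratingsDF = spark.createDataFrame(rows, ratingsSchema)
ratingsDF.printSchema()
spark.stop()
```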

Miscellaneous

  • Import Data from URL: Scala
  • Save table without defining DDL in Hive
  • Broadcast Variable example
  • Accumulator example
  • Databricks Community Edition
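A broadcast variable like the one listed above can be sketched in a few lines: a small movie-id-to-title lookup is shipped once to every executor so each task can resolve titles locally instead of joining against a large table. The lookup map here is made-up sample data:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("BroadcastSketch")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Small lookup table, broadcast once to all executors
val titles = sc.broadcast(Map(
  1193 -> "One Flew Over the Cuckoo's Nest (1975)",
  2355 -> "Bug's Life, A (1998)"
))

// Each task reads the broadcast value locally; no shuffle or join needed
val named = sc.parallelize(Seq(1193, 2355, 9999))
  .map(id => titles.value.getOrElse(id, "unknown"))
  .collect()

named.foreach(println)
spark.stop()
```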

Note: The results were collected and repartitioned into a single text file. This is not a recommended practice, since it significantly impacts performance, but it is done here for the sake of readability.

Mentions

This project was featured on Data Machina Issue #130, listed at number 3 under ScalaTOR. Thank you for the listing.

License

This repository is licensed under the Apache License 2.0 - see License for more details.