DSE230_Data_Analysis_Using_Hadoop_and_Spark_UCSD

Map-reduce, streaming analysis, and external-memory algorithms, and their implementation using Hadoop and its ecosystem: HBase, Hive, Pig and Spark. The class includes assignments on analyzing large existing databases.

Spark Installation (Python)

| Operating System | Blog Post | YouTube Video |
|---|---|---|
| Mac | Install Spark on Mac | YouTube Video |
| Ubuntu | Install Spark on Ubuntu | YouTube Video |
| Windows | Install Spark on Windows | YouTube Video |

Section 1: Distributed computation using Map Reduce

  • map-reduce
  • counting words example, loading, processing, collecting.
  • The work environment: Notebooks, markdown, code cells, display cells, S3, passwords and Vault, github.
  • the memory hierarchy, S3 File, SQL tables, data frames / RDD, Parquet files.
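
The counting-words example in this section can be sketched without a cluster. The following is a minimal pure-Python illustration of the map, shuffle, and reduce phases, not the course's notebook code; the function names and the two sample lines are made up for illustration. In PySpark the same pipeline is usually written with the RDD methods flatMap, map, reduceByKey, and collect.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle_phase(pairs):
    """Shuffle: group all values that share a key (done by the framework in Hadoop/Spark)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key with an associative operation (here, addition)."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


def word_count(lines):
    """Loading is assumed done; this chains processing (map, shuffle, reduce) and collecting."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Hypothetical input, standing in for lines read from a text file.
lines = ["call me ishmael", "call me maybe"]
counts = word_count(lines)
```

Because addition is associative and commutative, the reduce step can be applied to partial groups on different machines and the partial sums merged afterwards, which is what makes the computation easy to distribute.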

Section 2: Analysis based on squared error

  • Built-in PCA: https://github.com/apache/spark/blob/master/examples/src/main/python/ml/pca_example.py
  • Built-in Regression
    • Guide: http://spark.apache.org/docs/latest/mllib-linear-methods.html#regression
    • Python API: http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.regression
    • Example Code: https://github.com/apache/spark/blob/master/examples/src/main/python/ml/linear_regression_with_elastic_net.py
  • PCA with missing values
  • Mahalanobis Distance
  • K-means
  • Compressed representation and reconstruction
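
As one concrete instance of squared-error analysis, K-means can be sketched as Lloyd's algorithm: alternate between assigning each point to its nearest center and moving each center to the mean of its points, which never increases the total squared error. This is a small single-machine sketch in plain Python, not Spark's implementation; the function names, the four sample points, and the initial centers are assumptions chosen for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Assignment step: index of the nearest center for every point."""
    return [
        min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
        for p in points
    ]


def update(points, labels, centers):
    """Update step: move each center to the mean of the points assigned to it."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            new_centers.append(old)  # an empty cluster keeps its previous center
            continue
        new_centers.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centers


def kmeans(points, centers, max_iter=100):
    """Run Lloyd's algorithm until the centers stop moving or max_iter is reached."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = update(points, labels, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


def total_squared_error(points, centers, labels):
    """The objective K-means minimizes: sum of squared distances to assigned centers."""
    return sum(squared_distance(p, centers[l]) for p, l in zip(points, labels))


# Hypothetical data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers, labels = kmeans(points, centers=[(0.0, 0.0), (10.0, 0.0)])
error = total_squared_error(points, centers, labels)
```

Replacing each point by its assigned center is also a simple form of compressed representation, and the total squared error measures how much is lost in the reconstruction.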

Section 3: Classification

  • Logistic regression
    • https://github.com/apache/spark/blob/master/examples/src/main/python/ml/logistic_regression_with_elastic_net.py
  • Tree-based regression
    • https://github.com/apache/spark/blob/master/examples/src/main/python/mllib/decision_tree_regression_example.py
  • Ensemble methods for classification
    • Random forests: https://github.com/apache/spark/blob/master/examples/src/main/python/mllib/random_forest_classification_example.py
    • Gradient-boosted trees: https://github.com/apache/spark/blob/master/examples/src/main/python/mllib/gradient_boosting_classification_example.py
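
To make the first item above concrete, here is a minimal sketch of binary logistic regression trained by batch gradient descent on the log-loss. It is plain Python on a single feature, without the elastic-net regularization used in the linked Spark example; the function names, learning rate, iteration count, and toy data are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, lr=0.5, iterations=200):
    """Fit weight w and bias b by gradient descent on the mean log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        # Gradient of the mean log-loss: mean of (prediction - label) * input.
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Predict class 1 when the estimated probability exceeds 0.5."""
    return 1 if sigmoid(w * x + b) > 0.5 else 0


# Hypothetical, linearly separable training data.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]
w, b = train_logistic(xs, ys)
predictions = [predict(w, b, x) for x in xs]
```

On separable data like this the unregularized weight keeps growing with more iterations, which is one reason the Spark example adds a regularization term.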

Section 4: Performance tuning: measuring and tuning spark applications

  • Configuration: http://spark.apache.org/docs/latest/configuration.html
  • Monitoring: http://spark.apache.org/docs/latest/monitoring.html
  • Tuning: http://spark.apache.org/docs/latest/tuning.html

Section 5: Spark Streaming and stochastic gradient descent

  • Streaming: http://spark.apache.org/docs/latest/configuration.html#spark-streaming
  • SGD: http://spark.apache.org/docs/latest/mllib-optimization.html#stochastic-gradient-descent-sgd
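
The link between streaming and SGD is that SGD updates the model one sample at a time, so it never needs the full data set in memory. The sketch below fits a line by least squares from an iterator of (x, y) pairs; it is plain Python rather than Spark Streaming, and the function names, learning rate, and synthetic stream (points on y = 2x + 1, replayed in a fixed order) are assumptions for illustration.

```python
import itertools


def sgd_step(w, b, x, y, lr):
    """One SGD update for the per-sample loss 0.5 * (w*x + b - y)**2."""
    error = (w * x + b) - y
    return w - lr * error * x, b - lr * error


def fit_from_stream(stream, lr=0.05):
    """Consume (x, y) pairs one at a time, keeping only the current model in memory."""
    w, b = 0.0, 0.0
    for x, y in stream:
        w, b = sgd_step(w, b, x, y, lr)
    return w, b


# Hypothetical stream: four noise-free points on y = 2x + 1, replayed 500 times.
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
stream = itertools.islice(itertools.cycle(samples), 2000)
w, b = fit_from_stream(stream)
```

With noisy data a fixed learning rate leaves the estimate fluctuating around the optimum, so practical versions usually decrease the step size over time.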

Assignments (From Newest to Oldest)

  • [Homework 5 Part 2: Higgs Boson](https://github.com/mGalarnyk/DSE230_Data_Analysis_Using_Hadoop_and_Spark_UCSD/blob/master/Homeworks/HW5/2.Higgs.ipynb)
  • [Homework 5 Part 1: Cover Types](https://github.com/mGalarnyk/DSE230_Data_Analysis_Using_Hadoop_and_Spark_UCSD/blob/master/Homeworks/HW5/1.CoverType.ipynb)
  • [Homework 3 Part 2: Reconstruction of Plots](https://github.com/mGalarnyk/DSE230_Data_Analysis_Using_Hadoop_and_Spark_UCSD/blob/master/Homeworks/2.Reconstruction-HW-Copy.ipynb)
  • [Homework 3 Part 1: PCA analysis](https://github.com/mGalarnyk/DSE230_Data_Analysis_Using_Hadoop_and_Spark_UCSD/blob/master/Homeworks/1.PCA_analysis-HW-Copy.ipynb)
  • [Homework 2](https://github.com/mGalarnyk/DSE230_Data_Analysis_Using_Hadoop_and_Spark_UCSD/blob/master/Homeworks/HW-2.ipynb)
  • [Homework 1: Spark Moby Dick N Grams](https://github.com/mGalarnyk/DSE230_Data_Analysis_Using_Hadoop_and_Spark_UCSD/blob/master/Submissions/HW-1_MichaelGalarnyk.py)

Notes

  • [Timing for Regex vs string.translate and string.replace](https://github.com/mGalarnyk/DSE230_Data_Analysis_Using_Hadoop_and_Spark_UCSD/blob/master/Timing_Regex_Translate_Replace_Join.ipynb)