DSE230_Data_Analysis_Using_Hadoop_and_Spark_UCSD
Map-reduce, streaming analysis, and external-memory algorithms, and their implementation using Hadoop and its ecosystem: HBase, Hive, Pig, and Spark. The class will include assignments analyzing large existing databases.
Spark Installation (Python)
| Operating System | Blog Post | YouTube Video |
|---|---|---|
| Mac | Install Spark on Mac | YouTube Video |
| Ubuntu | Install Spark on Ubuntu | YouTube Video |
| Windows | Install Spark on Windows | YouTube Video |
Section 1: Distributed computation using Map Reduce
- Map-reduce
- Counting-words example: loading, processing, collecting
- The work environment: notebooks, Markdown, code cells, display cells, S3, passwords and Vault, GitHub
- The memory hierarchy: S3 files, SQL tables, data frames / RDDs, Parquet files
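The map-reduce pattern behind the counting-words example can be sketched in plain Python, without a Spark cluster. This is only an illustration of the three phases (map, shuffle/group, reduce); the function name and data are made up for this sketch:

```python
from itertools import groupby

def word_count(lines):
    # Map phase: emit a (word, 1) pair for every word in every line.
    pairs = [(word, 1) for line in lines for word in line.split()]
    # Shuffle phase: bring all pairs with the same key (word) together.
    pairs.sort(key=lambda kv: kv[0])
    grouped = groupby(pairs, key=lambda kv: kv[0])
    # Reduce phase: sum the counts for each word.
    return {word: sum(c for _, c in kvs) for word, kvs in grouped}

print(word_count(["to be or not to be"]))
# {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

In PySpark the same pipeline is expressed on an RDD with `flatMap` (map phase), `map` to form the pairs, and `reduceByKey` (shuffle + reduce phases).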
Section 2: Analysis based on squared error:
- Built-in PCA: https://github.com/apache/spark/blob/master/examples/src/main/python/ml/pca_example.py
- Built-in Regression
- Guide: http://spark.apache.org/docs/latest/mllib-linear-methods.html#regression
- Python API: http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.regression
- Example Code: https://github.com/apache/spark/blob/master/examples/src/main/python/ml/linear_regression_with_elastic_net.py
- PCA with missing values
- Mahalanobis Distance
- K-means
- Compressed representation and reconstruction
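K-means is the squared-error workhorse of this section. As a minimal sketch (plain Python on 1-D points, illustrative names and data; MLlib's `KMeans` is the distributed version), Lloyd's algorithm alternates an assignment step and an update step:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm for k-means on a list of 1-D points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center
        # (nearest in squared distance, the quantity k-means minimizes).
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: (p - centers[i]) ** 2)
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

points = [1.0, 1.2, 0.8, 10.0, 10.2, 9.8]
print(kmeans(points, 2))  # ≈ [1.0, 10.0]
```

The cluster means are also the "compressed representation": each point is reconstructed as its cluster's center, and the squared reconstruction error is what the algorithm drives down.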
Section 3: Classification:
- Logistic regression
- https://github.com/apache/spark/blob/master/examples/src/main/python/ml/logistic_regression_with_elastic_net.py
- Tree-based regression
- https://github.com/apache/spark/blob/master/examples/src/main/python/mllib/decision_tree_regression_example.py
- Ensemble methods for classification
- Random forests: https://github.com/apache/spark/blob/master/examples/src/main/python/mllib/random_forest_classification_example.py
- Gradient-boosted trees: https://github.com/apache/spark/blob/master/examples/src/main/python/mllib/gradient_boosting_classification_example.py
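The logistic regression linked above fits a linear model to the log-odds by minimizing the log-loss. A minimal plain-Python sketch of the same idea with batch gradient descent on a 1-D toy set (names and data are illustrative; this is not the MLlib implementation):

```python
import math

def train_logistic(xs, ys, lr=0.5, epochs=200):
    """Fit w, b so that P(y=1|x) = sigmoid(w*x + b), by gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the average negative log-likelihood (log-loss).
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (p - y) * x / n
            gb += (p - y) / n
        w -= lr * gw
        b -= lr * gb
    return w, b

# Toy data: class 1 for positive x, class 0 for negative x.
xs = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(xs, ys)
predict = lambda x: 1.0 / (1.0 + math.exp(-(w * x + b)))
print(predict(-2.0) < 0.5, predict(2.0) > 0.5)  # True True
```

Tree-based and ensemble methods replace the single linear boundary with many axis-aligned splits, voting (random forests) or stage-wise corrections (gradient boosting).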
Section 4: Performance tuning: measuring and tuning Spark applications
- Configuration: http://spark.apache.org/docs/latest/configuration.html
- Monitoring: http://spark.apache.org/docs/latest/monitoring.html
- Tuning: http://spark.apache.org/docs/latest/tuning.html
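The knobs described in those three guides are typically set at submission time. As an illustrative config fragment only (the resource values are made up and `my_app.py` is a placeholder; the configuration keys themselves are real Spark properties):

```shell
spark-submit \
  --master yarn \
  --executor-memory 4G \
  --executor-cores 2 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.default.parallelism=64 \
  my_app.py
```

The monitoring guide's web UI is where you check whether such settings actually helped (task skew, shuffle sizes, GC time).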
Section 5: Spark Streaming and stochastic gradient descent
- Streaming: http://spark.apache.org/docs/latest/streaming-programming-guide.html
- SGD: http://spark.apache.org/docs/latest/mllib-optimization.html#stochastic-gradient-descent-sgd
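SGD pairs naturally with streaming because it updates the model after every single example rather than after a full pass over the data. A minimal plain-Python sketch on a linear least-squares model (illustrative names and data; not Spark code):

```python
import random

def sgd_linear(stream, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w*x + b by stochastic gradient descent on squared error,
    updating the parameters after each individual (x, y) example."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(stream)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            err = (w * x + b) - y
            # Gradient of 0.5 * err**2 with respect to w and b.
            w -= lr * err * x
            b -= lr * err
    return w, b

# Noise-free stream generated by y = 3x + 1.
stream = [(x, 3.0 * x + 1.0) for x in [-2, -1, 0, 1, 2]]
w, b = sgd_linear(stream)
print(round(w, 3), round(b, 3))  # ≈ 3.0 1.0
```

In a true streaming setting there are no epochs: each arriving example is used once for one update and then discarded, which is exactly the access pattern Spark Streaming's micro-batches support.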