big-data-mapreduce-course
Big Data Modeling, MapReduce, Spark, PySpark @ Santa Clara University
Big Data Modeling & Analytics
Santa Clara University

Fall Quarter 2022
Course Information:
- Graduate School, Leavey School of Business
- Department of Information Systems & Analytics
- Course MSIS 2627: Big Data Modeling & Analytics
- Big-Data-MapReduce Course @ Santa Clara University
- Class meeting dates:
  - Start: September 9, 2022
  - End: December 9, 2022
- Class hours:
  - Tuesday 5:45 PM - 7:20 PM PST (TBDL/online/via Zoom)
  - Thursday 5:45 PM - 7:20 PM PST (TBDL/online/via Zoom)
- Instructor: Mahmoud Parsian
- Classroom: Lucas Hall 210
- Office: 216AA, 2nd Floor, Lucas Hall (not used due to COVID-19)
- Office Hours: TBDL (or by appointment)
- Office hours etiquette: if you plan to attend an office hour, please send me an email
Instructor:
Big Data Modeling Class Web Site
Required Books
- 1. Data Algorithms with Spark by Mahmoud Parsian
- 2. Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer
Required Introduction to MapReduce and Spark
- 1. A Very Brief Introduction to MapReduce by Diana MacLean
- 2. Introduction to MapReduce by Mahmoud Parsian
- 3. MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat
Additional Optional Books and References
- 1. PySpark Algorithms by Mahmoud Parsian
- 2. Source code @github.com -- PySpark Algorithms by Mahmoud Parsian
- 3. Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman
- 4. Big Data Now -- book
- 5. Designing Good MapReduce Algorithms by Jeffrey D. Ullman
- 6. Bigtable: A Distributed Storage System for Structured Data
- 7. Relational Algebra and MapReduce
- 8. MapReduce examples
- 9. MapReduce and relational algebra
- 10. Spark Streaming Tutorial
- 11. Billion Taxi Rides on Amazon Athena
Required Software:
- Apache Spark Site
- Apache Spark Download (use version 3.2.1)
Syllabus, Fall Quarter 2022
Grading and Class Conduct
Midterm Exam:
🍏 Date: Tuesday, October 25, 2022
🍏 Time: 5:45 PM - 7:20 PM PST
🍏 Midterm exam is closed book/notes/friends/internet/phone/software
Final Exam:
🍎 Date: December 6-9, 2022 (TBDL)
🍎 Time: 5:30 PM - 7:30 PM PST
🍎 Final exam is closed book/notes/friends/internet/phone/software
Course Description (High-Level)
The main focus of this class is to cover the following concepts:
- Concepts of Big Data
- Distributed File Systems
- Distributed Computing
- Distributed and Parallel Algorithms
- MapReduce Paradigm
- MapReduce Algorithms
- Scale-out Architectures (using Hadoop, Spark, PySpark)
- Apache Spark
- Use Spark, PySpark, and Python to teach MapReduce and distributed computing
- How to use SQL with NoSQL data
- Amazon Athena
- Amazon Athena, S3, Data Partitioning
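As a rough illustration of the MapReduce paradigm listed above, here is a minimal word-count sketch in plain Python. It is not taken from the course materials and does not use Spark or Hadoop; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample input are hypothetical, chosen only to show the map, shuffle, and reduce steps on a single machine.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an in-memory input."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input; a real job would read from a distributed file system.
lines = ["big data is big", "spark makes big data simple"]
counts = word_count(lines)
print(counts)
```

In a real cluster the mappers and reducers would run in parallel on different nodes, and the framework would perform the shuffle; this sketch only mirrors the data flow.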
Mahmoud Parsian's Latest Books:
Data Algorithms with Spark

PySpark Algorithms

Data Algorithms
