Data-Engineering-Nanodegree
This repository holds the python files and notebooks associated with the Udacity Data Engineering Nanodegree.
Data Engineering Nanodegree
Udacity Nanodegree
Explore the repository»
postgres, cassandra, aws, redshift, s3, emr, spark, airflow, ETL, ELT, data modelling, database schema, data warehousing, data lakes, data engineering, udacity
About The Nanodegree
Data engineers are responsible for making data accessible to all the people who use it across an organization. That could mean creating a data warehouse for the analytics team, building a data pipeline for a front-end application, or summarizing massive datasets to be more user-friendly.
Certificate

Program Details
During this program, we will complete four courses and five projects. Throughout the projects, we will play the part of a data engineer at a music streaming company. We will work with the same type of data in each project, but with increasing data volume, velocity, and complexity. Here’s a course-by-course breakdown.
Course 1 – Data Modeling
In this course, we will learn to create relational and NoSQL data models to fit the diverse needs of data consumers. In the project, we will build SQL (Postgres) and NoSQL (Apache Cassandra) data models using user activity data for a music streaming app.
Associated notebooks for this course can be found here.
Project 1 can be found here.
Project 2 can be found here.
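The two data models differ in how tables are designed. A minimal sketch of the contrast (the column names below are illustrative, not the exact project schema):

```python
# Relational (Postgres): a star schema centers a fact table on
# measurable events and references dimension tables by key.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

# NoSQL (Cassandra): tables are modeled query-first -- the primary key
# must match the WHERE clause of the query the table is meant to serve.
session_table_create = """
CREATE TABLE IF NOT EXISTS song_by_session (
    session_id      INT,
    item_in_session INT,
    artist          TEXT,
    song            TEXT,
    length          FLOAT,
    PRIMARY KEY (session_id, item_in_session)
);
"""
```

In Cassandra there are no joins, so one table is created per query pattern; in Postgres the same fact table can serve many queries via joins.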
Course 2 – Cloud Data Warehouses
In this course, we will learn to create cloud-based data warehouses. In the project, we will build an ELT pipeline that extracts data from Amazon S3, stages it in Amazon Redshift, and transforms it into a set of dimensional tables.
Associated notebooks for this course can be found here.
Project 3 can be found here.
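The staging step of such a pipeline typically relies on Redshift's `COPY` command to bulk-load files from S3. A hedged sketch of how that SQL might be templated in Python (the table name, bucket path, and IAM role are placeholders, not the project's actual values):

```python
# Template for bulk-loading JSON logs from S3 into a Redshift staging table.
STAGING_COPY = """
COPY staging_events
FROM '{s3_path}'
IAM_ROLE '{iam_role}'
REGION 'us-west-2'
FORMAT AS JSON '{json_path}';
"""

def build_copy_sql(s3_path, iam_role, json_path="auto"):
    """Fill in the COPY template for one staging load."""
    return STAGING_COPY.format(s3_path=s3_path,
                               iam_role=iam_role,
                               json_path=json_path)

sql = build_copy_sql("s3://my-bucket/log_data",
                     "arn:aws:iam::123456789012:role/myRedshiftRole")
```

After staging, `INSERT INTO ... SELECT` statements transform the staged rows into the dimensional tables, which is what makes the pipeline ELT rather than ETL.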
Course 3 – Data Lakes with Apache Spark
In this course, we will learn more about the big data ecosystem, how to work with massive datasets with Apache Spark, and how to store big data in a data lake. In the project, we will build an ETL pipeline for a data lake using Apache Spark and S3.
Associated notebooks for this course can be found here.
Project 4 can be found here.
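A core idea in the data lake project is Spark's partitioned Parquet output: `df.write.partitionBy("year", "month").parquet(path)` lays files out under Hive-style `key=value` directories. This stdlib-only sketch shows that layout logic without needing a Spark installation (the bucket path and fields are illustrative):

```python
from collections import defaultdict

def partition_paths(records, base="s3a://my-lake/songplays"):
    """Group records into the partition directories Spark would create
    for partitionBy("year", "month")."""
    buckets = defaultdict(list)
    for rec in records:
        key = f"{base}/year={rec['year']}/month={rec['month']}"
        buckets[key].append(rec)
    return dict(buckets)

rows = [
    {"year": 2018, "month": 11, "song": "a"},
    {"year": 2018, "month": 11, "song": "b"},
    {"year": 2019, "month": 1,  "song": "c"},
]
paths = partition_paths(rows)
```

Partitioning this way lets downstream queries that filter on year and month skip reading irrelevant files entirely.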
Course 4 – Data Pipelines with Apache Airflow
In this course, we will learn to schedule, automate, and monitor data pipelines using Apache Airflow. In the project, we will continue our work on the music streaming company’s data infrastructure by creating and automating a set of data pipelines.
Associated notebooks for this course can be found here.
Project 5 can be found here.
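At its heart, Airflow executes tasks in an order derived from a DAG of dependencies. This stdlib-only sketch (task names are illustrative, not the project's exact operators) shows the core idea: a task runs only after all of its upstream tasks have finished.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Upstream dependencies for a typical load-then-check pipeline:
# both staging tasks must finish before the fact-table load,
# and the data-quality check runs last.
dag = {
    "stage_events":   set(),
    "stage_songs":    set(),
    "load_songplays": {"stage_events", "stage_songs"},
    "quality_check":  {"load_songplays"},
}

run_order = list(TopologicalSorter(dag).static_order())
```

In real Airflow the same dependencies are declared with operators and the `>>` operator, and the scheduler also handles retries, backfills, and SLAs on top of this ordering.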
Capstone Project
In the Capstone project, we combine Twitter data, World Happiness Index data, and Earth surface temperature data to explore whether there is any correlation among them. The Twitter data is dynamic, while the other two datasets are static in nature. The general idea of this project is to extract Twitter data, analyze its sentiment, and use the resulting data to gain insights alongside the other datasets.
Capstone Project can be found here.
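The analysis step can be sketched in miniature: score tweet text with a sentiment lexicon, then correlate the scores with another metric such as a happiness index. The tiny lexicon and hand-rolled Pearson correlation below are toy placeholders, not the project's actual method or data:

```python
POSITIVE = {"happy", "great", "love"}
NEGATIVE = {"sad", "awful", "hate"}

def sentiment(text):
    """Naive lexicon score: +1 per positive word, -1 per negative word."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

A real pipeline would use a trained sentiment model rather than a word list, but the shape of the computation, per-record scoring followed by cross-dataset correlation, is the same.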
License
Distributed under the MIT License. See LICENSE for more information.
Contact
Vineeth S - [email protected]
Project Link: https://github.com/vineeths96/Data-Engineering-Nanodegree