This repository holds the Python files and notebooks associated with the Udacity Data Engineering Nanodegree.


Data Engineering Nanodegree

Udacity Nanodegree
Explore the repository»

postgres, cassandra, aws, redshift, s3, emr, spark, airflow, ETL, ELT, data modelling, database schema, data warehousing, data lakes, data engineering, udacity

About The Nanodegree

Data engineers are responsible for making data accessible to all the people who use it across an organization. That could mean creating a data warehouse for the analytics team, building a data pipeline for a front-end application, or summarizing massive datasets to be more user-friendly.

Certificate

certificate

Program Details

During this program, we will complete four courses and five projects. Throughout the projects, we will play the part of a data engineer at a music streaming company. We will work with the same type of data in each project, but with increasing data volume, velocity, and complexity. Here’s a course-by-course breakdown.

Course 1 – Data Modeling

In this course, we will learn to create relational and NoSQL data models to fit the diverse needs of data consumers. In the project, we will build SQL (Postgres) and NoSQL (Apache Cassandra) data models using user activity data for a music streaming app.
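As a rough illustration of the relational side, the sketch below builds a tiny star schema (one fact table, two dimension tables) and runs an analytical query against it. It is not the project code: the table and column names are hypothetical, the rows are made up, and Python's built-in sqlite3 module stands in for Postgres so the example runs without a database server.

```python
import sqlite3

# Hypothetical star schema: table and column names are illustrative only,
# and sqlite3 stands in for Postgres so the sketch runs without a server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT, level TEXT);
    CREATE TABLE songs (song_id TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE songplays (
        songplay_id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(user_id),
        song_id TEXT REFERENCES songs(song_id),
        started_at TEXT
    );
""")

# Made-up sample rows for the two dimensions and the fact table.
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "Ada", "paid"), (2, "Lin", "free")],
)
conn.executemany(
    "INSERT INTO songs VALUES (?, ?)",
    [("S1", "Intro"), ("S2", "Outro")],
)
conn.executemany(
    "INSERT INTO songplays (user_id, song_id, started_at) VALUES (?, ?, ?)",
    [
        (1, "S1", "2018-11-01T10:00:00"),
        (1, "S2", "2018-11-01T10:05:00"),
        (2, "S1", "2018-11-02T09:00:00"),
    ],
)

# Analytical query: join the fact table to a dimension and aggregate.
plays_per_user = conn.execute("""
    SELECT u.name, COUNT(*) AS plays
    FROM songplays AS sp
    JOIN users AS u ON u.user_id = sp.user_id
    GROUP BY u.name
    ORDER BY u.name
""").fetchall()
print(plays_per_user)
```

A Cassandra model would usually be approached the other way around, starting from the queries and designing one table per query, so this sketch covers only the relational half of the course.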

Associated notebooks for this course can be found here.

Project 1 can be found here.

Project 2 can be found here.

Course 2 – Cloud Data Warehouses

In this course, we will learn to create cloud-based data warehouses. In the project, we will build an ELT pipeline that extracts data from Amazon S3, stages it in Amazon Redshift, and transforms it into a set of dimensional tables.
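The extract, load, then transform order described above can be sketched in a few lines. This is a minimal stand-in rather than the project pipeline: a Python list plays the role of JSON log files in S3, an in-memory SQLite database plays the role of Redshift, and every table and column name is hypothetical.

```python
import json
import sqlite3

# Stand-ins: this list plays the role of JSON log files in S3, and an
# in-memory SQLite database plays the role of Redshift. All names are hypothetical.
raw_events = [
    '{"user_id": 1, "level": "free", "song": "Intro"}',
    '{"user_id": 1, "level": "paid", "song": "Outro"}',
    '{"user_id": 2, "level": "free", "song": "Intro"}',
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_events (user_id INTEGER, level TEXT, song TEXT)")
conn.execute("CREATE TABLE dim_users (user_id INTEGER PRIMARY KEY, level TEXT)")

# Extract + Load: copy the raw records into the staging table unchanged.
rows = [(e["user_id"], e["level"], e["song"]) for e in map(json.loads, raw_events)]
conn.executemany("INSERT INTO staging_events VALUES (?, ?, ?)", rows)

# Transform: build the dimension inside the database, keeping each
# user's most recently loaded level.
conn.execute("""
    INSERT INTO dim_users (user_id, level)
    SELECT user_id, level
    FROM staging_events
    WHERE rowid IN (SELECT MAX(rowid) FROM staging_events GROUP BY user_id)
""")

dim_users = conn.execute(
    "SELECT user_id, level FROM dim_users ORDER BY user_id"
).fetchall()
print(dim_users)
```

The point of the ordering is that the transformation runs as SQL inside the warehouse, after the raw data has already landed in a staging table.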

Associated notebooks for this course can be found here.

Project 3 can be found here.

Course 3 – Data Lakes with Apache Spark

In this course, we will learn more about the big data ecosystem, how to work with massive datasets with Apache Spark, and how to store big data in a data lake. In the project, we will build an ETL pipeline for a data lake using Apache Spark and S3.

Associated notebooks for this course can be found here.

Project 4 can be found here.

Course 4 – Data Pipelines with Apache Airflow

In this course, we will learn to schedule, automate, and monitor data pipelines using Apache Airflow. In the project, we will continue our work on the music streaming company’s data infrastructure by creating and automating a set of data pipelines.
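A pipeline of this kind is essentially a set of tasks plus the dependencies between them. The sketch below shows only that dependency-ordering idea, using graphlib from the Python standard library (Python 3.9+) instead of Airflow itself; the task names are hypothetical and are not taken from the project.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each key maps to the set of tasks it depends on.
dependencies = {
    "stage_events": {"begin"},
    "stage_songs": {"begin"},
    "load_songplays_fact": {"stage_events", "stage_songs"},
    "load_user_dimension": {"load_songplays_fact"},
    "run_quality_checks": {"load_user_dimension"},
}


def run_pipeline(deps):
    """Visit each task after its dependencies and return the execution order."""
    executed = []
    for task in TopologicalSorter(deps).static_order():
        # A real scheduler would invoke an operator here, with retries and logging.
        executed.append(task)
    return executed


order = run_pipeline(dependencies)
print(order)
```

The two staging tasks have no dependency on each other, so their relative order is not fixed; a scheduler such as Airflow could run them in parallel.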

Associated notebooks for this course can be found here.

Project 5 can be found here.

Capstone Project

In the Capstone project, we combine Twitter data, World Happiness Index data, and Earth surface temperature data to explore whether there is any correlation among them. The Twitter data is dynamic, while the other two datasets are static. The general idea of this project is to extract Twitter data, analyze its sentiment, and use the resulting data to gain insights in combination with the other datasets.
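One simple way to check two series for a linear relationship is the Pearson correlation coefficient. The sketch below is only an illustration of that measure and is not necessarily the method used in the project; the variable names are hypothetical and the numbers are made up, not drawn from any of the datasets.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up values for illustration only: one sentiment score and one
# happiness score per (hypothetical) country.
sentiment_by_country = [0.1, 0.4, 0.2, 0.8]
happiness_by_country = [5.0, 6.5, 5.5, 7.0]

r = pearson(sentiment_by_country, happiness_by_country)
print(f"Pearson r = {r:.3f}")
```

A coefficient near 1 or -1 suggests a strong linear relationship and a value near 0 suggests none, although correlation alone does not establish causation.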

Capstone Project can be found here.

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Vineeth S - [email protected]

Project Link: https://github.com/vineeths96/Data-Engineering-Nanodegree