

An Example ETL Pipeline With Airflow

In this blog post I want to go over the core data engineering operations known as Extract, Transform, Load (ETL) and show how they can be automated and scheduled using Apache Airflow. You can see the source code for this project here.

Extracting data can be done in a multitude of ways, but one of the most common is to query a web API. If the query is successful, the API's server sends data back, often in the form of JSON. JSON is a semi-structured format: it can be thought of as a dictionary whose keys are strings and whose values are strings, numbers, or nested structures. Since the data does not arrive in the shape our database expects, we must transform it before storing or loading it into a database.

Airflow is a platform for scheduling and monitoring workflows. In this post I will show you how to use it to extract the daily weather in New York from the OpenWeatherMap API, convert the temperature to Celsius, and load the data into a simple PostgreSQL database.
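To make the shape of such a pipeline concrete, here is a minimal sketch of what the DAG might look like. This is not the post's actual code: the task functions, the staging-file path, the `weather` table and its columns, the connection settings, and the API key placeholder are all illustrative assumptions, and the `PythonOperator` import path shown matches Airflow 1.x (which is what a Python 2.7 setup implies; Airflow 2.x moved it).

```python
from datetime import datetime, timedelta
import json

import psycopg2
import requests
from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x path

STAGING_FILE = "/tmp/nyc_weather.json"  # illustrative staging location


def extract():
    # Query the OpenWeatherMap current-weather endpoint for New York.
    # In real code the appid would come from a config file, not be hard-coded.
    response = requests.get(
        "http://api.openweathermap.org/data/2.5/weather",
        params={"q": "New York", "appid": "YOUR_API_KEY"},
    )
    response.raise_for_status()
    with open(STAGING_FILE, "w") as f:
        json.dump(response.json(), f)


def transform():
    # OpenWeatherMap reports temperature in Kelvin by default;
    # convert it to Celsius before loading.
    with open(STAGING_FILE) as f:
        payload = json.load(f)
    payload["main"]["temp"] = payload["main"]["temp"] - 273.15
    with open(STAGING_FILE, "w") as f:
        json.dump(payload, f)


def load():
    # Insert one row per daily run into a pre-created `weather` table
    # (an assumed schema; see the setup sketch in the Requirements section).
    with open(STAGING_FILE) as f:
        payload = json.load(f)
    conn = psycopg2.connect(host="localhost", dbname="weather", user="airflow")
    cur = conn.cursor()
    cur.execute(
        "INSERT INTO weather (city, measured_at, temp_celsius) VALUES (%s, %s, %s)",
        (payload["name"],
         datetime.utcfromtimestamp(payload["dt"]),
         payload["main"]["temp"]),
    )
    conn.commit()
    conn.close()


default_args = {"owner": "airflow", "start_date": datetime(2017, 1, 1)}

dag = DAG("weather_etl", default_args=default_args,
          schedule_interval=timedelta(days=1))

extract_task = PythonOperator(task_id="extract", python_callable=extract, dag=dag)
transform_task = PythonOperator(task_id="transform", python_callable=transform, dag=dag)
load_task = PythonOperator(task_id="load", python_callable=load, dag=dag)

# Run the three steps in order: extract -> transform -> load.
extract_task.set_downstream(transform_task)
transform_task.set_downstream(load_task)
```

The tasks hand data to each other through a staging file here because Airflow runs each task as a separate process; passing small values through XCom would be the other common option.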

Requirements

Airflow

Python 2.7

PostgreSQL

psycopg2

SQLAlchemy

SQLAlchemy-Utils

To install the requirements (except for Python and PostgreSQL) type:

pip install -r requirements.txt
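The presence of SQLAlchemy and SQLAlchemy-Utils on this list suggests they are used to set up the target database before the first DAG run. Here is one plausible sketch using SQLAlchemy-Utils' `database_exists`/`create_database` helpers; the connection URL, model, and column names are assumptions chosen to match the DAG sketch above, not the post's actual schema.

```python
from sqlalchemy import Column, DateTime, Float, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy_utils import create_database, database_exists

DB_URL = "postgresql://airflow@localhost/weather"  # illustrative connection string

Base = declarative_base()


class Weather(Base):
    # One row per daily measurement loaded by the DAG.
    __tablename__ = "weather"
    id = Column(Integer, primary_key=True)
    city = Column(String)
    measured_at = Column(DateTime)
    temp_celsius = Column(Float)


engine = create_engine(DB_URL)
if not database_exists(engine.url):
    create_database(engine.url)   # SQLAlchemy-Utils helper
Base.metadata.create_all(engine)  # create the weather table if missing
```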

You can see the actual blog post here.