Open Options Chains
A Data Pipeline Solution for Collecting Options Data at Scale
This repository contains the code for my Airflow/PostgreSQL solution for managing a data pipeline that collects options data. I have written a series of posts detailing the setup:
- Database Design
- Foundational ETL Code
- Getting Started with Airflow
- Building the DAG
- Containerising the Pipeline
Pre-requisites
- Operating system: Linux (I'm using Ubuntu 20.04)
- Docker
- Docker Compose
Installation
If you have Git installed, clone the repository to your local machine.
git clone https://github.com/chrischow/open_options_chains.git
Otherwise, you may download the code and unzip the contents in a folder of your choice. Then, cd into that directory. You'll see:
- dags: Contains the DAGs you'll need (a generic sketch of a DAG follows this list)
- db: Contains a script to initialise the Postgres database
- logs: Contains logs from your Airflow instance
- pgdata: Contains data from Postgres
- scripts: Contains a script to initialise Airflow
- docker-compose.yml: File that defines the services, networks, and volumes for our app
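For orientation, here is a minimal sketch of what an Airflow DAG placed in the dags folder could look like. This is an illustrative assumption rather than the actual DAG shipped in this repository: the DAG ID, schedule, and callable are placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder extract step, e.g. a request to the TDA API (hypothetical)
def pull_options_chain():
    pass

with DAG(
    dag_id='example_options_dag',  # hypothetical DAG ID
    start_date=datetime(2022, 1, 1),
    schedule_interval='@daily',
    catchup=False,
) as dag:
    extract = PythonOperator(task_id='extract', python_callable=pull_options_chain)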
Then, change the required variables in the .env file. Note that if you edit the POSTGRES_* variables, you must update AIRFLOW__CORE__SQL_ALCHEMY_CONN to match, since it takes the form postgresql+psycopg2://<POSTGRES_USER>:<POSTGRES_PASSWORD>@postgres/<POSTGRES_DB>. Remember to input your TD Ameritrade (TDA) API key.
# Postgres
POSTGRES_USER=airflow
POSTGRES_PASSWORD=airflow
POSTGRES_DB=airflow
APP_DB_USER=openoptions
APP_DB_PASS=openoptions
APP_DB_NAME=optionsdata
# Airflow
APP_AIRFLOW_USERNAME=admin
APP_AIRFLOW_PASSWORD=password
AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres/airflow
AIRFLOW__CORE__EXECUTOR=LocalExecutor
# TDA API
API_KEY=<your TDA API key>
Initialisation
Next, we'll initialise Airflow with the command below. This will:
- Start a PostgreSQL container
- Initialise the Airflow Metadata Database and create an Airflow account
docker-compose up airflow-init
The airflow-init container will shut down after the initialisation is complete. The postgres container will continue running. You can stop it with docker-compose stop or docker stop <container id> if you don't need it.
Usage
Launch
Launch Airflow with the command below. This will:
- Start a PostgreSQL container if it's not already running
- Start the Airflow Scheduler container
- Start the Airflow Webserver container
docker-compose up
In a browser, navigate to http://localhost:8088 to access the Airflow UI. For your first login, use the APP_AIRFLOW_USERNAME and APP_AIRFLOW_PASSWORD credentials that you set in the .env file.
Shutdown
Use the following command:
docker-compose stop
Extracting Data
Export to CSV
You can export the data to CSV from the command line using the commands below:
# Check Postgres container ID
docker ps
# Go to folder to store files
cd <directory-to-store-csv-files>
# Extract CSV file
docker exec -t <first-few-characters-of-docker-container> psql -U airflow -d optionsdata -c "COPY table_name to STDOUT WITH CSV HEADER" > "filename.csv"
Connect to Postgres Instance
Alternatively, you can connect to the Postgres database running in the container. Simply use the database name and login credentials set in the .env file.
import psycopg2 as pg2
# Connect to database
conn = pg2.connect(host='localhost', database='optionsdata', user='airflow', password='airflow', port='5432')
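Once connected, you can query the data with a standard cursor. The sketch below is a minimal example; the table name spy is purely a hypothetical placeholder, so substitute a table that your DAGs actually populate.
import psycopg2 as pg2
from psycopg2 import sql

# Connect using the credentials set in the .env file
conn = pg2.connect(host='localhost', database='optionsdata', user='airflow', password='airflow', port='5432')

# Hypothetical table name -- replace with a table your DAGs populate
table_name = 'spy'

# Preview the first few rows
with conn.cursor() as cur:
    cur.execute(sql.SQL('SELECT * FROM {} LIMIT 5;').format(sql.Identifier(table_name)))
    for row in cur.fetchall():
        print(row)

conn.close()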
Full Reset - Use With Caution!
To reset the environment completely (this deletes the collected options data in pgdata along with the Airflow logs):
# Completely shut down the app
docker-compose down
# Remove contents of mounted volumes
rm -rf logs/*
rm -rf pgdata/*
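Because this wipes the Postgres data directory, you will likely need to re-run the initialisation step (docker-compose up airflow-init, as described above) before launching Airflow again.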