FootballDataEngineering
An end-to-end data engineering pipeline that fetches data from Wikipedia, cleans and transforms it with Apache Airflow and saves it on Azure Data Lake. Other processing takes place on Azure Data Facto...
Football Data Engineering
This Python-based project crawls data from Wikipedia using Apache Airflow, cleans it, and pushes it to Azure Data Lake for processing.
Table of Contents
- System Architecture
- Requirements
- Getting Started
- Running the Code With Docker
- How It Works
- Video
System Architecture
Requirements
- Python 3.9 (minimum)
- Docker
- PostgreSQL
- Apache Airflow 2.6 (minimum)
Getting Started
- Clone the repository.
git clone https://github.com/airscholar/FootballDataEngineering.git
- Install Python dependencies.
pip install -r requirements.txt
Running the Code With Docker
- Start your services on Docker with
docker compose up -d
- Trigger the DAG on the Airflow UI.
How It Works
- Fetches data from Wikipedia.
- Cleans the data.
- Transforms the data.
- Pushes the data to Azure Data Lake.
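The four steps above can be sketched in plain Python. This is a minimal illustration and not the project's actual code: the function names, the column names, and the sample rows are all hypothetical, the fetch step returns canned data instead of calling Wikipedia, and the push step is reduced to producing CSV text that an upload to Azure Data Lake could consume.

```python
import csv
import io
import re

# Hypothetical sample standing in for rows scraped from a Wikipedia table.
RAW_ROWS = [
    {"rank": "1", "stadium": "Example Stadium", "capacity": "100,000[1]", "country": " Exampleland "},
    {"rank": "2", "stadium": "Sample Arena", "capacity": "85,500", "country": "Samplestan"},
]


def fetch_rows():
    """Stand-in for the fetch step: returns copies of the canned rows."""
    return [dict(row) for row in RAW_ROWS]


def clean_text(value):
    """Remove footnote markers such as [1] and surrounding whitespace."""
    return re.sub(r"\[[^\]]*\]", "", value).strip()


def clean_rows(rows):
    """Cleaning step: apply clean_text to every field of every row."""
    return [{key: clean_text(value) for key, value in row.items()} for row in rows]


def transform_rows(rows):
    """Transform step: convert numeric columns from strings to integers."""
    return [
        {
            "rank": int(row["rank"]),
            "stadium": row["stadium"],
            "capacity": int(row["capacity"].replace(",", "")),
            "country": row["country"],
        }
        for row in rows
    ]


def to_csv(rows):
    """Serialize rows to CSV text, the payload a push step would upload."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def run_pipeline():
    """Chain fetch, clean, transform, and serialize in order."""
    return to_csv(transform_rows(clean_rows(fetch_rows())))


if __name__ == "__main__":
    print(run_pipeline())
```

In an Airflow deployment, each of these functions would typically become its own task so that a failure in one step can be retried without rerunning the others.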