apache_flink_and_iceberg icon indicating copy to clipboard operation
apache_flink_and_iceberg copied to clipboard

A complete data engineering project demonstrating modern data stack practices with Apache Flink, Iceberg, Trino and Superset

Modern Data Lakehouse with Real time CDC & Data Quality Monitoring

A complete data engineering project demonstrating modern data stack practices with Apache Flink, Iceberg, Trino and Superset.

Project Overview

This project showcases a modern data lakehouse architecture that captures, processes, stores, and monitors data in real-time. Built with industry standard tools and best practices, it demonstrates:

  • Real-time Change Data Capture (CDC) from mySQL.
  • Stream processing with Apache Flink
  • Modern data lake storage with Apache Iceberg format
  • High-performance analytics via Trino SQL engine
  • Automated data quality monitoring with validation using Soda
  • Interactive dashboards through Apache Superset

๐Ÿ—๏ธ Architecture

graph TB
    subgraph "Source Systems"
        MySQL[MySQL Database<br/>๐Ÿ“Š Products & Sales Data]
    end

    subgraph "Stream Processing"
        Flink[Apache Flink<br/>๐Ÿ”„ CDC & Stream Processing]
    end

    subgraph "Data Lake"
        Iceberg[Apache Iceberg<br/>๐Ÿ—‚๏ธ ACID Transactions]
        MinIO[MinIO Object Storage<br/>๐Ÿ’พ S3-Compatible Storage]
    end

    subgraph "Analytics & Quality"
        Trino[Trino SQL Engine<br/>โšก Fast Distributed Queries]
        Soda[Data Quality Monitor<br/>๐Ÿ” Automated Validation]
    end

    subgraph "Visualization"
        Superset[Apache Superset<br/>๐Ÿ“ˆ Interactive Dashboards]
    end

    MySQL -->|CDC Stream| Flink
    Flink -->|Write| Iceberg
    Iceberg -->|Store| MinIO
    Iceberg --> Trino
    Trino --> Superset
    Trino --> Soda

    style MySQL fill:#e1f5fe
    style Flink fill:#f3e5f5
    style Iceberg fill:#e8f5e8
    style Trino fill:#fff3e0
    style Soda fill:#fce4ec
    style Superset fill:#f1f8e9

Key Features

๐Ÿ”„ Real-time Data Pipeline

  • MySQL CDC Source: Captures all changes (INSERT, UPDATE, DELETE) in real-time
  • Apache Flink Streaming: Processes data streams with exactly-once guarantees
  • Apache Iceberg: Modern table format with ACID transactions and time travel
  • MinIO Storage: High-performance, S3-compatible object storage

๐Ÿ” Data Quality Assurance

  • Automated Quality Checks: 12+ comprehensive validation rules
  • Real-time Monitoring: Continuous data quality assessment every 5 minutes
  • Business Rule Validation: Custom checks for data integrity and business logic
  • Failure Detection: Immediate alerts for data quality issues

๐Ÿ“Š Analytics & Visualization

  • Trino SQL Engine: Fast, distributed queries across data lake
  • Apache Superset: Interactive dashboards and data exploration
  • Real-time Insights: Up-to-the-minute analytics on streaming data

๐Ÿš€ Quick Start

Prerequisites

  • Docker & Docker Compose
  • 8GB+ RAM recommended
  • 10GB+ free disk space

1. Launch the Stack

# Clone and start the complete data platform
git clone [email protected]:gordonmurray/apache_flink_and_iceberg.git
cd apache_flink_and_iceberg

# Start all services (takes 2-3 minutes)
docker compose up -d

# Verify all services are running
docker ps

2. Submit Flink CDC Jobs

# Submit the streaming CDC jobs
docker exec jobmanager /opt/flink/bin/sql-client.sh -f /opt/flink/job.sql
docker exec jobmanager /opt/flink/bin/sql-client.sh -f /opt/flink/products_streaming.sql
docker exec jobmanager /opt/flink/bin/sql-client.sh -f /opt/flink/sales_streaming.sql

3. Verify Data Quality

Soda can actively push alerts to channels such as Slack, for this project is just writes to a log.

# Check data quality monitoring results
docker logs soda --tail 30

# Expected output: "๐ŸŽ‰ All data quality checks PASSED!"

๐ŸŒ Access Points

Service URL Purpose
Flink Web UI http://localhost:8081 Monitor streaming jobs
MinIO Console http://localhost:9001 Browse data lake storage
Trino UI http://localhost:8080 Query execution monitoring
Superset http://localhost:8088 Data visualization dashboards

Default Credentials:

  • MinIO: minio / minio123
  • Superset: admin / admin

๐Ÿ“ˆ Demo Scenarios

Scenario 1: Real-time Data Ingestion

# Insert new sales data
docker exec mysql mysql -u root -prootpw -e "
  INSERT INTO appdb.sales (product_id, qty, price, sale_ts)
  VALUES (1, 5, 29.99, NOW());
"

# Watch data flow through pipeline (30-60 seconds)
docker logs soda --follow

# Query the new data via Trino
docker exec trino trino --execute "
  SELECT COUNT(*) as total_sales
  FROM iceberg.demo.sales;
"

Scenario 2: Data Quality Validation

# Insert invalid data (negative price)
docker exec mysql mysql -u root -prootpw -e "
  INSERT INTO appdb.sales (product_id, qty, price, sale_ts)
  VALUES (2, 3, -15.00, NOW());
"

# Watch data quality alerts trigger
docker logs soda --tail 50

# Expected: "โš ๏ธ X checks FAILED - Review data quality issues"

Scenario 3: Analytics & Insights

# Run analytics queries via Trino
docker exec trino trino --execute "
  SELECT
    p.name as product_name,
    COUNT(*) as sales_count,
    SUM(s.qty * s.price) as total_revenue
  FROM iceberg.demo.sales s
  JOIN iceberg.demo.products p ON s.product_id = p.id
  GROUP BY p.name
  ORDER BY total_revenue DESC;
"

๐Ÿ“Š Data Quality Monitoring

The data quality system validates:

Products Table

  • โœ… Data Completeness: No null IDs, SKUs, or names
  • โœ… Uniqueness: Duplicate detection for primary/business keys
  • โœ… Format Validation: SKU pattern compliance (P-XXX format)
  • โœ… Referential Integrity: Valid relationships maintained

Sales Table

  • โœ… Business Rules: Positive quantities and prices only
  • โœ… Data Freshness: Recent transactions within 24 hours
  • โœ… Outlier Detection: Reasonable quantity limits (โ‰ค100)
  • โœ… Complete Records: No missing required fields

๐Ÿ“‹ For detailed usage: See Soda_guide.md

๐Ÿ“ˆ Dashboards & Analytics

Apache Superset provides:

  • Sales Performance Dashboards: Revenue trends, top products
  • Data Quality Metrics: Real-time quality score tracking
  • Operational Monitoring: Pipeline health and throughput
  • Interactive Exploration: Ad-hoc data analysis

๐Ÿ“‹ For dashboard setup: See Superset_guide.md

๐Ÿ“ Project Structure

โ”œโ”€โ”€ docker-compose.yml          # Complete stack orchestration
โ”œโ”€โ”€ Dockerfile                  # Custom Flink image with CDC connectors
โ”œโ”€โ”€ README.md                   # This comprehensive guide
โ”œโ”€โ”€ Soda_guide.md              # Data quality monitoring guide
โ”œโ”€โ”€ Superset_guide.md          # Dashboard and visualization guide
โ”œโ”€โ”€ jobs/                      # Flink SQL job definitions
โ”‚   โ”œโ”€โ”€ job.sql                # Main CDC pipeline setup
โ”‚   โ”œโ”€โ”€ products_streaming.sql # Products CDC streaming job
โ”‚   โ””โ”€โ”€ sales_streaming.sql    # Sales CDC streaming job
โ”œโ”€โ”€ sql/                       # Database initialization
โ”‚   โ”œโ”€โ”€ init.sql              # Sample data and schema
โ”‚   โ””โ”€โ”€ mariadb.cnf           # MySQL configuration
โ”œโ”€โ”€ soda/                      # Data quality monitoring
โ”‚   โ”œโ”€โ”€ data_quality_monitor.py # Custom monitoring solution
โ”‚   โ”œโ”€โ”€ configuration.yml     # Trino connection settings
โ”‚   โ”œโ”€โ”€ checks.yml            # Quality check definitions
โ”‚   โ””โ”€โ”€ Dockerfile           # Monitoring container setup
โ”œโ”€โ”€ trino/                     # Trino configuration
โ”‚   โ”œโ”€โ”€ etc/catalog/          # Iceberg catalog configuration
โ”‚   โ””โ”€โ”€ lib/                  # MySQL connector JAR
โ”œโ”€โ”€ superset/                  # Superset configuration
โ”‚   โ”œโ”€โ”€ docker-entrypoint.sh # Initialization script
โ”‚   โ””โ”€โ”€ setup_dashboard.py   # Automated dashboard creation
โ””โ”€โ”€ minio/                     # MinIO setup
    โ””โ”€โ”€ create-bucket.sh      # Storage bucket initialization

๐Ÿงช Testing & Validation

End-to-End Pipeline Test

# 1. Insert test data
docker exec mysql mysql -u root -prootpw -e "
  INSERT INTO appdb.products (sku, name) VALUES ('P-999', 'Test Product');
  INSERT INTO appdb.sales (product_id, qty, price, sale_ts) VALUES (1, 10, 99.99, NOW());
"

# 2. Verify data pipeline (wait 60 seconds)
sleep 60

# 3. Check data quality
docker exec soda python data_quality_monitor.py

# 4. Query final results
docker exec trino trino --execute "
  SELECT COUNT(*) FROM iceberg.demo.products;
  SELECT COUNT(*) FROM iceberg.demo.sales;
"

Performance Benchmarks

  • Ingestion Rate: 1000+ records/second
  • Query Latency: Sub-second for most analytics
  • Data Freshness: 30-60 second end-to-end latency
  • Quality Validation: 12 checks in <5 seconds

๐Ÿ› ๏ธ Customization Guide

Adding New Data Sources

  1. Create CDC connector configuration in jobs/
  2. Define Iceberg target tables
  3. Add quality checks in soda/data_quality_monitor.py
  4. Update Superset dashboards