# Modern Data Lakehouse with Real-time CDC & Data Quality Monitoring

A complete data engineering project demonstrating modern data stack practices with Apache Flink, Iceberg, Trino and Superset.
## Project Overview
This project showcases a modern data lakehouse architecture that captures, processes, stores, and monitors data in real time. Built with industry-standard tools and best practices, it demonstrates:
- Real-time Change Data Capture (CDC) from MySQL
- Stream processing with Apache Flink
- Modern data lake storage with Apache Iceberg format
- High-performance analytics via Trino SQL engine
- Automated data quality monitoring with validation using Soda
- Interactive dashboards through Apache Superset
## Architecture
```mermaid
graph TB
    subgraph "Source Systems"
        MySQL[MySQL Database<br/>Products & Sales Data]
    end

    subgraph "Stream Processing"
        Flink[Apache Flink<br/>CDC & Stream Processing]
    end

    subgraph "Data Lake"
        Iceberg[Apache Iceberg<br/>ACID Transactions]
        MinIO[MinIO Object Storage<br/>S3-Compatible Storage]
    end

    subgraph "Analytics & Quality"
        Trino[Trino SQL Engine<br/>Fast Distributed Queries]
        Soda[Data Quality Monitor<br/>Automated Validation]
    end

    subgraph "Visualization"
        Superset[Apache Superset<br/>Interactive Dashboards]
    end

    MySQL -->|CDC Stream| Flink
    Flink -->|Write| Iceberg
    Iceberg -->|Store| MinIO
    Iceberg --> Trino
    Trino --> Superset
    Trino --> Soda

    style MySQL fill:#e1f5fe
    style Flink fill:#f3e5f5
    style Iceberg fill:#e8f5e8
    style Trino fill:#fff3e0
    style Soda fill:#fce4ec
    style Superset fill:#f1f8e9
```
## Key Features
### Real-time Data Pipeline
- MySQL CDC Source: Captures all changes (INSERT, UPDATE, DELETE) in real-time
- Apache Flink Streaming: Processes data streams with exactly-once guarantees
- Apache Iceberg: Modern table format with ACID transactions and time travel
- MinIO Storage: High-performance, S3-compatible object storage
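
The CDC-to-Iceberg flow above roughly corresponds to Flink SQL like the following. This is a simplified sketch, not the exact contents of the `jobs/` files: the column list, catalog name `iceberg_catalog`, and connection options are illustrative assumptions based on the demo's MySQL setup.

```sql
-- Sketch: MySQL CDC source table (connector options are illustrative)
CREATE TABLE sales_cdc (
    id BIGINT,
    product_id BIGINT,
    qty INT,
    price DECIMAL(10, 2),
    sale_ts TIMESTAMP(3),
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector' = 'mysql-cdc',
    'hostname' = 'mysql',
    'port' = '3306',
    'username' = 'root',
    'password' = 'rootpw',
    'database-name' = 'appdb',
    'table-name' = 'sales'
);

-- Sketch: continuously upsert the change stream into an Iceberg table
INSERT INTO iceberg_catalog.demo.sales
SELECT id, product_id, qty, price, sale_ts FROM sales_cdc;
```

The `INSERT INTO ... SELECT` runs as a long-lived streaming job, which is why the demo submits it once via the SQL client and then lets it run.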
### Data Quality Assurance
- Automated Quality Checks: 12+ comprehensive validation rules
- Real-time Monitoring: Continuous data quality assessment every 5 minutes
- Business Rule Validation: Custom checks for data integrity and business logic
- Failure Detection: Immediate alerts for data quality issues
### Analytics & Visualization
- Trino SQL Engine: Fast, distributed queries across data lake
- Apache Superset: Interactive dashboards and data exploration
- Real-time Insights: Up-to-the-minute analytics on streaming data
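
Because Iceberg snapshots every commit, Trino can also query historical versions of a table (time travel). A sketch, assuming the catalog layout used elsewhere in this README; the snapshot ID is a placeholder you would read from the `$snapshots` metadata table:

```sql
-- List the table's snapshots via Iceberg's metadata table
SELECT snapshot_id, committed_at
FROM iceberg.demo."sales$snapshots"
ORDER BY committed_at DESC;

-- Query the table as of an earlier snapshot (time travel)
SELECT COUNT(*) AS total_sales
FROM iceberg.demo.sales
FOR VERSION AS OF 123456789;  -- placeholder snapshot_id
```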
## Quick Start
### Prerequisites
- Docker & Docker Compose
- 8GB+ RAM recommended
- 10GB+ free disk space
### 1. Launch the Stack
```bash
# Clone and start the complete data platform
git clone git@github.com:gordonmurray/apache_flink_and_iceberg.git
cd apache_flink_and_iceberg

# Start all services (takes 2-3 minutes)
docker compose up -d

# Verify all services are running
docker ps
```
### 2. Submit Flink CDC Jobs
```bash
# Submit the streaming CDC jobs
docker exec jobmanager /opt/flink/bin/sql-client.sh -f /opt/flink/job.sql
docker exec jobmanager /opt/flink/bin/sql-client.sh -f /opt/flink/products_streaming.sql
docker exec jobmanager /opt/flink/bin/sql-client.sh -f /opt/flink/sales_streaming.sql
```
### 3. Verify Data Quality
Soda can push alerts to channels such as Slack; in this project it simply writes results to a log.
```bash
# Check data quality monitoring results
docker logs soda --tail 30

# Expected output: "All data quality checks PASSED!"
```
## Access Points
| Service | URL | Purpose |
|---|---|---|
| Flink Web UI | http://localhost:8081 | Monitor streaming jobs |
| MinIO Console | http://localhost:9001 | Browse data lake storage |
| Trino UI | http://localhost:8080 | Query execution monitoring |
| Superset | http://localhost:8088 | Data visualization dashboards |
**Default Credentials:**

- MinIO: `minio` / `minio123`
- Superset: `admin` / `admin`
## Demo Scenarios
### Scenario 1: Real-time Data Ingestion
```bash
# Insert new sales data
docker exec mysql mysql -u root -prootpw -e "
INSERT INTO appdb.sales (product_id, qty, price, sale_ts)
VALUES (1, 5, 29.99, NOW());
"

# Watch data flow through the pipeline (30-60 seconds)
docker logs soda --follow

# Query the new data via Trino
docker exec trino trino --execute "
SELECT COUNT(*) as total_sales
FROM iceberg.demo.sales;
"
```
### Scenario 2: Data Quality Validation
```bash
# Insert invalid data (negative price)
docker exec mysql mysql -u root -prootpw -e "
INSERT INTO appdb.sales (product_id, qty, price, sale_ts)
VALUES (2, 3, -15.00, NOW());
"

# Watch data quality alerts trigger
docker logs soda --tail 50

# Expected: "X checks FAILED - Review data quality issues"
```
### Scenario 3: Analytics & Insights
```bash
# Run analytics queries via Trino
docker exec trino trino --execute "
SELECT
    p.name as product_name,
    COUNT(*) as sales_count,
    SUM(s.qty * s.price) as total_revenue
FROM iceberg.demo.sales s
JOIN iceberg.demo.products p ON s.product_id = p.id
GROUP BY p.name
ORDER BY total_revenue DESC;
"
```
## Data Quality Monitoring
The data quality system validates:
### Products Table
- **Data Completeness**: No null IDs, SKUs, or names
- **Uniqueness**: Duplicate detection for primary/business keys
- **Format Validation**: SKU pattern compliance (P-XXX format)
- **Referential Integrity**: Valid relationships maintained
### Sales Table
- **Business Rules**: Positive quantities and prices only
- **Data Freshness**: Recent transactions within 24 hours
- **Outlier Detection**: Reasonable quantity limits (≤100)
- **Complete Records**: No missing required fields
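
Checks like these boil down to simple SQL that the monitor can run through Trino. For example, the "positive quantities and prices" and freshness rules might correspond to queries like the following (illustrative sketches, not the exact queries defined in `soda/checks.yml`):

```sql
-- Business rule: rows violating "positive qty and price" should be zero
SELECT COUNT(*) AS failed_rows
FROM iceberg.demo.sales
WHERE qty <= 0 OR price <= 0;

-- Freshness: the newest sale should fall within the last 24 hours
SELECT max(sale_ts) >= current_timestamp - INTERVAL '24' HOUR AS is_fresh
FROM iceberg.demo.sales;
```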
For detailed usage, see [Soda_guide.md](Soda_guide.md).
## Dashboards & Analytics
Apache Superset provides:
- Sales Performance Dashboards: Revenue trends, top products
- Data Quality Metrics: Real-time quality score tracking
- Operational Monitoring: Pipeline health and throughput
- Interactive Exploration: Ad-hoc data analysis
For dashboard setup, see [Superset_guide.md](Superset_guide.md).
## Project Structure
```
├── docker-compose.yml          # Complete stack orchestration
├── Dockerfile                  # Custom Flink image with CDC connectors
├── README.md                   # This comprehensive guide
├── Soda_guide.md               # Data quality monitoring guide
├── Superset_guide.md           # Dashboard and visualization guide
├── jobs/                       # Flink SQL job definitions
│   ├── job.sql                 # Main CDC pipeline setup
│   ├── products_streaming.sql  # Products CDC streaming job
│   └── sales_streaming.sql     # Sales CDC streaming job
├── sql/                        # Database initialization
│   ├── init.sql                # Sample data and schema
│   └── mariadb.cnf             # MySQL configuration
├── soda/                       # Data quality monitoring
│   ├── data_quality_monitor.py # Custom monitoring solution
│   ├── configuration.yml       # Trino connection settings
│   ├── checks.yml              # Quality check definitions
│   └── Dockerfile              # Monitoring container setup
├── trino/                      # Trino configuration
│   ├── etc/catalog/            # Iceberg catalog configuration
│   └── lib/                    # MySQL connector JAR
├── superset/                   # Superset configuration
│   ├── docker-entrypoint.sh    # Initialization script
│   └── setup_dashboard.py      # Automated dashboard creation
└── minio/                      # MinIO setup
    └── create-bucket.sh        # Storage bucket initialization
```
## Testing & Validation
### End-to-End Pipeline Test
```bash
# 1. Insert test data
docker exec mysql mysql -u root -prootpw -e "
INSERT INTO appdb.products (sku, name) VALUES ('P-999', 'Test Product');
INSERT INTO appdb.sales (product_id, qty, price, sale_ts) VALUES (1, 10, 99.99, NOW());
"

# 2. Wait for the data pipeline to catch up
sleep 60

# 3. Check data quality
docker exec soda python data_quality_monitor.py

# 4. Query final results
docker exec trino trino --execute "
SELECT COUNT(*) FROM iceberg.demo.products;
SELECT COUNT(*) FROM iceberg.demo.sales;
"
```
### Performance Benchmarks
- Ingestion Rate: 1000+ records/second
- Query Latency: Sub-second for most analytics
- Data Freshness: 30-60 second end-to-end latency
- Quality Validation: 12 checks in <5 seconds
## Customization Guide
### Adding New Data Sources
1. Create a CDC connector configuration in `jobs/`
2. Define Iceberg target tables
3. Add quality checks in `soda/data_quality_monitor.py`
4. Update Superset dashboards
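
For step 2, an Iceberg target table can be created directly through Trino. A minimal sketch for a hypothetical `customers` source (the table name, columns, and format are illustrative assumptions, not part of the demo schema):

```sql
-- Hypothetical target table for a new CDC source
CREATE TABLE iceberg.demo.customers (
    id BIGINT,
    email VARCHAR,
    created_at TIMESTAMP(6)
) WITH (format = 'PARQUET');
```

The matching CDC source and streaming `INSERT` would then follow the same pattern as the existing `products_streaming.sql` and `sales_streaming.sql` jobs.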