DataPulse
DataPulse is a platform for developers to build, schedule and monitor data pipelines.
PawMark: Platform For Big Data & AI
Summary
PawMark is a platform for big data and AI. It is based on Apache Spark and Kubernetes. The platform is designed to be scalable and easy to use. It provides a set of tools for data processing, machine learning, and data visualization.
Setup
Docker Compose
- Start docker-compose: `docker-compose up -d`
- Access platform UI
  - http://localhost:5001
- Use notebook
  - Access http://localhost:8888
  - Spark session is automatically created
  - Run `spark` in a cell to check the Spark session
  - Run the following code in the notebook to test the Spark session: `spark.range(0, 5).write.format("delta").mode("overwrite").saveAsTable("test")`
- Check the history server
  - Access http://localhost:18080
  - Spark application history and progress can be viewed here
- Delta tables
  - Use `/opt/data/delta-table/` as the root directory for Delta tables
- Schedule with Airflow
  - Access http://localhost:8090
  - Use the default username and password to log in
  - Create a new DAG to schedule the Spark job
  - Or use the example DAGs in the `./dags` folder
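For the "create a new DAG" step, a minimal sketch of a DAG file might look like the following. It is an illustration under assumptions, not one of the repository's example DAGs: it assumes `spark-submit` is available on the Airflow worker's `PATH`, and the job script path, `dag_id`, and daily schedule are hypothetical placeholders.

```python
import shlex
from datetime import datetime

# Hypothetical job script path; point this at a script that exists in your setup.
JOB_SCRIPT = "/opt/airflow/dags/jobs/example_job.py"


def spark_submit_command(script: str) -> str:
    """Build the shell command the task runs (assumes spark-submit is on PATH)."""
    return f"spark-submit {shlex.quote(script)}"


def build_dag():
    """Define a one-task DAG that submits the Spark job once a day."""
    # Imported lazily so the helper above can be used without Airflow installed.
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="example_spark_job",       # hypothetical name
        start_date=datetime(2024, 1, 1),
        schedule="@daily",                # illustrative schedule
        catchup=False,                    # do not backfill past runs
    ) as dag:
        BashOperator(
            task_id="run_spark_job",
            bash_command=spark_submit_command(JOB_SCRIPT),
        )
    return dag


# Airflow discovers DAG objects defined at module level.
try:
    dag = build_dag()
except ImportError:
    dag = None  # Airflow is not installed in this environment
```

Saved into the `./dags` folder, a file like this should show up in the Airflow UI after the scheduler's next scan, provided the assumptions above hold for your containers.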
MiniKube
- TODO
Examples
Basic Analysis on Static Tables
- Singapore Resale Flat Prices Analysis
  - Notebook
  - Data Source
Incremental Pipeline
- TODO
Docker Images
Notebook
- Dockerfile
- Includes
  - Jupyter Notebook
  - Spark
  - Google Cloud SDK
  - GCS Connector
  - PySpark Startup Script
  - Notebook Save Hook Function
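The image's PySpark startup script is not reproduced in this README. As a rough sketch of how a startup script can create a Delta-enabled Spark session automatically, the following assumes that `pyspark` and the Delta Lake jars are already installed in the image. The function name `create_session` and the application name are hypothetical; the two configuration keys are the ones the Delta Lake documentation lists for enabling Delta in a session.

```python
# Sketch of a PySpark startup script (an assumption, not the project's actual script).

# Settings that enable Delta Lake SQL support and the Delta catalog.
DELTA_CONF = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}


def create_session(app_name="notebook"):
    """Create (or reuse) a SparkSession with the Delta settings applied."""
    # Imported lazily so this file can be inspected without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in DELTA_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

A startup file built this way would typically end with `spark = create_session()`, so that every notebook cell can use `spark` directly.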
Versions
| Component | Version |
|---|---|
| Scala | 2.12 |
| Java | 17 |
| Python | 3.11 |
| IPython | 8.16.1 |
| Apache Spark | 3.5.0 |
| Delta Lake | 3.0.0 |
| Airflow | 2.9.1 |
| Postgres | 13 |
| React | 18.3.1 |
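To confirm that a running notebook matches the versions in the table, a small check along these lines could be run in a cell. It is only a sketch: the helper names `matches` and `check_versions` are hypothetical, and only the Python and Spark rows are checked.

```python
import sys

# Expected versions, copied from the table above.
EXPECTED = {"python": "3.11", "spark": "3.5.0"}


def matches(actual: str, expected: str) -> bool:
    """True if `actual` equals `expected` or is a more specific release of it."""
    return actual == expected or actual.startswith(expected + ".")


def check_versions(spark=None) -> dict:
    """Return {component: bool}; Spark is checked only when a session is passed."""
    info = sys.version_info
    python_version = f"{info.major}.{info.minor}.{info.micro}"
    result = {"python": matches(python_version, EXPECTED["python"])}
    if spark is not None:
        # SparkSession.version is the Spark release string, e.g. "3.5.0".
        result["spark"] = matches(spark.version, EXPECTED["spark"])
    return result
```

In the notebook, `check_versions(spark)` returns a dictionary whose values are `True` for each component that matches the table.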
License
This project is licensed under the terms of the Apache-2.0 license.