PawMark: Platform For Big Data & AI

Summary

PawMark is a big data and AI platform built on Apache Spark and Kubernetes. It is designed to be scalable and easy to use, and it provides tools for data processing, machine learning, and data visualization.

Architecture

Setup

Docker Compose

Start docker-compose
```
docker-compose up -d
```
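Once the containers are up, a quick reachability check can confirm that each service is listening. The sketch below is only an illustration: the port numbers come from the URLs in this section, while the names `SERVICES`, `is_port_open`, and `check_services` are hypothetical helpers, not part of PawMark.

```python
import socket

# Ports taken from the service URLs listed in this README.
SERVICES = {
    "Platform UI": 5001,
    "Notebook": 8888,
    "History Server": 18080,
    "Airflow": 8090,
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def check_services(host: str = "localhost") -> dict:
    """Map each service name to whether its port accepts connections."""
    return {name: is_port_open(host, port) for name, port in SERVICES.items()}
```

Calling `check_services()` returns a dict that maps each service name to `True` or `False`. An open port only shows that something is listening, not that the service has finished starting.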
Access platform UI
- http://localhost:5001
Use the notebook
- Access http://localhost:8888
- A Spark session is created automatically
  - Run spark in a cell to inspect the session
- Run the following code in the notebook to test the Spark session by writing a small Delta table
```
spark.range(0, 5) \
  .write.format("delta").mode("overwrite").saveAsTable("test")
```
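To confirm that the write succeeded, the table can be read back in the next cell. This is a minimal sketch that assumes the `spark` session provided by the notebook; `read_test_ids` is a hypothetical helper name used only for illustration.

```python
def read_test_ids(spark, table_name="test"):
    """Return the sorted id values stored in the given table."""
    # spark.range() produces a single column named "id".
    rows = spark.table(table_name).collect()
    return sorted(row["id"] for row in rows)


# In a notebook cell, where `spark` already exists:
#   read_test_ids(spark)  # expected: [0, 1, 2, 3, 4]
```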
Check the history server
- Access http://localhost:18080
- View the history and progress of Spark applications here
Delta tables
- Use /opt/data/delta-table/ as the root directory for Delta tables
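As an illustration of path-based Delta tables under that root, the sketch below builds a table path and writes to it. The helper names `delta_table_path` and `write_range_table` and the table name `demo` are assumptions made for this example; only the root directory comes from this README.

```python
import posixpath

# Root directory for Delta tables, as stated in this README.
DELTA_ROOT = "/opt/data/delta-table/"


def delta_table_path(name: str) -> str:
    """Return the storage path of a Delta table under DELTA_ROOT."""
    if not name or "/" in name or name in (".", ".."):
        raise ValueError(f"invalid table name: {name!r}")
    return posixpath.join(DELTA_ROOT, name)


def write_range_table(spark, name: str, n: int = 5) -> str:
    """Write ids 0..n-1 as a Delta table at the derived path and return it."""
    path = delta_table_path(name)
    spark.range(0, n).write.format("delta").mode("overwrite").save(path)
    return path


# In a notebook cell, where `spark` already exists:
#   path = write_range_table(spark, "demo")
#   spark.read.format("delta").load(path).show()
```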
Schedule with Airflow
- Access http://localhost:8090
- Use the default username and password to log in
- Create a new DAG to schedule the Spark job
- Or use the example DAGs in the ./dags folder
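A new DAG for a Spark job might look roughly like the sketch below. It is not taken from the project's example DAGs: the DAG id `example_spark_job`, the application path in `APP_PATH`, and the assumption that `spark-submit` is on the PATH inside the Airflow container are all hypothetical and should be adapted.

```python
import importlib.util
import shlex
from datetime import datetime

# Hypothetical location of the Spark application inside the Airflow container.
APP_PATH = "/opt/airflow/dags/example_job.py"


def spark_submit_command(app_path: str, master: str = "local[*]") -> str:
    """Build a shell-safe spark-submit command line."""
    return f"spark-submit --master {shlex.quote(master)} {shlex.quote(app_path)}"


def build_dag():
    """Create a daily DAG with a single task that submits the Spark job."""
    # Imported lazily so the helper above works without Airflow installed.
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="example_spark_job",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,  # do not backfill runs before today
    ) as dag:
        BashOperator(
            task_id="spark_submit",
            bash_command=spark_submit_command(APP_PATH),
        )
    return dag


# Airflow discovers DAG objects defined at module level.
if importlib.util.find_spec("airflow") is not None:
    dag = build_dag()
```

Saved as a file in the ./dags folder, this should appear in the Airflow UI after the next DAG scan. `BashOperator` is used here for simplicity; a dedicated Spark operator from a provider package is an alternative if one is installed.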

MiniKube

TODO

Examples

Basic Analysis on Static Tables

Singapore Resale Flat Prices Analysis
- Notebook
- Data Source

Incremental Pipeline

TODO

Docker Images

WebApp

Dockerfile

Server

Dockerfile

Spark

Dockerfile
Includes
- Spark
- Python

Notebook

Dockerfile
Includes
- Jupyter Notebook
- Spark
- Google Cloud SDK
- GCS Connector
- PySpark Startup Script
- Notebook Save Hook Function

History Server

Dockerfile
Includes
- Spark
- GCS Connector

Airflow

Dockerfile
Includes
- Python
- Java
- pyspark

Versions

| Component | Version |
| --- | --- |
| Scala | 2.12 |
| Java | 17 |
| Python | 3.11 |
| IPython | 8.16.1 |
| Apache Spark | 3.5.0 |
| Delta Lake | 3.0.0 |
| Airflow | 2.9.1 |
| Postgres | 13 |
| React | 18.3.1 |

License

This project is licensed under the terms of the Apache-2.0 license.
