Build & learn Data Engineering and Machine Learning over Kubernetes. No-shortcut approach.
# Data & Machine Learning - The Boring Way
This tutorial walks you through setting up and building a Data Engineering & Machine Learning platform. It is designed to explore many different technologies for similar problems, without bias toward any one of them.
**This is not a production-ready setup.**
## Target Audience
Data Engineers, Machine Learning Engineers, Data Scientists, SREs, Infrastructure Engineers, Data Analysts, Data Analytics Engineers
## Expected Technologies & Workflow
### Data Engineering & Analytics
- [X] Kubernetes Kind Installation link
- [X] MinIO Integrate object storage on top of Kubernetes and use the MinIO interface to simulate S3 link
- [X] Apache Airflow on top of Kubernetes & running an end-to-end Airflow workflow using the Kubernetes Executor link
- [X] Apache Spark Deploy Apache Spark on Kubernetes and run an example link
- [ ] Prefect Setup & Running an end to end Workflow
- [ ] Dagster Setup & Running an end to end Workflow
- [ ] Set up an ETL job running end-to-end on Apache Airflow, using the Spark and Python operators
- [ ] Apache Hive Setting up Hive & Hive Metastore
- [ ] Deploy Trino & open-source Presto and run data analytics queries.
- [ ] Integrate Superset & Metabase for visualization. Integrate Presto with the visualization system.
- [ ] Open Table Format using Delta Lake
- [ ] Open Table Format using Apache Iceberg
- [ ] Open Table Format using Apache Hudi
- [ ] Metadata Management using Amundsen
- [ ] Metadata Management using Datahub
- [ ] Setting up Apache Kafka distributed event streaming platform
- [ ] Using Spark Structured Streaming to run an end-to-end pipeline over any real-time data source
- [ ] Using Apache Flink to run an end-to-end pipeline over any real-time data source
- [ ] Redpanda, a streaming data platform, to run a similar workflow
- [ ] Airbyte Data Integration platform
- [ ] Talend UI-based data integration
- [ ] DBT SQL pipelines to compare with Spark and other technologies
- [ ] Debezium Change Data Capture using Debezium to sync multiple databases
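The ETL item above (Spark & Python operators on Airflow) ultimately reduces to an extract-transform-load step. As a dependency-free sketch of what the Python-operator task might contain, here is a toy transform (the `user`/`amount` schema is a made-up example, not part of the tutorial) that can be unit-tested before it is wrapped in a DAG or ported to Spark:

```python
import csv
import io


def transform(raw_csv: str) -> list[dict]:
    """Toy ETL transform: parse CSV, drop malformed rows, derive a field.

    In the real pipeline this logic would run inside an Airflow task
    (or be rewritten as a Spark job); the schema here is illustrative.
    """
    rows = []
    for rec in csv.DictReader(io.StringIO(raw_csv)):
        try:
            amount = float(rec["amount"])
        except (KeyError, ValueError):
            continue  # skip bad records instead of failing the whole job
        rows.append({"user": rec["user"], "amount": amount,
                     "is_large": amount >= 100.0})
    return rows


raw = "user,amount\nalice,42.5\nbob,oops\ncarol,150\n"
print(transform(raw))
```

Keeping the transform as a pure function like this makes it trivial to test locally before any orchestrator is involved.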
### Monitoring & Observability
- [ ] Grafana Setting up Grafana for monitoring components, starting with monitoring pods
- [ ] FluentD Collecting logs and metrics from pods & feeding them into the monitoring layer
- [ ] Setting up a full monitoring and alerting platform & integrating monitoring across the other technologies
- [ ] Setting up an Observability system
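On the application side, the FluentD item mostly comes down to emitting structured log lines that a collector can tail and forward. A minimal stdlib-only sketch (the field names are illustrative, not a FluentD requirement):

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line, the shape that
    log collectors such as FluentD typically tail and forward."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("etl")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("job finished, %d rows written", 2)
```

One JSON object per line is easy for a FluentD parser to consume, and extra context (pod name, job id) can be added as further keys.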
### Machine Learning
- [ ] Setup Ray for Data Transformations
- [ ] Use Scikit-learn for an example ML training
- [ ] Setup Argo Pipeline for deploying ML Jobs
- [ ] Setup Flyte Orchestrator for Pythonic deployment
- [ ] Use PyTorch Lightning for running ML training
- [ ] Use Tensorflow for running ML training
- [ ] Setup an ML end-to-end workflow on Flyte
- [ ] Deploy MLFlow for ML Model Tracking & Experimentation
- [ ] Deploy BentoML For deploying ML Model
- [ ] Deploy Seldon Core for ML Model Management
- [ ] Integrate MLflow with Seldon Core
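Before reaching for scikit-learn or PyTorch, the core of "an example ML training" can be sketched with no dependencies at all: a closed-form least-squares fit of y ≈ a + b·x. The data below is synthetic; in the tutorial itself the same step would use something like `sklearn.linear_model.LinearRegression`:

```python
def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least squares for y = a + b*x, via the closed-form solution."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance(x, y) / variance(x)
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b


# Synthetic data generated from y = 1 + 2x, so the fit should recover it.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
a, b = fit_line(xs, ys)
print(a, b)  # → close to 1.0 and 2.0
```

The point is the workflow shape (data in, parameters out, a metric to check), which stays the same when the model is swapped for scikit-learn or PyTorch Lightning and tracked in MLflow.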
## Prerequisites
- 🐳 Docker Installed
- kubectl Installed; the Kubernetes command-line tool, kubectl, allows you to run commands against Kubernetes clusters
- Lens Installed, a UI for Kubernetes. This is optional; kubectl is enough for getting all relevant stats from the Kubernetes cluster
- Helm Installed, the package manager for Kubernetes
## Lab Basic Setup
- Setting Up Kind
- Deleting older Pods PodCleaner