Build & learn Data Engineering and Machine Learning over Kubernetes. No-shortcut approach.
# Data & Machine Learning - The Boring Way
This tutorial walks you through setting up and building a Data Engineering & Machine Learning platform. It is designed to explore many different technologies for similar problems, without bias toward any one of them.
**This is not a production-ready setup.**
## Target Audience
Data Engineers, Machine Learning Engineers, Data Scientists, SREs, Infrastructure Engineers, Data Analysts, Data Analytics Engineers
## Expected Technologies & Workflow
### Data Engineering & Analytics
- [X] Kubernetes Kind Installation link
- [X] MinIO Integrate object storage on top of Kubernetes and use the MinIO interface to simulate S3 link
- [X] Apache Airflow on top of Kubernetes & running an end-to-end Airflow workflow using the Kubernetes Executor link
- [X] Apache Spark Deploy Apache Spark on Kubernetes and run an example link
- [ ] Prefect Setup & Running an end to end Workflow
- [ ] Dagster Setup & Running an end to end Workflow
- [ ] Set up an ETL job running end-to-end on Apache Airflow, using the Spark and Python operators
- [ ] Apache Hive Setting up Hive & Hive Metastore
- [ ] Deploy Trino & open-source Presto and run data analytics queries.
- [ ] Integrate Superset & Metabase for visualization. Integrate Presto with the visualization system.
- [ ] Open Table Format using Delta Lake
- [ ] Open Table Format using Apache Iceberg
- [ ] Open Table Format using Apache Hudi
- [ ] Metadata Management using Amundsen
- [ ] Metadata Management using Datahub
- [ ] Setting up Apache Kafka distributed event streaming platform
- [ ] Using Spark Structured Streaming to run an end-to-end pipeline over any real-time data source
- [ ] Using Apache Flink to run an end-to-end pipeline over any real-time data source
- [ ] Redpanda, a streaming data platform, to run a similar workflow
- [ ] Airbyte Data Integration platform
- [ ] Talend UI-based data integration
- [ ] DBT SQL pipelines to compare with Spark and other technologies
- [ ] Debezium Change Data Capture using Debezium to sync multiple databases
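The ETL item above (Spark & Python operators on Airflow) ultimately reduces to an extract-transform-load step. As a dependency-free sketch of what the Python-operator task might contain, here is a toy transform (the `user`/`amount` schema is a made-up example, not part of the tutorial) that can be unit-tested before it is wrapped in a DAG or ported to Spark:

```python
import csv
import io


def transform(raw_csv: str) -> list[dict]:
    """Toy ETL transform: parse CSV, drop malformed rows, derive a field.

    In the real pipeline this logic would run inside an Airflow task
    (or be rewritten as a Spark job); the schema here is illustrative.
    """
    rows = []
    for rec in csv.DictReader(io.StringIO(raw_csv)):
        try:
            amount = float(rec["amount"])
        except (KeyError, ValueError):
            continue  # skip bad records instead of failing the whole job
        rows.append({"user": rec["user"], "amount": amount,
                     "is_large": amount >= 100.0})
    return rows


raw = "user,amount\nalice,42.5\nbob,oops\ncarol,150\n"
print(transform(raw))
```

Keeping the transform as a pure function like this makes it trivial to test locally before any orchestrator is involved.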
### Monitoring & Observability
- [ ] Grafana Setting up Grafana for monitoring components, starting with monitoring pods
- [ ] FluentD Collecting logs and metrics from pods & feeding them into the monitoring layer
- [ ] Setting up a full monitoring and alerting platform & integrating monitoring across the other technologies
- [ ] Setting up an Observability system
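On the application side, the FluentD item mostly comes down to emitting structured log lines that a collector can tail and forward. A minimal stdlib-only sketch (the field names are illustrative, not a FluentD requirement):

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line, the shape that
    log collectors such as FluentD typically tail and forward."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("etl")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("job finished, %d rows written", 2)
```

One JSON object per line is easy for a FluentD parser to consume, and extra context (pod name, job id) can be added as further keys.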
### Machine Learning
- [ ] Setup Ray for Data Transformations
- [ ] Use Scikit-learn for an example ML training
- [ ] Setup Argo Pipeline for deploying ML Jobs
- [ ] Setup Flyte Orchestrator for Pythonic deployment
- [ ] Use PyTorch Lightning for running ML training
- [ ] Use Tensorflow for running ML training
- [ ] Setup an ML end-to-end workflow on Flyte
- [ ] Deploy MLFlow for ML Model Tracking & Experimentation
- [ ] Deploy BentoML For deploying ML Model
- [ ] Deploy Seldon Core for ML Model Management
- [ ] Integrate MLflow with Seldon Core
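Before reaching for scikit-learn or PyTorch, the core of "an example ML training" can be sketched with no dependencies at all: a closed-form least-squares fit of y ≈ a + b·x. The data below is synthetic; in the tutorial itself the same step would use something like `sklearn.linear_model.LinearRegression`:

```python
def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least squares for y = a + b*x, via the closed-form solution."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance(x, y) / variance(x)
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b


# Synthetic data generated from y = 1 + 2x, so the fit should recover it.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
a, b = fit_line(xs, ys)
print(a, b)  # → close to 1.0 and 2.0
```

The point is the workflow shape (data in, parameters out, a metric to check), which stays the same when the model is swapped for scikit-learn or PyTorch Lightning and tracked in MLflow.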
## Prerequisites
- 🐳 Docker Installed
- kubectl Installed; the Kubernetes command-line tool, kubectl, allows you to run commands against Kubernetes clusters
- Lens Installed, a UI for Kubernetes. This is optional; kubectl is enough for getting all relevant stats from the Kubernetes cluster
- Helm Installed, the package manager for Kubernetes
## Lab Basic Setup
- Setting Up Kind
- Deleting older Pods PodCleaner