MS-Build 2020: Building an End-to-End ML Pipeline for Big Data
This repo holds information and resources to help you recreate the demo from the Microsoft Build 2020 session: Building an End-to-End Machine Learning Pipeline for Big Data.
Prerequisites:
- Azure account
- Azure Event Hubs
- Azure Databricks
- Azure Machine Learning
- Azure Key Vault
- Kubernetes environment or Azure Container Instances
Data Flow
- Ingest streaming data into Azure Blob Storage with Azure Event Hubs and Azure Databricks (see the ingest sketch after this list).
- Preprocess the data to fit the target schema with Apache Spark.
- Save the data in Parquet format in the raw storage directory.
- Merge batch (historical) and streaming (new) data with Apache Spark and save the result in the preprocessed storage directory.
- Create multiple Azure Machine Learning (AML) datasets from the Azure Databricks environment and save them in the refined storage directory.
- Use an Azure Machine Learning compute cluster to run multiple experiments on the AML datasets from VS Code.
- Log ML models and algorithm parameters with MLflow (see the tracking sketch after this list).
- Serve the chosen ML model through a Dockerized REST API service on Kubernetes.
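As a rough sketch of the ingest and raw-save steps, the snippet below reads the Event Hubs stream with Spark Structured Streaming on Databricks and lands it as Parquet in the raw directory. It assumes the azure-eventhubs-spark connector is installed on the cluster and relies on the Databricks notebook built-ins `spark`, `sc`, and `dbutils`; the secret scope, secret name, and storage paths are placeholders.

```python
# Minimal sketch of the ingest step (placeholder secret scope/name and paths).
from pyspark.sql.functions import col

# Read the Event Hubs connection string from a Key Vault-backed secret scope.
connection_string = dbutils.secrets.get(scope="demo-scope", key="eventhubs-connection-string")

eh_conf = {
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connection_string)
}

# Stream events from Event Hubs; the payload arrives in the binary `body` column.
raw_stream = (
    spark.readStream
    .format("eventhubs")
    .options(**eh_conf)
    .load()
    .withColumn("body", col("body").cast("string"))
)

# Persist the raw stream as Parquet in the raw storage directory.
(
    raw_stream.writeStream
    .format("parquet")
    .option("path", "/mnt/datalake/raw/events")
    .option("checkpointLocation", "/mnt/datalake/raw/_checkpoints/events")
    .start()
)
```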
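For the experiment-tracking step, here is a minimal sketch of pointing MLflow at the AML workspace and logging parameters, metrics, and a model. The experiment name, model, and data are illustrative, and it assumes azureml-core and azureml-mlflow are installed with a config.json for the workspace available locally.

```python
# Minimal sketch of MLflow tracking against Azure ML (illustrative experiment,
# parameters, and placeholder data).
import mlflow
import mlflow.sklearn
import numpy as np
from azureml.core import Workspace
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Point MLflow at the AML workspace so runs appear in Azure ML Studio.
ws = Workspace.from_config()
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())
mlflow.set_experiment("build2020-e2e-ml-demo")

# Placeholder data; in the demo this would come from an AML dataset.
X, y = np.random.rand(500, 4), np.random.rand(500)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    params = {"n_estimators": 100, "max_depth": 8}
    model = RandomForestRegressor(**params).fit(X_train, y_train)

    # Log algorithm parameters, an evaluation metric, and the trained model.
    rmse = mean_squared_error(y_test, model.predict(X_test), squared=False)
    mlflow.log_params(params)
    mlflow.log_metric("rmse", rmse)
    mlflow.sklearn.log_model(model, artifact_path="model")
```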
Tutorials:
- Ingest data with Azure Blob Storage and Event Hubs.
- Collect, analyze, and process streaming data with Azure Databricks and Event Hubs.
- Track and log ML metrics with MLflow and AML.
- Log and deploy your ML models to a Kubernetes environment (a sketch of the scoring service follows this list).
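The deployment tutorial ends with a container image that exposes the chosen model over REST. Below is a minimal sketch of such a scoring service using Flask and `mlflow.pyfunc`; the model path, route, and input schema are placeholders and would match whatever was logged in the tracking step.

```python
# Minimal sketch of the scoring service that gets Dockerized and deployed to
# Kubernetes (model path and input schema are placeholders).
import mlflow.pyfunc
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the MLflow model baked into the container image at build time.
model = mlflow.pyfunc.load_model("/app/model")

@app.route("/score", methods=["POST"])
def score():
    # Expect a JSON list of records, e.g. [{"feature_a": 1.0, ...}, ...]
    records = request.get_json()
    predictions = model.predict(pd.DataFrame(records))
    return jsonify(predictions.tolist())

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```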
Q&A
If you have questions or concerns, or would just like to chat, contact us: