ms-build-e2e-ml-bigdata

This repository contains tutorials and resources for reproducing the Microsoft Build 2020 session "Building an End-to-End ML Pipeline for Big Data".

MS-Build 2020: Building an End-to-End ML Pipeline for Big Data

This repo holds the information and resources you need to recreate the demo from the Microsoft Build 2020 session on building end-to-end machine learning pipelines for big data.

Prerequisites:

  1. Azure account
  2. Azure Event Hubs
  3. Azure Databricks
  4. Azure Machine Learning
  5. Azure Key Vault
  6. Kubernetes environment or Azure Container Instances

Data Flow

  1. Ingest streaming data into Azure Blob storage with Azure Event Hubs and Azure Databricks.
  2. Preprocess the data with Apache Spark to fit our schema.
  3. Save the data in Parquet format in the raw storage directory.
  4. Merge batch (historical) and stream (new) data with Apache Spark and save the result in the preprocessed storage directory.
  5. Create multiple Azure ML (AML) Datasets from the Azure Databricks environment and save them in the refined storage directory.
  6. Use an Azure Machine Learning compute cluster to run multiple experiments on the AML Datasets from VS Code.
  7. Log ML models and ML algorithm parameters using MLflow.
  8. Serve the chosen ML model through a Dockerized REST API service on Kubernetes.
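
Steps 3 and 4 above can be sketched roughly as follows. This is a minimal illustration, not the session's actual code. The storage root, the dataset names `batch`, `stream` and `merged`, the `event_id` key column, and the helpers `StorageLayout` and `merge_batch_and_stream` are all assumptions made for the example. Only the three stage directories (raw, preprocessed, refined) come from the data flow itself.

```python
from dataclasses import dataclass

# Stage directories named in the data flow above.
STAGES = ("raw", "preprocessed", "refined")


@dataclass(frozen=True)
class StorageLayout:
    """Hypothetical layout: one directory per pipeline stage under a common root."""

    root: str  # assumed to be a mounted Blob storage path

    def path(self, stage: str, name: str) -> str:
        """Build the storage path for a dataset in a given stage."""
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        return f"{self.root.rstrip('/')}/{stage}/{name}"


def merge_batch_and_stream(spark, layout, key_cols=("event_id",)):
    """Union historical and newly streamed Parquet data, drop rows with
    duplicate keys, and write the result to the preprocessed directory.

    `spark` is an existing SparkSession (Databricks notebooks expose one
    as `spark`). `key_cols` is an assumed de-duplication key.
    """
    batch = spark.read.parquet(layout.path("raw", "batch"))
    stream = spark.read.parquet(layout.path("raw", "stream"))
    # unionByName matches columns by name rather than by position.
    merged = batch.unionByName(stream).dropDuplicates(list(key_cols))
    out = layout.path("preprocessed", "merged")
    merged.write.mode("overwrite").parquet(out)
    return out
```

In a Databricks notebook this might be called as `merge_batch_and_stream(spark, StorageLayout("/mnt/demo"))`, where `/mnt/demo` is a placeholder mount point. Keeping the path logic separate from the Spark logic makes the layout easy to check without a cluster.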

Tutorials:

Q&A

If you have questions/concerns or would like to chat, contact us: