HDInsight Developer's Guide

This guide provides a curated set of documentation for developers, data scientists, and big data engineers who are getting started with Azure HDInsight or deepening their experience with it.

The delivery goal is to package this online content as a digital book.

The table of contents follows. Links to new content open in the same window and stay within GitHub, while links to existing content that will soon be merged into this repo open in the Azure Docs.

Overview

What is Azure HDInsight?

Iterative data exploration

Data Warehouse on demand

ETL at scale

Streaming at scale

Machine learning

Batch & Interactive Processing

Run Custom Programs

Upload Data to HDInsight

Azure HDInsight and Hadoop Architecture

HDInsight Architecture

Hadoop Architecture

Lifecycle of an HDInsight Cluster

High availability model

Capacity planning

Configuring the Cluster

Use SSH with HDInsight

Use SSH tunneling

Use HDInsight in a Virtual Network

Scaling best practices

Configuring Hive and Oozie Metadata Storage

Configuring Identity and Access Controls

Manage authorized Ambari users

Authorize user access to Ranger

Add ACLs at the file and folder levels

Sync users from Azure Active Directory to cluster

Use on-demand HDInsight clusters from Data Factory

Monitoring and managing the HDInsight cluster

Key scenarios to monitor

Monitoring and managing with Ambari

Monitoring with the Ambari REST API

Administering HDInsight using the Azure Portal

Manage configurations with Ambari

Manage cluster logs

Adding storage accounts to a running cluster

Use script actions to customize cluster setup

Develop script actions

OS patching for HDInsight cluster

Developing Hive applications

Hive and ETL Overview

Connect to Hive with JDBC or ODBC

Writing Hive applications using Java

Writing Hive applications using Python

Creating user defined functions

Process and analyze JSON documents with Hive

Hive samples

Query Hive using Excel

Analyze stored sensor data using Hive

Analyze stored tweets using beeline and Hive

Analyze flight delay data with Hive

Analyze website logs with Hive

Developing Spark applications

Spark Scenarios

Use Spark with HDInsight

Use Spark SQL with HDInsight

Run Spark from the Shell

Use Spark with notebooks

Use Zeppelin notebooks with Spark

Use Jupyter notebook with Spark

Use external packages with Jupyter using cell magic

Use external packages with Jupyter using script action

Use Spark with IntelliJ

Create apps using the Azure Toolkit for IntelliJ

Debug jobs remotely with IntelliJ

Spark samples

Analyze Application Insights telemetry with Spark

Analyze website logs with Spark SQL

Developing Spark ML applications

Creating Spark ML Pipelines

Creating Spark ML models in notebooks

Deep Learning with Spark

Use Caffe for deep learning with Spark

Developing R scripts on HDInsight

What is R Server?

Selecting a compute context

Analyze data from Azure Storage and Data Lake Store using R

Submit jobs from Visual Studio Tools for R

Submit R jobs from RStudio Server

Developing Spark Streaming applications

What is Spark Streaming (DStreams)?

What is Spark Structured Streaming?

Use Spark DStreams to process events from Kafka

Use Spark DStreams to process events from Event Hubs

Use Spark Structured Streaming to process events from Kafka

Use Spark Structured Streaming to process events from Event Hubs

Creating highly available Spark Streaming jobs in YARN

Creating Spark Streaming jobs with exactly once event processing guarantees

Optimizing Spark Performance

Optimizing and configuring Spark jobs for performance

Configuring Spark settings

Choosing between Spark RDD, DataFrame, and Dataset

Use HBase

What is HBase?

Understanding the HBase storage options

Using the HBase shell

Using the HBase REST SDK

Configure HBase backup and replication

Using Spark with HBase

Monitor HBase with OMS

Use Phoenix with HBase on HDInsight

Phoenix in HDInsight

Get started using Phoenix with SQLLine

Bulk Loading with Phoenix with psql

Using Spark with Phoenix

Using the Phoenix Query Server REST SDK

Phoenix performance monitoring

Apache Open Source Ecosystem

Install HDInsight apps

Install and use Dataiku

Install and use Datameer

Install and use H2O

Install and use StreamSets

Install and use Cask

Advanced Scenarios and Deep Dives

Advanced Analytics Deep Dive

ETL Deep Dive

Operationalize Data Pipelines with Oozie

Troubleshooting

Troubleshooting a failed or slow HDInsight cluster

Debug jobs by analyzing HDInsight logs

Debug Tez jobs using Hive views in Ambari

Common problems FAQ