HDInsight Developer's Guide
This guide provides a curated set of documentation for developers, data scientists, and big data engineers who are getting started with Azure HDInsight or deepening their experience with it.
The delivery goal of this guide is to package this online content as a digital book.
The table of contents follows. Links to new content open in the same window and stay in GitHub, while links to existing content that will soon be merged into this repo open in Azure Docs.
Overview
What is Azure HDInsight?
Iterative data exploration
Data Warehouse on demand
ETL at scale
Streaming at scale
Machine learning
Batch & Interactive Processing
Run Custom Programs
Upload Data to HDInsight
Azure HDInsight and Hadoop Architecture
HDInsight Architecture
Hadoop Architecture
Lifecycle of an HDInsight Cluster
High availability model
Capacity planning
Configuring the Cluster
Use SSH with HDInsight
Use SSH tunneling
Use HDInsight in a Virtual Network
Scaling best practices
Configuring Hive and Oozie Metadata Storage
Configuring Identity and Access Controls
Manage authorized Ambari users
Authorize user access to Ranger
Add ACLs at the file and folder levels
Sync users from Azure Active Directory to cluster
Use on-demand HDInsight clusters from Data Factory
Monitoring and managing the HDInsight cluster
Key scenarios to monitor
Monitoring and managing with Ambari
Monitoring with the Ambari REST API
Administering HDInsight using the Azure portal
Manage configurations with Ambari
Manage cluster logs
Adding storage accounts to a running cluster
Use script actions to customize cluster setup
Develop script actions
OS patching for HDInsight cluster
Developing Hive applications
Hive and ETL Overview
Connect to Hive with JDBC or ODBC
Writing Hive applications using Java
Writing Hive applications using Python
Creating user-defined functions
Process and analyze JSON documents with Hive
Hive samples
Query Hive using Excel
Analyze stored sensor data using Hive
Analyze stored tweets using Beeline and Hive
Analyze flight delay data with Hive
Analyze website logs with Hive
Developing Spark applications
Spark Scenarios
Use Spark with HDInsight
Use Spark SQL with HDInsight
Run Spark from the Shell
Use Spark with notebooks
Use Zeppelin notebooks with Spark
Use Jupyter notebook with Spark
Use external packages with Jupyter using cell magic
Use external packages with Jupyter using script action
Use Spark with IntelliJ
Create apps using the Azure Toolkit for IntelliJ
Debug jobs remotely with IntelliJ
Spark samples
Analyze Application Insights telemetry with Spark
Analyze website logs with Spark SQL
Developing Spark ML applications
Creating Spark ML Pipelines
Creating Spark ML models in notebooks
Deep Learning with Spark
Use Caffe for deep learning with Spark
Developing R scripts on HDInsight
What is R Server?
Selecting a compute context
Analyze data from Azure Storage and Data Lake Store using R
Submit jobs from Visual Studio Tools for R
Submit R jobs from RStudio Server
Developing Spark Streaming applications
What is Spark Streaming (DStreams)?
What is Spark Structured Streaming?
Use Spark DStreams to process events from Kafka
Use Spark DStreams to process events from Event Hubs
Use Spark Structured Streaming to process events from Kafka
Use Spark Structured Streaming to process events from Event Hubs
Creating highly available Spark Streaming jobs in YARN
Creating Spark Streaming jobs with exactly-once event processing guarantees
Optimizing Spark Performance
Optimizing and configuring Spark jobs for performance
Configuring Spark settings
Choosing between Spark RDD, DataFrame, and Dataset
Use HBase
What is HBase?
Understanding the HBase storage options
Using the HBase shell
Using the HBase REST SDK
Configure HBase backup and replication
Using Spark with HBase
Monitor HBase with OMS
Use Phoenix with HBase on HDInsight
Phoenix in HDInsight
Get started using Phoenix with SQLLine
Bulk loading with Phoenix using psql
Using Spark with Phoenix
Using the Phoenix Query Server REST SDK
Phoenix performance monitoring
Apache Open Source Ecosystem
Install HDInsight apps
Install and use Dataiku
Install and use Datameer
Install and use H2O
Install and use StreamSets
Install and use Cask
Advanced Scenarios and Deep Dives
Advanced Analytics Deep Dive
ETL Deep Dive
Operationalize Data Pipelines with Oozie
Troubleshooting
Troubleshooting a failed or slow HDInsight cluster
Debug jobs by analyzing HDInsight logs
Debug Tez jobs using Hive views in Ambari
Common problems FAQ