Hive-Demo
Following along with the Hive tutorial at StrataConf / HadoopWorld
README for “Hadoop Data Warehousing with Hive”
Strata + Hadoop World 2012 Tutorial Exercises
Dean Wampler
[email protected]
@thinkBigA
Welcome! Please follow these instructions to download the tutorial presentation and exercises.
About this Hive Tutorial
This Hive Tutorial is adapted from a longer Think Big Academy course on Hive. (The Academy is the education arm of Think Big Analytics.) We offer various public and private courses on Hadoop programming, Hive, Pig, etc. We also provide consulting on Big Data problems and their solutions, especially using Hadoop. If you want to learn more, visit thinkbiganalytics.com or send us email.
We’ll log into Amazon Elastic MapReduce (EMR) clusters[1] to do the exercises. Feel free to pair program with a neighbor, if you want.
NOTE: The exercises should work with any version of Hive, v0.7.1 or later.
Getting Started
Download the following zip file that contains a PDF of the tutorial presentation, the exercises, the data used for the exercises, and a Hive cheat sheet:
Unzip the tutorial.zip in a convenient place on your laptop.
If you are on Windows, you’ll need the ssh client application putty to log into the EMR servers. You can download and install it from here:
Manifest for Tutorial Zip File
| Item | Whazzat? |
|---|---|
| `README.html` | What you’re reading! |
| `ThinkBigAcademy-Hive-Tutorial.pdf` | The tutorial presentation. |
| `exercises` | The exercises we’ll use. They are also installed on the clusters, but you’ll open them “locally” in an editor, then use copy and paste. |
| `data` | The data files we’ll use. They are here only for your reference later. We’ll use the copies already on the clusters. |
| `HiveCheatSheat.html` | A Hive cheat sheet. |
| `exercises/.hiverc` | Drop this file in the home directory on any machine where you will normally run the hive command-line interface (CLI). Hive runs the commands it contains when it starts, so this file is a great place for commands you always want run on startup, such as property settings. Already installed on the clusters. |
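For illustration, here is what a `.hiverc` might contain. These are hypothetical contents, not the file shipped with the tutorial, which may differ:

```sql
-- Print column headers in query results
set hive.cli.print.header=true;
-- Show the current database in the prompt (requires Hive 0.8 or later)
set hive.cli.print.current.db=true;
```

Any `set` commands or other Hive CLI statements placed here run automatically each time the CLI starts.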
Log into one of the Amazon Elastic MapReduce Clusters
We have several EMR clusters running and you’ll log into one of them according to the first one or two letters of your last name, using the following table[2]:
| Letters | Server Name | JobFlow ID |
|---|---|---|
| A | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| Ba - Bh | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| Bi - Bz | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| Ca - Ch | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| Ci - Cz | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| D | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| E - F | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| G | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| H | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| I - J | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| K - L | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| Ma - Mh | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| Mi - Mz | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| N - P | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| Q - R | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| Sa - Sh | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| Si - Sz | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| T - V | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| Wa - Wh | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| Wi - Z | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
(We’ll explain the JobFlow ID later.)
Once you have picked the correct server, use the following ssh command on Linux or Mac OS X (or the equivalent putty command on Windows) to log into your server. You’ll be user hadoop:
ssh [email protected]
The password is:
strata
Finally, since you are sharing the primary user account on the cluster, create a personal work directory with mkdir for any file editing you’ll do today. Pick a directory name without spaces, e.g., your usual user name. You will use that same name for another purpose shortly, as we’ll see. After creating the directory, change to it with the cd command:
mkdir myusername
cd myusername
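Because the account is shared, the directory name you pick may already exist from someone else’s earlier session. A defensive variant of the two commands above (the name `alice` is just a placeholder; substitute your own):

```shell
# -p: succeed quietly if the directory already exists
mkdir -p alice
cd alice
pwd    # confirm where you are before creating any files
```

With `-p`, mkdir won’t fail with an error if the directory is already there, so the sequence is safe to rerun.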
Please don’t break anything! ;^) Remember, you’re sharing this cluster.
Feel free to snoop around if you’re waiting for others. Note that all the Hadoop software is installed in the hadoop user’s $HOME directory, /home/hadoop.
Quick Cheat Sheet on Linux Shell Commands
If you’re not accustomed to the Linux or Mac OS X bash shell, here are a few hints[3]:
Print your current working directory
pwd
List the contents of a directory
Add the -l option to show a longer listing with more information. If you omit the directory, the current directory is used:
ls some-directory
ls -l some-directory
Change to a different directory
Four variants, using: i) an absolute path, ii) a subdirectory of the current directory, iii) the parent directory of the current directory, and iv) your home directory:
cd /home/hadoop
cd exercises
cd ..
cd ~
Page through the contents of a file
Hit the space bar to page, q to quit:
more some-file
Dump the contents without paging
I.e., “concatenate” or “cat” the file:
cat some-file
For More Information
For more information on Amazon Elastic MapReduce commands, see the Quick Reference Guide and the Developer Guide.
For more details on Hive, see Programming Hive or the Hive Wiki.
1. Visit The AWS EMR Page and the EMR Documentation page for more information about EMR. ↩
2. I used the following information to determine a good distribution of users across these clusters. Note that these EMR clusters will only be available during the time of the tutorial. ↩
3. You should learn how to use bash if you want to use Hadoop. ↩