Hive-Demo
Following along with the Hive tutorial at StrataConf / HadoopWorld
README for “Hadoop Data Warehousing with Hive”
Strata + Hadoop World 2012 Tutorial Exercises
Dean Wampler
[email protected]
@thinkBigA
Welcome! Please follow these instructions to download the tutorial presentation and exercises.
About this Hive Tutorial
This Hive Tutorial is adapted from a longer Think Big Academy course on Hive. (The Academy is the education arm of Think Big Analytics.) We offer various public and private courses on Hadoop programming, Hive, Pig, etc. We also provide consulting on Big Data problems and their solutions, especially using Hadoop. If you want to learn more, visit thinkbiganalytics.com or send us email.
We’ll log into Amazon Elastic MapReduce (EMR) clusters[1] to do the exercises. Feel free to pair program with a neighbor, if you want.
NOTE: The exercises should work with any version of Hive, v0.7.1 or later.
Getting Started
Download the following zip file that contains a PDF of the tutorial presentation, the exercises, the data used for the exercises, and a Hive cheat sheet:
Unzip the tutorial.zip in a convenient place on your laptop.
If you are on Windows, you’ll need the ssh client application putty to log into the EMR servers. You can download and install it from here:
Manifest for Tutorial Zip File
| Item | Whazzat? |
|---|---|
| `README.html` | What you’re reading! |
| `ThinkBigAcademy-Hive-Tutorial.pdf` | The tutorial presentation. |
| `exercises` | The exercises we’ll use. They are also installed on the clusters, but you’ll open them “locally” in an editor, then use copy and paste. |
| `data` | The data files we’ll use. They are here only for your reference later. We’ll use the copies already on the clusters. |
| `HiveCheatSheat.html` | A Hive cheat sheet. |
| `exercises/.hiverc` | Drop this file in the home directory on any machine where you will normally run the hive command-line interface (CLI). Hive runs the commands it contains when it starts, so this file is a great place for commands you always want run on startup, such as property settings. Already installed on the clusters. |
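For illustration, here is what a `.hiverc` might contain. These are hypothetical contents, not the file shipped with the tutorial, which may differ:

```sql
-- Print column headers in query results
set hive.cli.print.header=true;
-- Show the current database in the prompt (requires Hive 0.8 or later)
set hive.cli.print.current.db=true;
```

Any `set` commands or other Hive CLI statements placed here run automatically each time the CLI starts.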
Log into one of the Amazon Elastic MapReduce Clusters
We have several EMR clusters running and you’ll log into one of them according to the first one or two letters of your last name, using the following table[2]:
| Letters | Server Name | JobFlow ID |
|---|---|---|
| A | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| Ba - Bh | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| Bi - Bz | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| Ca - Ch | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| Ci - Cz | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| D | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| E - F | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| G | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| H | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| I - J | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| K - L | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| Ma - Mh | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| Mi - Mz | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| N - P | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| Q - R | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| Sa - Sh | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| Si - Sz | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| T - V | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| Wa - Wh | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
| Wi - Z | ec2-50-19-185-170.compute-1.amazonaws.com | j-1R3E26P0T3IBK |
(We’ll explain the JobFlow ID later.)
Once you have picked the correct server, use the following ssh command on Linux or Mac OS X (or the equivalent putty command on Windows) to log into your server. You’ll be user hadoop:
ssh [email protected]
The password is:
strata
Finally, since you are sharing the primary user account on the cluster, create a personal work directory with mkdir for any file editing you’ll do today. Pick a directory name without spaces, e.g., your usual user name. You will use that same name for another purpose shortly, as we’ll see. After creating the directory, change to it with the cd command:
mkdir myusername
cd myusername
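Because the account is shared, the directory name you pick may already exist from someone else’s earlier session. A defensive variant of the two commands above (the name `alice` is just a placeholder; substitute your own):

```shell
# -p: succeed quietly if the directory already exists
mkdir -p alice
cd alice
pwd    # confirm where you are before creating any files
```

With `-p`, mkdir won’t fail with an error if the directory is already there, so the sequence is safe to rerun.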
Please don’t break anything! ;^) Remember, you’re sharing this cluster.
Feel free to snoop around if you’re waiting for others. Note that all the Hadoop software is installed in the hadoop user’s $HOME directory, /home/hadoop.
Quick Cheat Sheet on Linux Shell Commands
If you’re not accustomed to the Linux or Mac OS X bash shell, here are a few hints[3]:
Print your current working directory
pwd
List the contents of a directory
Add the -l option to show a longer listing with more information. If you omit the directory, the current directory is used:
ls some-directory
ls -l some-directory
Change to a different directory
Four variants, using: i) an absolute path, ii) a subdirectory of the current directory, iii) the parent directory of the current directory, and iv) your home directory:
cd /home/hadoop
cd exercises
cd ..
cd ~
Page through the contents of a file
Hit the space bar to page, q to quit:
more some-file
Dump the contents without paging
I.e., “concatenate” or “cat” the file:
cat some-file
For More Information
For more information on Amazon Elastic MapReduce commands, see the Quick Reference Guide and the Developer Guide.
For more details on Hive, see Programming Hive or the Hive Wiki.
1. Visit The AWS EMR Page and the EMR Documentation page for more information about EMR. ↩
2. I used the following information to determine a good distribution of users across these clusters. Note that these EMR clusters will only be available during the time of the tutorial. ↩
3. You should learn how to use bash if you want to use Hadoop. ↩