learning-hadoop-and-spark
Companion to the Learning Hadoop and Learning Spark courses on LinkedIn Learning
Learning Hadoop and Spark
Contents
This is the companion repo to my LinkedIn Learning Courses on Apache Hadoop and Apache Spark.
1. Learning Hadoop - link
- uses mostly GCP Dataproc
- for running Hadoop & associated libraries (i.e. Hive, Pig, Spark...) workloads
2. Cloud Hadoop: Scaling Apache Spark - link
- uses GCP Dataproc, AWS EMR --or--
- Databricks on AWS
3. Azure Databricks Spark Essential Training - link
- uses Azure with Databricks
- for scaling Apache Spark workloads
Development Environment Setup Information
You have a number of options. Although it is possible to set up a local Hadoop/Spark cluster, I do NOT recommend this approach, as it's needlessly complex for initial study. Rather, I recommend that you use a partially or fully managed cluster. For learning, I most often use a fully managed (free-tier) cluster.
1. SaaS - Databricks --> MANAGED
Databricks offers managed Apache Spark clusters. Databricks can run on AWS, Azure, or GCP (announced in 2021 - link). In this course, I use Databricks running on AWS, as the Community Edition is simple and fast to set up for learning purposes.

- Use Databricks Community Edition (managed, hosted Apache Spark), run on AWS. Example notebook shown in screenshot above.
- uses Databricks (Jupyter-style) notebooks to connect to one or more custom-sized, managed Spark clusters
- creates and manages your data files, stored in cloud buckets, as part of the Databricks service
- uses the Databricks File System (DBFS) for cluster data operations - see the example notebook cell after this list
- use the Databricks Community Edition on AWS (simplest setup - free tier) - link --OR--
- use the Databricks Azure trial edition - Azure may require a pay-as-you-go account to get the needed CPU/GPU resources
- try Databricks on GCP beta - announced recently - link
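To make the notebook workflow concrete, here is a minimal sketch of a Databricks notebook cell that reads a CSV file from DBFS. The path `dbfs:/FileStore/tables/sample_data.csv` is a hypothetical placeholder, not a file shipped with this repo:

```python
# Databricks notebooks provide a preconfigured SparkSession named `spark`,
# so no SparkSession.builder boilerplate is needed in a notebook cell.

# Hypothetical DBFS path -- upload your own sample file and adjust.
df = spark.read.csv(
    "dbfs:/FileStore/tables/sample_data.csv",
    header=True,       # treat the first row as column names
    inferSchema=True,  # let Spark infer column types (fine for small files)
)

df.show(5)         # display the first five rows
print(df.count())  # total row count
```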
2. PaaS Cloud on GCP (or AWS) --> PARTIALLY-MANAGED

- Set up a managed Hadoop/Spark cloud cluster via GCP Dataproc or AWS EMR (a sample PySpark job you could submit appears after this list)
- see the `setup-hadoop` folder in this repo for instructions/scripts
- create a GCS (or AWS S3) bucket for input/output job data
- see the `example_datasets` folder in this repo for sample data files
- for GCP, use Dataproc, which includes a Jupyter notebook interface --OR--
- for AWS, use EMR with EMR Studio (which includes managed Jupyter instances) - link; example screenshot shown above
- for Azure, it is possible to use the HDInsight service, but I prefer Databricks on Azure because I find it more feature-complete and performant
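As a sketch of the kind of job you would submit to one of these clusters, here is a minimal PySpark word count that reads from and writes to a cloud bucket. The bucket and object names (`my-bucket`, `input/sample.txt`) are placeholders for your own GCS paths (use `s3://...` paths on EMR):

```python
from pyspark.sql import SparkSession

# A standalone script must create its own SparkSession
# (unlike Databricks or Jupyter notebooks, which provide one).
spark = SparkSession.builder.appName("wordcount").getOrCreate()

# Placeholder paths -- swap in the bucket you created in the step above.
lines = spark.read.text("gs://my-bucket/input/sample.txt").rdd.map(lambda r: r[0])

counts = (
    lines.flatMap(lambda line: line.split())  # split each line into words
         .map(lambda word: (word, 1))         # pair each word with a count of 1
         .reduceByKey(lambda a, b: a + b)     # sum the counts per word
)

counts.saveAsTextFile("gs://my-bucket/output/wordcount")
spark.stop()
```

On Dataproc, a script like this can be submitted with `gcloud dataproc jobs submit pyspark wordcount.py --cluster=<your-cluster> --region=<your-region>`.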
3. IaaS local or cloud --> MANUAL
- Set up Hadoop/Spark locally or on a 'raw' cloud VM, such as AWS EC2
- NOT RECOMMENDED for learning - too complex to set up
- Cloudera Learning VM - also NOT recommended; it changes too often and the documentation is not kept aligned
Example Jobs or Scripts
Examples from the `org.apache.hadoop.examples` or `org.apache.spark.examples` packages
- link for Spark examples
- Run a Hadoop WordCount Job with Java (jar file)
- Run a Hadoop and/or Spark CalculatePi (digits) script with PySpark or other libraries - see the PySpark sketch after this list
- Run using the Cloudera shared demo environment at https://demo.gethue.com/ (login is user: `demo`, pwd: `demo`)
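For reference, the standard Spark Pi example estimates pi by Monte Carlo sampling: pick random points in the unit square and count the fraction that falls inside the quarter circle, which approximates pi/4. Below is a minimal PySpark sketch of that approach (not necessarily the exact script used in the course); `NUM_SAMPLES` is an arbitrary choice, and accuracy improves with more samples:

```python
import random
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("calculate-pi").getOrCreate()
sc = spark.sparkContext

NUM_SAMPLES = 1_000_000  # arbitrary; more samples give a better estimate

def inside(_):
    # Pick a random point in the unit square and test whether it
    # falls inside the quarter circle of radius 1.
    x, y = random.random(), random.random()
    return x * x + y * y < 1

# The fraction of sampled points inside the quarter circle approximates pi/4.
count = sc.parallelize(range(NUM_SAMPLES)).filter(inside).count()
print(f"Pi is roughly {4.0 * count / NUM_SAMPLES}")

spark.stop()
```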
Other LinkedIn Learning Courses on Hadoop or Spark
There are roughly 10 courses on Hadoop/Spark topics on LinkedIn Learning; see the graphic below.
- Hadoop for Data Science Tips and Tricks - link
- Set up Cloudera Environment
- Working with Files in HDFS
- Connecting to Hadoop Hive
- Complex Data Structures in Hive
- Spark courses - link
- Various Topics - see screenshot below