
Tutorials on Big Data essentials: Hadoop, MapReduce, Spark.

big_data

Big Data for beginners

Explore a variety of tutorials and interactive demonstrations focused on Big Data technologies like Hadoop, Spark, and more, primarily presented in the format of Jupyter notebooks.

Setting Up Hadoop: Single-Node Configuration

  • Hadoop_Setting_up_a_Single_Node_Cluster.ipynb Set up a single-node Hadoop cluster on Google Colab and run some basic HDFS and MapReduce examples
  • Hadoop_single_node_cluster_setup_Python.ipynb Set up a single-node Hadoop cluster on Google Colab using Python
  • Hadoop_minicluster.ipynb Deploy a test Hadoop cluster with a single command and no configuration

Running Apache Spark in Standalone Mode

  • Hadoop_Setting_up_Spark_Standalone_on_Google_Colab.ipynb Set up a single-node Spark server on Google Colab and estimate π with a Monte Carlo method
  • Run_Spark_on_Google_Colab.ipynb Set up a single-node standalone Spark server on Google Colab, including Web UI and History Server (compact version)
  • Spark_Standalone_Architecture_on_Google_Colab.ipynb Explore the Spark architecture by deploying a standalone setup
  • PySpark_On_Google_Colab.ipynb Explore the inner workings of PySpark on Google Colab
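
For readers who want a feel for the π estimate before opening the Spark notebook, here is a minimal single-process sketch of the Monte Carlo idea in plain Python. It is not the notebook's code and does not use Spark; the function name `estimate_pi` and its parameters are assumptions made for illustration.

```python
import random


def estimate_pi(num_samples: int, seed: int = 42) -> float:
    """Estimate pi by sampling points uniformly in the unit square.

    The fraction of points that fall inside the quarter circle of
    radius 1 approximates pi / 4.
    """
    rng = random.Random(seed)  # seeded so that runs are reproducible
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples


if __name__ == "__main__":
    print(estimate_pi(100_000))
```

In a distributed setting the sampling loop would typically be split across workers and the per-worker counts summed, which is what makes this a convenient first Spark exercise.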

MapReduce Tutorials

  • MapReduce_Primer_HelloWorld.ipynb A MapReduce primer with “Hello, World!”
  • MapReduce_Primer_HelloWorld_bash.ipynb A MapReduce primer with “Hello, World!” in Bash, in just a few lines of code
  • mapreduce_with_bash.ipynb An introduction to MapReduce using MapReduce Streaming and bash to create mapper and reducer
  • simplest_mapreduce_bash_wordcount.ipynb A very basic MapReduce wordcount example
  • mrjob_wordcount.ipynb A simple MapReduce job with mrjob
  • Hadoop_spilling.ipynb Hadoop spilling explained
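
As a rough orientation for the wordcount notebooks above, the three MapReduce phases (map, shuffle, reduce) can be sketched in plain Python without Hadoop. This is an in-memory illustration under simplifying assumptions (whitespace tokenization, lowercasing); the helper names `mapper`, `shuffle`, `reducer`, and `word_count` are hypothetical and do not come from the notebooks.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle phase: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reducer(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce phase: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce over an iterable of text lines."""
    pairs = (pair for line in lines for pair in mapper(line))
    grouped = shuffle(pairs)
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    print(word_count(["hello world", "Hello Hadoop"]))
```

With Hadoop Streaming, the mapper and reducer would instead be separate programs reading from standard input, and the framework would perform the shuffle between them.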

PySpark Tutorials

  • demoSparkSQLPython.ipynb Basic PySpark demo
  • ngrams_with_pyspark.ipynb Basic example of n-gram extraction with PySpark
  • Encoding+dataframe+columns.ipynb DataFrame column encoding with PySpark and the Parquet format
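
To make the term concrete: an n-gram is a run of n consecutive tokens. The sketch below extracts n-grams from a token list in plain Python; it is only an illustration of the concept, not the PySpark approach used in the notebook, and the function name `ngrams` is an assumption.

```python
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return all runs of n consecutive tokens, in order of appearance."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence shorter than n yields an empty range, hence no n-grams.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    print(ngrams("to be or not to be".split(), 2))
```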

Miscellaneous Tutorials

  • GutenbergBooks.ipynb Explore and download books from the Gutenberg books collection
  • generate_data_with_Faker.ipynb Data generation and aggregation with Python's Faker library and PySpark
  • TestDFSio.ipynb Demo of TestDFSIO for benchmarking Hadoop clusters
  • Unicode.ipynb Exploring Unicode categories (live on Binder)
  • polynomial_regression.ipynb Worked-out example of polynomial regression with NumPy
  • Apache_Sedona_with_PySpark.ipynb Apache Sedona™ is a high-performance cluster computing system for processing large-scale spatial data, extending the capabilities of Apache Spark for advanced geospatial analytics. Run a basic example with PySpark on Google Colab.

Virtualization and Cloud Automation

  • docker_for_beginners.md Docker for beginners: an introduction to the world of containers
  • Terraform for beginners.md Getting started with Terraform
  • Terraform in 5 minutes A short introduction to Terraform, the popular tool for infrastructure provisioning and management

Big Data Learning Pathways

  • online_resources.md Online resources for learning Big Data

About this repository

Notebooks Testing and CI

Most executable Jupyter notebooks are tested on an Ubuntu virtual machine through an automated GitHub workflow. The log file for successful executions is named action_log.txt.

Current status: Run Notebooks on Ubuntu

The GitHub workflow is a starting point for what is known as Continuous Integration (CI) in DevOps and platform engineering circles.