big_data
Tutorials on Big Data essentials: Hadoop, MapReduce, Spark.
Big Data for beginners
Explore a variety of tutorials and interactive demonstrations of Big Data technologies such as Hadoop and Spark, mostly presented as Jupyter notebooks.
Setting Up Hadoop: Single-Node Configuration
- Hadoop_Setting_up_a_Single_Node_Cluster.ipynb Set up a single-node Hadoop cluster on Google Colab and run some basic HDFS and MapReduce examples (a short HDFS sketch in Python follows this list)
- Hadoop_single_node_cluster_setup_Python.ipynb Set up a single-node Hadoop cluster on Google Colab using Python
- Hadoop_minicluster.ipynb Deploy a test Hadoop cluster with a single command and no need for configuration.
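For orientation, the basic HDFS steps those notebooks walk through can also be driven from Python by shelling out to the `hdfs` CLI. This is a minimal sketch, assuming a working single-node installation with `hdfs` on the PATH; the `/user/demo` paths and `local_file.txt` are placeholders for the example.

```python
import subprocess

def hdfs(*args):
    """Run an `hdfs dfs` subcommand and print its output."""
    result = subprocess.run(
        ["hdfs", "dfs", *args], capture_output=True, text=True, check=True
    )
    print(result.stdout)

# Placeholder paths, for illustration only.
hdfs("-mkdir", "-p", "/user/demo/input")                    # create an input directory in HDFS
hdfs("-put", "-f", "local_file.txt", "/user/demo/input/")   # copy a local file into HDFS
hdfs("-ls", "/user/demo/input")                             # list the uploaded files
hdfs("-cat", "/user/demo/input/local_file.txt")             # read the file back from HDFS
```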
Running Apache Spark in Standalone Mode
- Hadoop_Setting_up_Spark_Standalone_on_Google_Colab.ipynb Set up a single-node Spark server on Google Colab and estimate π with a Monte Carlo method (see the PySpark sketch after this list)
- Run_Spark_on_Google_Colab.ipynb Set up a single-node standalone Spark server on Google Colab, including the Web UI and History Server (compact version)
- Spark_Standalone_Architecture_on_Google_Colab.ipynb Explore the Spark architecture through the immersive experience of deploying a standalone setup.
- PySpark_On_Google_Colab.ipynb Explore the inner workings of PySpark on Google Colab
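For reference, the Monte Carlo estimate of π from the first notebook above fits in a few lines of PySpark. This is a minimal sketch, assuming a local standalone session; `NUM_SAMPLES` and the app name are arbitrary choices for the example.

```python
import random
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session; on Colab this would point at the standalone master.
spark = SparkSession.builder.master("local[*]").appName("PiEstimate").getOrCreate()
sc = spark.sparkContext

NUM_SAMPLES = 1_000_000  # arbitrary sample size for the example

def inside(_):
    # Draw a random point in the unit square; test whether it falls inside the quarter circle.
    x, y = random.random(), random.random()
    return x * x + y * y <= 1.0

count = sc.parallelize(range(NUM_SAMPLES)).filter(inside).count()
print(f"Pi is roughly {4.0 * count / NUM_SAMPLES}")

spark.stop()
```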
MapReduce Tutorials
- MapReduce_Primer_HelloWorld.ipynb A MapReduce Primer with “Hello, World!”
- MapReduce_Primer_HelloWorld_bash.ipynb A MapReduce Primer with “Hello, World!” in Bash, in just a few lines of code
- mapreduce_with_bash.ipynb An introduction to MapReduce using Hadoop Streaming and Bash to create the mapper and the reducer
- simplest_mapreduce_bash_wordcount.ipynb A very basic MapReduce wordcount example
- mrjob_wordcount.ipynb A simple MapReduce job with mrjob (see the word-count sketch after this list)
- Hadoop_spilling.ipynb Hadoop spilling explained
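To give a flavour of the mrjob notebook, a classic word count can be expressed as a single `MRJob` subclass. This is a minimal sketch; the class name and regex are chosen for the example, and the job would be run locally with something like `python wordcount.py input.txt`.

```python
import re
from mrjob.job import MRJob

WORD_RE = re.compile(r"[\w']+")

class MRWordCount(MRJob):
    def mapper(self, _, line):
        # Emit (word, 1) for every word in the input line.
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Sum the counts for each word across all mappers.
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()
```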
PySpark Tutorials
- demoSparkSQLPython.ipynb PySpark basic demo
- ngrams_with_pyspark.ipynb Basic example of n-gram extraction with PySpark (see the sketch after this list)
- Encoding+dataframe+columns.ipynb DataFrame column encoding with PySpark and the Parquet format (a second sketch after this list illustrates the idea)
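As a rough sketch of what the n-gram notebook covers, bigrams can be extracted with `Tokenizer` and `NGram` from `pyspark.ml.feature`; the toy sentences and column names below are invented for the example.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, NGram

spark = SparkSession.builder.master("local[*]").appName("NGramsDemo").getOrCreate()

# Tiny made-up corpus for the illustration.
df = spark.createDataFrame(
    [(0, "big data tools like hadoop and spark"),
     (1, "spark makes distributed computing approachable")],
    ["id", "text"],
)

# Tokenize, then slide a window of size 2 over the tokens to get bigrams.
tokens = Tokenizer(inputCol="text", outputCol="words").transform(df)
bigrams = NGram(n=2, inputCol="words", outputCol="bigrams").transform(tokens)
bigrams.select("bigrams").show(truncate=False)

spark.stop()
```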
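And a similarly hedged sketch of the column-encoding idea: a string column is mapped to numeric category indices with `StringIndexer` and the result is persisted as Parquet. The column names and output path are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.master("local[*]").appName("ColumnEncoding").getOrCreate()

# Toy DataFrame with a single categorical string column.
df = spark.createDataFrame([("red",), ("green",), ("red",), ("blue",)], ["color"])

# Fit an index on the distinct values and add an encoded numeric column.
indexed = StringIndexer(inputCol="color", outputCol="color_idx").fit(df).transform(df)

# Write the encoded DataFrame in Parquet format (placeholder path).
indexed.write.mode("overwrite").parquet("/tmp/colors_encoded.parquet")

spark.stop()
```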
Miscellaneous Tutorials
- GutenbergBooks.ipynb Explore and download books from the Gutenberg books collection.
- generate_data_with_Faker.ipynb Data generation and aggregation with Python's Faker library and PySpark (see the sketch after this list)
- TestDFSio.ipynb Demo of TestDFSio for benchmarking Hadoop clusters
- Unicode.ipynb Exploring Unicode categories
- polynomial_regression.ipynb Worked-out example of polynomial regression with NumPy (a second sketch after this list shows the fit)
- Apache_Sedona_with_PySpark.ipynb Apache Sedona™ is a high-performance cluster computing system for processing large-scale spatial data, extending the capabilities of Apache Spark for advanced geospatial analytics. Run a basic example with PySpark on Google Colab.
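A minimal sketch of the Faker-plus-PySpark workflow described above: generate synthetic rows with Faker, load them into a DataFrame, and aggregate. The schema, record count, and age range are arbitrary choices for this example.

```python
from faker import Faker
from pyspark.sql import SparkSession, functions as F

fake = Faker()
spark = SparkSession.builder.master("local[*]").appName("FakerDemo").getOrCreate()

# Generate a small batch of synthetic "people" records.
rows = [(fake.name(), fake.country(), fake.random_int(min=18, max=90)) for _ in range(1000)]
df = spark.createDataFrame(rows, ["name", "country", "age"])

# Simple aggregation: average age per country, highest first.
df.groupBy("country").agg(F.avg("age").alias("avg_age")).orderBy(F.desc("avg_age")).show(5)

spark.stop()
```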
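And a compact sketch of the polynomial regression example: fit a degree-2 polynomial to synthetic noisy data with `np.polyfit` and evaluate it with `np.polyval`. The true coefficients and noise level are made up for the illustration.

```python
import numpy as np

# Noisy samples of a quadratic (synthetic data for this sketch).
rng = np.random.default_rng(42)
x = np.linspace(-3, 3, 50)
y = 2.0 * x**2 - x + 1.0 + rng.normal(scale=1.0, size=x.shape)

# Least-squares fit of a degree-2 polynomial, then evaluate it on the inputs.
coeffs = np.polyfit(x, y, deg=2)   # highest-degree coefficient first
y_hat = np.polyval(coeffs, x)

print("fitted coefficients:", coeffs)
print("residual RMS:", np.sqrt(np.mean((y - y_hat) ** 2)))
```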
Virtualization and Cloud Automation
- docker_for_beginners.md Docker for beginners: an introduction to the world of containers
- Terraform for beginners.md Getting started with Terraform
- Terraform in 5 minutes A short introduction to Terraform, the powerful and popular tool for infrastructure provisioning and management
Big Data Learning Pathways
- online_resources.md Online resources for learning Big Data
About this repository
Notebook Testing and CI
Most executable Jupyter notebooks are tested on an Ubuntu virtual machine through an automated GitHub workflow. The log file for successful executions is named action_log.txt.
The GitHub workflow is a starting point for what is known as Continuous Integration (CI) in DevOps/Platform Engineering circles.