big_data

Big Data for beginners

A collection of tutorials and interactive demonstrations on Big Data technologies such as Hadoop and Spark, mostly in the form of Jupyter notebooks.

Setting Up Hadoop: Single-Node Configuration

Hadoop_Setting_up_a_Single_Node_Cluster.ipynb Set up a single-node Hadoop cluster on Google Colab and run some basic HDFS and MapReduce examples
Hadoop_single_node_cluster_setup_Python.ipynb Set up a single-node Hadoop cluster on Google Colab using Python
Hadoop_minicluster.ipynb Deploy a test Hadoop cluster with a single command and no configuration.

Running Apache Spark in Standalone Mode

Hadoop_Setting_up_Spark_Standalone_on_Google_Colab.ipynb Set up a single-node Spark server on Google Colab and estimate π with a Monte Carlo method
Run_Spark_on_Google_Colab.ipynb Set up a single-node standalone Spark server on Google Colab including Web UI and History Server - compact version
Spark_Standalone_Architecture_on_Google_Colab.ipynb Explore the Spark architecture through the immersive experience of deploying a standalone setup.
PySpark_On_Google_Colab.ipynb Explore the inner workings of PySpark on Google Colab
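
The Monte Carlo estimate of π mentioned above can be sketched in plain Python. This is a minimal illustration of the idea, not the notebook's code: the function name `estimate_pi` and its parameters are assumptions made for this example, and a Spark version would distribute the sampling across workers instead of running a single loop.

```python
import random


def estimate_pi(num_samples: int, seed: int = 0) -> float:
    """Estimate pi by sampling points uniformly in the unit square.

    The fraction of points that fall inside the quarter circle of
    radius 1 approximates pi / 4.
    """
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    rng = random.Random(seed)  # seeded generator for reproducible runs
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples


if __name__ == "__main__":
    print(estimate_pi(100_000))
```

The error shrinks roughly with the square root of the number of samples, which is why this computation is a common first example for parallel execution.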

MapReduce Tutorials

MapReduce_Primer_HelloWorld.ipynb A MapReduce Primer with “Hello, World!”
MapReduce_Primer_HelloWorld_bash.ipynb A MapReduce Primer with “Hello, World!” in Bash with just a few lines of code
mapreduce_with_bash.ipynb An introduction to MapReduce using MapReduce Streaming and Bash to create a mapper and a reducer
simplest_mapreduce_bash_wordcount.ipynb A very basic MapReduce wordcount example
mrjob_wordcount.ipynb A simple MapReduce job with mrjob
Hadoop_spilling.ipynb Hadoop spilling explained
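
The word count example that several of these notebooks build on follows three phases: map, shuffle, and reduce. The sketch below simulates them in a single Python process, assuming whitespace tokenization and lowercase normalization. The names `mapper`, `shuffle`, `reducer`, and `word_count` are hypothetical and are not taken from the notebooks; on a real cluster the shuffle phase is performed by Hadoop, not by user code.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reducer(word, counts):
    """Reduce phase: sum the values collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run the three phases in sequence on an iterable of lines."""
    pairs = (pair for line in lines for pair in mapper(line))
    grouped = shuffle(pairs)
    return dict(reducer(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(word_count(["hello world", "Hello Hadoop"]))
```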

PySpark Tutorials

demoSparkSQLPython.ipynb Basic PySpark demo
ngrams_with_pyspark.ipynb Basic example of n-grams extraction with PySpark
Encoding+dataframe+columns.ipynb DataFrame Column Encoding with PySpark and Parquet Format
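
For readers new to n-grams, the following plain-Python sketch shows what extraction means: every run of `n` consecutive tokens in a sequence. The function `ngrams` is a hypothetical helper written for this illustration and does not use PySpark; the notebook itself may take a different approach.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as a list of tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence shorter than n yields an empty list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    print(ngrams("to be or not to be".split(), 2))
```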

Miscellaneous Tutorials

GutenbergBooks.ipynb Explore and download books from the Gutenberg books collection.
generate_data_with_Faker.ipynb Data Generation and Aggregation with Python's Faker Library and PySpark
TestDFSio.ipynb Demo of TestDFSio for benchmarking Hadoop clusters
Unicode.ipynb Exploring Unicode categories
polynomial_regression.ipynb Worked-out example of polynomial regression with NumPy
Apache_Sedona_with_PySpark.ipynb Apache Sedona™ is a high-performance cluster computing system for processing large-scale spatial data, extending the capabilities of Apache Spark for advanced geospatial analytics. Run a basic example with PySpark on Google Colab

Virtualization and Cloud Automation

docker_for_beginners.md Docker for beginners: an introduction to the world of containers
Terraform for beginners.md Getting started with Terraform
Terraform in 5 minutes A short introduction to Terraform, the powerful and popular tool for infrastructure provisioning and management

Big Data Learning Pathways

online_resources.md Online resources for learning Big Data

About this repository

Notebooks Testing and CI

Most executable Jupyter notebooks are tested on an Ubuntu virtual machine through a GitHub automated workflow. The log file for successful executions is named: action_log.txt.

Current status:

The GitHub workflow is a starting point for what is known as Continuous Integration (CI) in DevOps and Platform Engineering circles.
