

Big data technology benchmarks

This project compares big data technologies for cleaning, manipulating, and generally wrangling data for the purpose of analysis and machine learning.

These are the benchmarks behind the accompanying article.

The analysis is done on a 100 GB taxi dataset covering 2009-2015.

Technologies

General Remarks

  • Some notebooks require a restart of the kernel after package installation.
  • Different notebooks run on different kernels; check at the top of each notebook which kernel it uses.
  • The notebooks for technologies that don't run out of core are set to work with only 1M rows (see the sketch after this list).
  • In special cases, notebooks needed to be restarted for optimal performance. That might not be entirely fair, but I wanted to get the most out of each technology.
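A minimal illustration of the 1M-row cap, not the repository's actual code: for a library like pandas, which does not process data out of core, the notebook only reads the first million rows. The file path is a placeholder.

```python
import pandas as pd

# Cap the input at 1M rows for libraries that load everything into memory.
# "data/taxi.csv" is a placeholder path, not the repo's actual file name.
df = pd.read_csv("data/taxi.csv", nrows=1_000_000)
print(f"loaded {len(df):,} rows")
```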

Instructions

  1. Create an S3 bucket for your results (or remove that part of the persist function in the code; a rough sketch of such a helper appears after these steps).
  2. Create an ml.c5d.4xlarge instance on AWS SageMaker with an extra 500 GB of storage.
  3. Run the get_data.ipynb notebook to mount the SSD and download the data.
  4. Run the notebook you want to test.
  • At the beginning of each notebook, make sure the instance name and the S3 bucket name are correct.
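The persist function referenced in step 1 lives in the repository's code; the following is only a sketch of what such a helper could look like, assuming results are pandas DataFrames and using placeholder bucket and prefix names.

```python
import io

import boto3
import pandas as pd

BUCKET = "my-benchmark-results"  # placeholder: replace with your S3 bucket
PREFIX = "results"               # placeholder key prefix

def persist(df: pd.DataFrame, name: str) -> None:
    """Upload a results DataFrame to S3 as CSV.

    Remove or skip this call if you don't want to create an S3 bucket.
    """
    buffer = io.StringIO()
    df.to_csv(buffer, index=False)
    boto3.client("s3").put_object(
        Bucket=BUCKET,
        Key=f"{PREFIX}/{name}.csv",
        Body=buffer.getvalue(),
    )
```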

Good luck!