learn-data-munging
Notes on Data Engineering with Pandas, PySpark, Dask, Ray, Arrow DataFusion, Polars etc.
Data Munging Using *X* in Python, Rust & Julia
Data Engineering Workshops on some of the more popular libraries, frameworks and tech circa 2023-2024.
Data Engineers working with Python, Rust and Julia :P
Notebooks
00 Python Collections
This set of notebooks works through examples of how some pretty sophisticated data engineering can be done using Python Collections, Itertools and Functools. It uses the small MovieLens dataset.
- Basic Collections and the Collections Module: Notebook also
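As a rough flavour of what these notebooks cover, here is a minimal sketch using only the standard library. The rows are inlined toy data in the shape of the MovieLens `movies.csv` and `ratings.csv` files (the rating values are made up for illustration), and the variable names are my own, not taken from the notebooks.

```python
from collections import Counter, defaultdict
from functools import reduce
from itertools import groupby
from operator import itemgetter

# Toy rows in the shape of movies.csv (movieId, title, genres).
# Inlined for illustration; nothing is read from the real dataset.
movies = [
    (1, "Toy Story (1995)", "Adventure|Animation|Children|Comedy|Fantasy"),
    (2, "Jumanji (1995)", "Adventure|Children|Fantasy"),
    (3, "Heat (1995)", "Action|Crime|Thriller"),
]
# Toy rows in the shape of ratings.csv, reduced to (movieId, rating).
# These rating values are invented.
ratings = [(1, 4.0), (1, 5.0), (2, 3.0), (3, 4.5), (3, 3.5)]

# Counter: how many movies carry each genre.
genre_counts = Counter(g for _, _, genres in movies for g in genres.split("|"))

# groupby only groups adjacent items, so sort by the grouping key first.
by_movie = groupby(sorted(ratings, key=itemgetter(0)), key=itemgetter(0))
mean_rating = {}
for movie_id, group in by_movie:
    values = [rating for _, rating in group]
    # reduce is used here to show functools; sum(values) would do the same.
    mean_rating[movie_id] = reduce(lambda a, b: a + b, values) / len(values)

# defaultdict: an inverted index from genre to the titles that carry it.
titles_by_genre = defaultdict(list)
for _, title, genres in movies:
    for g in genres.split("|"):
        titles_by_genre[g].append(title)
```

The same three shapes (count, group-then-aggregate, invert) come up again in the Pandas, Spark, Dask and Polars notebooks, just with dataframe APIs instead of plain collections.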
01 Numpy
- NumPy vs Python Collections Notebook also
02 Pandas
- Wrangling MovieLens with Pandas - Part 1: Getting Started, Load the MovieLens dataset: Notebook also
- Wrangling MovieLens with Pandas - Part 2: Playing with the Movies and Ratings data: Notebook also
03 Spark
01 - Toy introduction to the basics
- 01 - Setting up Spark locally (on Windows): Notebook also
- 02 - How to run Apache Spark based notebooks in Google Colab: Notebook also
02 - A set of notebooks exploring data wrangling in depth using the MovieLens dataset
- Part 01: Overview, Starting Spark and Loading the data: Notebook or
- Part 02: Data Analysis basics using tags.csv from the MovieLens dataset: Notebook or
04 Dask
- Distributed Data Analysis with Dask - Part 1: Getting Started, Load the MovieLens dataset: Notebook also
- Distributed Data Analysis with Dask - Part 2: Playing with the Movies data: Notebook also
05 Polars
- Polars with the MovieLens dataset - Getting Started, Load the MovieLens dataset, A quick look at Arrow, and some analysis: Notebook also
06 Apache Arrow and DataFusion
- 01 - 10+ minutes to Arrow+DataFusion+Ballista [WIP]: Notebook also
07 Ray
- [WIP]
99 Static: The TPC Benchmark Queries
- [WIP]
Note
The "10+ minutes to XX" notebooks are just references, not to be run as actual workshop material. They are there to hold the toy examples that the "getting started" pages for XX carry. I have tried to ensure there's a 10+ minutes notebook for each data engineering library/framework considered here. While it may be interesting to go through these to quickly refresh the syntax and other idiosyncrasies, the actual data munging happens in other notebooks.
References
01 Numpy
- Numpy User Guide (v1.23 as of this writing)
- Numpy Tutorials
- NumPy Basics: Arrays and Vectorized Computation from Wes McKinney's Python for Data Analysis, 3E:
- Numpy is absurd
- 100 Numpy Exercises
- From Python to Numpy
02 Pandas
- Pandas (current stable version) User Guide
- 10 minutes to pandas
- Data Cleaning and Preparation from Wes McKinney's Python for Data Analysis, 3E:
- Data Wrangling: Join, Combine, and Reshape from Wes McKinney's Python for Data Analysis, 3E:
- Data Aggregation and Group Operations from Wes McKinney's Python for Data Analysis, 3E:
- Effective Pandas | Matt Harrison, also here
- ...also from Matt Harrison on GitHub: effective pandas (book) and idiomatic pandas tutorial
- Pandas Exercises
- 100 Pandas Puzzles
03 Spark
- Spark User Guide
- The Internals of Apache Spark online book
- PySpark User Guide
- This is also available as live binder notebooks:
- Spark SQL and Built-in Functions Reference
- weak references, some dated but interesting
- PySpark Cheatsheet
- The "Data Savvy" YouTube Channel
04 Dask
The approach is different: Dask focuses on task scheduling, whereas Spark builds on a Map-Reduce-style model.
- 10 minutes to Dask
- 90-minute Dask tutorial video
- Talks and tutorials page
- The Dask tutorial notebooks
- The SciPy 2022 tutorial talk
- Journey of a Task
- High level performance of Pandas, Dask, Spark, and Arrow - from Dask Working Notes Blog
- Dask distributed
- Dask Task Graphs
- Tornado - used by Dask distributed
- For some Dask exercises, we may need GraphViz or Cytoscape and ipycytoscape
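To make the "task scheduling" idea a little more concrete, here is a toy sketch in plain Python. It uses the graph shape that the Dask docs describe (a dict whose values are either literals or tuples with a callable first), but `toy_get` is a hypothetical helper written for this note: it is not Dask's scheduler, it runs nothing in parallel, and it assumes every key is a string.

```python
from operator import add, mul


def inc(x):
    return x + 1


# Keys map to literal values or to tasks. A task is a tuple whose first
# element is a callable and whose remaining elements are its arguments;
# an argument that names another key is a dependency.
graph = {
    "x": 1,
    "y": (inc, "x"),
    "z": (add, "y", 10),
    "w": (mul, "z", "y"),
}


def toy_get(dsk, key, cache=None):
    """Resolve `key` by recursively computing the tasks it depends on.

    Hypothetical, single-threaded stand-in for a scheduler. Assumes keys
    are strings and the graph has no cycles.
    """
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]
    node = dsk[key]
    if isinstance(node, tuple) and node and callable(node[0]):
        func, *args = node
        # Arguments that are keys of the graph get resolved first;
        # anything else is passed through as a literal.
        resolved = [
            toy_get(dsk, a, cache) if isinstance(a, str) and a in dsk else a
            for a in args
        ]
        result = func(*resolved)
    else:
        result = node
    cache[key] = result
    return result


result = toy_get(graph, "w")  # x=1, y=2, z=12, so w = 12 * 2
```

The point of the sketch is only that the work is described as a graph of small tasks first and executed afterwards; a real scheduler decides the order, placement and parallelism of those tasks.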
05 Polars
06 Arrow, Arrow DataFusion and Ballista
- Apache Arrow Official Native Rust Implementation
- pyArrow
- Apache Arrow Python Cookbook
- DataFusion User Guide
- Arrow DataFusion Python
- DataFusion Roadmap Epics
- Ballista on GitHub
- Arrow NumPy Integration
- Arrow Pandas Integration
07 Ray
- Ray Core
- Ray Dataset Quickstart
- Ray Data
- Ray with Spark
- Ray with Dask
Future State / Miscellany
Datasets we use:
- MovieLens 25M Dataset
- Wikipedia Movie Plots
- CMU Movie Summary Corpus also here
- MoviePlotEvents (CMU Movie Summary Corpus with Events) also here
There are a lot of interesting (interesting to me) tools, datasets and papers out there.
When there's time or need, we'll get to them as well.
- Arrow and pyArrow really warrant a deeper study, maybe as a gateway to Rust-based data processing. They are not really emerging anymore: a lot of very cool stuff is being done with Arrow and DataFusion, and it is very interesting to explore.
- Apache Arrow Ballista is looking very interesting from a next gen distributed processing PoV
- PRQL, on github and PRQL Query. Also the PRQL Book.
- Mars and Project Mars on GitHub
- Modin
- Polars. Also, Polars Github Repo
- DuckDB, GitHub
- FoundationDB, GitHub
- Danfo.js - pandas like dataframes in JavaScript
- Velox also GitHub and Gluten, also GitHub
- I think there's something to be said about leveraging TPC benchmarks - we'll attend to this in due time. There's got to be a .md readme in this repo that'll list all the queries anyway. Yea, lemme do that soonish.
- Is there value in comparing formats? [Parquet](https://parquet.apache.org/docs/), [Zarr](https://zarr.readthedocs.io/en/stable/tutorial.html) etc.?
- Papers and Data - Scifi TV Shows (Scifi TV Show Plot Summaries & Events)
- Papers and Data - Story Cloze
- State Of The Art on paperswithcode
- Only cause LLMs have been trending for a while - A Survey of Large Language Models
- SST (Stanford Sentiment Treebank), also
- ...
MOAR GIMME MOAR LINKS!!!
Kitchen sink of all other references I've found useful (or wonderful). There's so much to learn I tell you!
- How Query Engines Work
- Carnegie Mellon's Advanced Database Systems Playlist:
- Go here if the advanced database systems feels hard - CMU Intro to Database Systems (15-445/645 - Fall 2022), also course site
- Database Query Optimizers
- ¡Databases! – A Database Seminar Series (Fall 2022), also on CMU
- Hardware Accelerated Database Lectures (Fall 2018)
- Time Series Database Lectures (Fall 2017)
- The Databaseology Lectures (Fall 2015)
- Seven Databases in Seven Weeks (Fall 2014)
- This explanation for List Comprehensions