Awesome Pandas Alternatives 
A curated list of awesome alternative dataframe libraries in Python, along with resources to learn them.
Table of Contents
- Pandas Shortcomings
- Almost Under Every Hood: Apache Arrow
- In-Memory DataFrame Libraries
- Distributed Computing Libraries
- Row-Based Storage
- Columnar-Based Storage
- Using Rust
- Using Ray
- Higher Level APIs
- GPU Libraries
- Other Libraries and Ports from R
Pandas Shortcomings
We all love pandas, and most of us learnt data manipulation with this amazing library. However useful and great pandas is, it unfortunately has some well-known shortcomings that developers have been trying to address in recent years. Here are the most common weak points:
- You might read that `pandas` generally requires as much RAM as 5-10 times the dataset you are working with, mostly because many operations generate an in-memory copy of the data under the hood.
- `pandas` eagerly executes code, i.e. executes every statement sequentially. For example, if you read a `.csv` file and then filter it on a specific column (say, `col2 > 5`), the whole dataset will first be read into memory and only then will the subset you are interested in be returned. Of course, you could manually order `pandas` commands to improve efficiency, but one can only do so much. For this reason, many of the pandas alternatives implement lazy evaluation - i.e. they do not execute statements until a `.collect()` or `.compute()` method is called - and include a query execution engine to optimise the order of the operations (read more here; see also the sketch below).
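A minimal sketch of the eager/lazy difference, assuming a hypothetical local `data.csv` with a numeric column `col2`:

```python
import pandas as pd
import polars as pl

# pandas is eager: the whole file is read into memory, then filtered.
df = pd.read_csv("data.csv")
subset = df[df["col2"] > 5]

# polars is lazy: scan_csv() only builds a query plan, and nothing is
# executed until .collect(). The optimiser can push the filter down into
# the scan, so only matching rows ever need to reach memory.
lazy_subset = pl.scan_csv("data.csv").filter(pl.col("col2") > 5)
subset = lazy_subset.collect()
```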
This awesome repo aims to gather the libraries meant to overcome pandas' weaknesses, as well as some resources to learn them. Everyone is encouraged to add new projects or edit the descriptions to correct potential mistakes.
Almost Under Every Hood: Apache Arrow
Most of these libraries leverage Apache Arrow, "a language-independent columnar memory format". In other words, unlike good old .csv files that are stored by rows, Apache Arrow storage is (also) column-based. This allows partitioning the data in chunks with a lot of clever tricks to enable greater compression (like storing sequences of repeated values) and faster queries (because each chunk also stores metadata like the min or max value).
Arrow offers Python bindings through its Python API, named pyarrow. This library has modules to read and write data in Arrow formats (.parquet most notably, but also .feather) as well as other formats like .csv and .json. It can also read data from cloud storage services and operate in a streaming fashion, which means data is processed in batches and does not need to be read wholly into memory. The module pyarrow.compute provides basic data manipulation.
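A minimal sketch of pyarrow I/O and compute, again assuming a hypothetical `data.csv` with a numeric column `col2`:

```python
import pyarrow.csv as pv
import pyarrow.parquet as pq
import pyarrow.compute as pc

table = pv.read_csv("data.csv")         # read a .csv into an Arrow Table
pq.write_table(table, "data.parquet")   # write it back out as .parquet

# Basic manipulation with pyarrow.compute: keep rows where col2 > 5.
mask = pc.greater(table["col2"], 5)
subset = table.filter(mask)
```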
On its own, pyarrow is rarely used as a standalone library for data manipulation: usually more expressive and feature-rich libraries are built upon Arrow, especially on its fast C++ or Rust interfaces. For this reason, most of the libraries listed here will display a general landing page and links to the APIs for other languages (mostly Python and R). Notably, the R {arrow} interface has a backend for {dplyr} (the equivalent of pandas in R), which makes its use more straightforward. Development of the R {arrow} package is quite active!
In-Memory DataFrame Libraries
These libraries leverage the Apache Arrow memory format to implement a parallelised and lazy execution engine. They are designed to take advantage of all the cores (and threads) of a machine, but are mainly geared towards in-memory data (i.e., they offer read_csv()-like functions that read data into the memory of your machine, instead of processing it across cluster nodes).
- `polars` claims to be "a blazingly fast DataFrames library implemented in Rust using Apache Arrow Columnar Format as memory model". It leverages (quasi-)lazy evaluation, uses all cores, has multithreaded operations, and its query engine is written in Rust. `polars` has an expressive and high-level API that enables complex operations.
- `duckdb` is another fairly recent DataFrame library. It offers both a SQL interface and a Python API: in other words, it can be used to query `.csv` and Arrow's `.parquet` files, but also in-memory `pandas.DataFrame`s, using both SQL and a syntax closer to Python or `pandas`. It supports window functions. `duckdb` has a very nice blog that explains its optimisations under the hood: for example, here you can find an overview of how Apache's `.parquet` format works and the performance tricks used by the query engine to run several orders of magnitude faster than `pandas` (see also the sketch after this list).
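A minimal sketch of duckdb's Python API querying an in-memory pandas.DataFrame with SQL; the frame and column names are purely illustrative, and `duckdb.sql` assumes a reasonably recent duckdb release:

```python
import duckdb
import pandas as pd

df = pd.DataFrame({"col1": ["a", "b", "c"], "col2": [3, 7, 9]})

# duckdb can scan pandas DataFrames in the calling scope by name.
result = duckdb.sql("SELECT col1, col2 FROM df WHERE col2 > 5").df()
print(result)
```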
Distributed Computing Libraries
Compared to the libraries above, the following are meant to orchestrate data manipulation over clusters, i.e. distribute computation across several nodes (multiple machines with multiple processors) via a query execution engine. Since they need to plan execution preemptively, they can be slower than pandas on a single-core machine.
Row-Based Storage
These are the "first generation" of query planners that - as of now - are not built around columnar storage. Nonetheless, they represent the industry standard for distributed computing, and offer much more than data manipulation: they can even implement machine learning libraries to train models on the cluster. The downside is that moving from pandas to, say, Apache's spark is not straightforward, as the API syntax can differ.
- `dask` is among the first distributed computing DataFrame libraries, alongside `spark`. `dask` does not only offer a distributed computing equivalent of `pandas`, but also of `numpy` and `scikit-learn`, so that "you don't have to completely rewrite your code or retrain to scale up", i.e. to run code on distributed clusters. A Dask DataFrame "is a large parallel DataFrame composed of many smaller Pandas DataFrames, split along the index" (see the sketch after this list).
- `spark` is the de-facto leader of distributed computing, though it is now facing more competition. It offers `MLlib`, a machine learning library, as well as `GraphX` for "graphs and graph-parallel computation". Its Python API is `pyspark`.
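A minimal sketch of how closely dask.dataframe mirrors pandas, assuming hypothetical CSV files matching `data-*.csv` with a numeric column `col2`:

```python
import dask.dataframe as dd

# Each file becomes one or more pandas partitions, read lazily.
ddf = dd.read_csv("data-*.csv")

# Operations build a task graph; nothing runs until .compute().
result = ddf[ddf["col2"] > 5]["col2"].mean().compute()
print(result)
```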
Columnar-Based Storage
The "next generation" of distributed query planners, that leverage columnar-based storage.
Using Rust
- `arrow-datafusion` is Arrow's query engine, written in the Rust programming language. Much like `duckdb`, `datafusion` offers both a SQL and a Python-like API interface. Much like `pyspark`, `datafusion` "allows you to build a plan through SQL or a DataFrame API against in-memory data, parquet or CSV files, run it in a multi-threaded environment, and obtain the result back in Python" (see the sketch after this list).
- `ballista` is the distributed query engine, written in Rust, built on top of `arrow-datafusion`. Here is a comparison between `ballista` and `spark`. `ballista` will offer a Python client API.
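A minimal sketch of the datafusion Python bindings, assuming a recent version of the `datafusion` package and a hypothetical `data.parquet` file with a numeric column `col2`:

```python
from datafusion import SessionContext

ctx = SessionContext()
ctx.register_parquet("t", "data.parquet")

# Build a plan through SQL; execution is multi-threaded under the hood.
df = ctx.sql("SELECT col2 FROM t WHERE col2 > 5")
print(df.to_pandas())
```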
Using Ray
The following libraries leverage Ray as their default distributing engine. Ray is a Python API for building distributed applications (or, in technical jargon, "scaling your code") - i.e., it offers a series of functions to make your code run on multiple computing nodes. In their own words, Ray is a "cloud-provider independent compute launcher/autoscaler", i.e. it can be used seamlessly on cloud platforms such as Google Cloud, Amazon AWS and the like. Ray can be used to parallelise code - such as your own scripts - but also for every step of machine learning: hyperparameter tuning and model serving, for instance. The community has built many libraries to bridge Ray and libraries such as XGBoost, scikit-learn, lightGBM, PyTorch Lightning, Airflow, PyCaret and many more.
Ray itself has Datasets, a "loading and pre-processing library" to load and exchange data in Ray libraries and applications. As explained in the blogpost of the official launch, however, "Datasets is not intended as a replacement for generic data processing systems like Spark. Datasets is meant to be the last-mile bridge between ETL pipelines and distributed applications running on Ray". Under the hood, though, Ray Datasets still leverages Apache Arrow to store in-memory tables across the distributed cluster.
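A minimal sketch of what "scaling your code" with Ray looks like, run here on a single machine with local worker processes (the function is purely illustrative):

```python
import ray

ray.init()  # starts local workers; on a cluster, connects to it instead

@ray.remote
def square(x):
    return x * x

# Each call becomes a task scheduled across the available workers.
futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]
```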
- `modin` attempts to parallelize as much of the `pandas` API as possible. Its developers claim to "have worked through a significant portion of the DataFrame API", so much so that `modin` "is intended to be used as a drop-in replacement for pandas, such that even if the API is not yet parallelized, it is still defaulting to pandas" (see the sketch after this list). The library is currently under active development.
  - Under the hood, `modin` calls either `dask`, `ray` or OmniSciDB to automatically orchestrate the tasks across the cores of your machine or cluster. A good overview of what happens is here.
  - Offers a spreadsheet-like API to modify dataframes and a whole lot of other experimental features, like using SQL queries on DataFrames and even fitting XGBoost models using distributed computing.
  - Has excellent documentation, including explanatory articles on the differences between `modin` and `pandas` or `dask` (also here).
- `mars` is a "unified framework for large-scale data computation", like `dask`, but is tensor-based. It offers modules to scale `numpy`, `pandas`, `scikit-learn` and many other libraries.
Higher Level APIs
These are libraries that implement an abstraction layer to make pandas code easier to reuse across distributed frameworks, mainly dask and spark. It is to be noted that, in October 2021, pyspark adopted a pandas-like API.
fugueis "a unified interface for distributed computing that lets users execute Python, pandas, and SQL code on Spark and Dask without rewrites". Simply put,fugueadds an abstract layer that makes code portable between across differing computing frameworks.koalasis a wrapper aroundsparkthat offerspandas-like APIs. It is no longer developed, sincepysparkadopted thepandasAPI andkoalaswas merged intopyspark.
GPU Libraries
Libraries generally work on CPUs, and clusters are usually made up of CPU nodes. Apart from some notable exceptions, such as deep learning libraries like TensorFlow and PyTorch, regular libraries do not work on GPUs. This is due to major architectural differences between the two kinds of chips.
- `cuDF` is a GPU dataframe library, part of the RAPIDS framework, that enables "end-to-end data science and analytics pipelines entirely on GPUs". There are many other libraries in the ecosystem, like `cuML` for machine learning, `cuspatial` for spatial data manipulation, and more. `cuDF` is based on Apache Arrow, because the memory format is compatible with both CPU and GPU architectures (see the sketch after this list).
- `blazingSQL` is "a GPU accelerated [distributed] SQL engine built on top of the RAPIDS ecosystem" and, as such, leverages Apache Arrow. Think of it as Apache `spark` on GPUs.
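A minimal sketch of cuDF mirroring the pandas API, assuming a CUDA-capable GPU and a hypothetical `data.csv` with a numeric column `col2`:

```python
import cudf

gdf = cudf.read_csv("data.csv")              # the dataframe lives in GPU memory
print(gdf[gdf["col2"] > 5]["col2"].mean())   # computed on the GPU
```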
Other Libraries and Ports from R
R has an amazing library called {dplyr} that enables easy data manipulation. {dplyr} is part of the so-called {tidyverse}, a unified framework for data manipulation and visualisation.
- `pyjanitor` was originally conceived as a `pandas` extension of the well-known `{janitor}` R package. The latter was a package to clean strings in R `data.frame` or `tibble` objects with ease, but later incorporated new methods that make it more similar to `{dplyr}`. Adding and removing columns, for example, is easier with the dedicated methods `df.remove_column()` and `df.add_column()`, and so is renaming columns with `df.rename_column()`. This enables smoother pipelines that exploit method chaining (see the sketch after this list).
- `pydatatable` is a Python port of the astounding `{data.table}` library in R, which achieves impressive results thanks to parallelisation.
- `pandasql` allows querying `pandas.DataFrame`s using SQL syntax.
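A minimal sketch of pyjanitor's method chaining; importing janitor registers its extra methods on pandas DataFrames (the toy data is purely illustrative):

```python
import pandas as pd
import janitor  # noqa: F401  -- registers pyjanitor methods on DataFrames

df = (
    pd.DataFrame({"A Column": [1, 2], "Other": [3, 4]})
    .clean_names()                  # "A Column" -> "a_column"
    .rename_column("other", "b")    # pyjanitor's dedicated rename method
    .add_column("c", [5, 6])        # add a new column in the chain
)
print(df)
```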