Awesome Python Data Science
Probably the best curated list of data science software in Python
Contents
- Machine Learning
- Deep Learning
- Web Scraping
- Data Manipulation
- Feature Engineering
- Visualization
- Deployment
- Model Explanation
- Reinforcement Learning
- Probabilistic Methods
- Genetic Programming
- Optimization
- Time Series
- Natural Language Processing
- Computer Audition
- Computer Vision
- Statistics
- Distributed Computing
- Experimentation
- Evaluation
- Computations
- Spatial Analysis
- Quantum Computing
- Conversion
Machine Learning
General Purpose Machine Learning
- scikit-learn - Machine learning in Python.
- Shogun - Machine learning toolbox.
- xLearn - High Performance, Easy-to-use, and Scalable Machine Learning Package.
- cuML - RAPIDS Machine Learning Library.
- modAL - Modular active learning framework for Python3.
- Sparkit-learn - PySpark + scikit-learn = Sparkit-learn.
- mlpack - A scalable C++ machine learning library (Python bindings).
- dlib - Toolkit for making real world machine learning and data analysis applications in C++ (Python bindings).
- MLxtend - Extension and helper modules for Python's data analysis and machine learning libraries.
- hyperlearn - 50%+ Faster, 50%+ less RAM usage, GPU support re-written Sklearn, Statsmodels.
- Reproducible Experiment Platform (REP) - Machine Learning toolbox for Humans.
- scikit-multilearn - Multi-label classification for python.
- seqlearn - Sequence classification toolkit for Python.
- pystruct - Simple structured learning framework for Python.
- sklearn-expertsys - Highly interpretable classifiers for scikit learn.
- RuleFit - Implementation of the rulefit.
- metric-learn - Metric learning algorithms in Python.
- pyGAM - Generalized Additive Models in Python.
- Karate Club - An unsupervised machine learning library for graph structured data.
- Little Ball of Fur - A library for sampling graph structured data.
- causalml - Uplift modeling and causal inference with machine learning algorithms.
- Deepchecks - Validation & testing of ML models and data during model development, deployment, and production.
Automated Machine Learning
- TPOT - Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
- auto-sklearn - An automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator.
- MLBox - A powerful Automated Machine Learning python library.
Ensemble Methods
- ML-Ensemble - High performance ensemble learning.
- Stacking - Simple and useful stacking library, written in Python.
- stacked_generalization - Library for machine learning stacking generalization.
- vecstack - Python package for stacking (machine learning technique).
Imbalanced Datasets
- imbalanced-learn - Module to perform under sampling and over sampling with various techniques.
- imbalanced-algorithms - Python-based implementations of algorithms for learning on imbalanced data.
Random Forests
- rpforest - A forest of random projection trees.
- sklearn-random-bits-forest - Wrapper of the Random Bits Forest program written by (Wang et al., 2016).
- rgf_python - Python Wrapper of Regularized Greedy Forest.
Extreme Learning Machine
- Python-ELM - Extreme Learning Machine implementation in Python.
- Python Extreme Learning Machine (ELM) - A machine learning technique used for classification/regression tasks.
- hpelm - High performance implementation of Extreme Learning Machines (fast randomized neural networks).
Kernel Methods
- pyFM - Factorization machines in python.
- fastFM - A library for Factorization Machines.
- tffm - TensorFlow implementation of an arbitrary order Factorization Machine.
- liquidSVM - An implementation of SVMs.
- scikit-rvm - Relevance Vector Machine implementation using the scikit-learn API.
- ThunderSVM - A fast SVM Library on GPUs and CPUs.
Gradient Boosting
- XGBoost - Scalable, Portable and Distributed Gradient Boosting.
- LightGBM - A fast, distributed, high performance gradient boosting.
- CatBoost - An open-source gradient boosting on decision trees library.
- ThunderGBM - Fast GBDTs and Random Forests on GPUs.
Deep Learning
PyTorch
- PyTorch - Tensors and Dynamic neural networks in Python with strong GPU acceleration.
- torchvision - Datasets, Transforms and Models specific to Computer Vision.
- torchtext - Data loaders and abstractions for text and NLP.
- torchaudio - An audio library for PyTorch.
- ignite - High-level library to help with training neural networks in PyTorch.
- PyToune - A Keras-like framework and utilities for PyTorch.
- skorch - A scikit-learn compatible neural network library that wraps pytorch.
- PyTorchNet - An abstraction to train neural networks.
- pytorch_geometric - Geometric Deep Learning Extension Library for PyTorch.
- Catalyst - High-level utils for PyTorch DL & RL research.
- pytorch_geometric_temporal - Temporal Extension Library for PyTorch Geometric.
- ChemicalX - A PyTorch based deep learning library for drug pair scoring.
TensorFlow
- TensorFlow - Computation using data flow graphs for scalable machine learning by Google.
- TensorLayer - Deep Learning and Reinforcement Learning Library for Researcher and Engineer.
- TFLearn - Deep learning library featuring a higher-level API for TensorFlow.
- Sonnet - TensorFlow-based neural network library.
- tensorpack - A Neural Net Training Interface on TensorFlow.
- Polyaxon - A platform that helps you build, manage and monitor deep learning models.
- NeuPy - NeuPy is a Python library for Artificial Neural Networks and Deep Learning (previously: ).
- tfdeploy - Deploy tensorflow graphs for fast evaluation and export to tensorflow-less environments running numpy.
- tensorflow-upstream - TensorFlow ROCm port.
- TensorFlow Fold - Deep learning with dynamic computation graphs in TensorFlow.
- tensorlm - Wrapper library for text generation / language models at char and word level with RNN.
- TensorLight - A high-level framework for TensorFlow.
- Mesh TensorFlow - Model Parallelism Made Easier.
- Ludwig - A toolbox that lets you train and test deep learning models without writing code.
- Keras - A high-level neural networks API running on top of TensorFlow.
- keras-contrib - Keras community contributions.
- Hyperas - Keras + Hyperopt: A very simple wrapper for convenient hyperparameter optimization.
- Elephas - Distributed Deep learning with Keras & Spark.
- Hera - Train/evaluate a Keras model, get metrics streamed to a dashboard in your browser.
- Spektral - Deep learning on graphs.
- qkeras - A quantization deep learning library.
MXNet
- MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler.
- Gluon - A clear, concise, simple yet powerful and efficient API for deep learning (now included in MXNet).
- MXbox - Simple, efficient and flexible vision toolbox for mxnet framework.
- gluon-cv - Provides implementations of the state-of-the-art deep learning models in computer vision.
- gluon-nlp - NLP made easy.
- Xfer - Transfer Learning library for Deep Neural Networks.
- MXNet - HIP Port of MXNet.
Others
- Tangent - Source-to-Source Debuggable Derivatives in Pure Python.
- autograd - Efficiently computes derivatives of numpy code.
- Myia - Deep Learning framework (pre-alpha).
- nnabla - Neural Network Libraries by Sony.
- Caffe - A fast open framework for deep learning.
- hipCaffe - The HIP port of Caffe.
Web Scraping
- BeautifulSoup: The easiest library to scrape static websites for beginners
- Scrapy: Fast and extensible scraping library. You can write rules and build a customized scraper without touching the core.
- Selenium: Use Selenium Python API to access all functionalities of Selenium WebDriver in an intuitive way like a real user.
- Pattern: High-level scraping for well-established websites such as Google, Twitter, and Wikipedia. Also has NLP, machine learning algorithms, and visualization.
- twitterscraper: Efficient library to scrape Twitter.
Data Manipulation
Data Containers
- pandas - Powerful Python data analysis toolkit.
- pandas_profiling - Create HTML profiling reports from pandas DataFrame objects
- cuDF - GPU DataFrame Library.
- blaze - NumPy and pandas interface to Big Data.
- pandasql - Allows you to query pandas DataFrames using SQL syntax.
- pandas-gbq - pandas Google Big Query.
- xpandas - Universal 1d/2d data containers with Transformers functionality for data analysis by The Alan Turing Institute.
- pysparkling - A pure Python implementation of Apache Spark's RDD and DStream interfaces.
- Arctic - High performance datastore for time series and tick data.
- datatable - Data.table for Python.
- koalas - pandas API on Apache Spark.
- modin - Speed up your pandas workflows by changing a single line of code.
- swifter - A package which efficiently applies any function to a pandas dataframe or series in the fastest available manner.
- pandas_flavor - A package that lets you write your own flavor of pandas easily.
- pandas-log - A package that provides feedback about basic pandas operations and helps find both business logic and performance issues.
- vaex - Out-of-Core DataFrames for Python, ML, visualize and explore big tabular data at a billion rows per second.
Pipelines
- pdpipe - Easy pipelines for pandas DataFrames.
- SSPipe - Python pipe (|) operator with support for DataFrames and Numpy and Pytorch.
- pandas-ply - Functional data manipulation for pandas.
- Dplython - Dplyr for Python.
- sklearn-pandas - pandas integration with sklearn.
- Dataset - Helps you conveniently work with random or sequential batches of your data and define data processing.
- pyjanitor - Clean APIs for data cleaning.
- meza - A Python toolkit for processing tabular data.
- Prodmodel - Build system for data science pipelines.
- dopanda - Hints and tips for using pandas in an analysis environment.
- CircleCI: Automates your software builds, tests, and deployments.
Feature Engineering
General
- Featuretools - Automated feature engineering.
- skl-groups - A scikit-learn addon to operate on set/"group"-based features.
- Feature Forge - A set of tools for creating and testing machine learning features.
- few - A feature engineering wrapper for sklearn.
- scikit-mdr - A sklearn-compatible Python implementation of Multifactor Dimensionality Reduction (MDR) for feature construction.
- tsfresh - Automatic extraction of relevant features from time series.
Feature Selection
- scikit-feature - Feature selection repository in python.
- boruta_py - Implementations of the Boruta all-relevant feature selection method.
- BoostARoota - A fast xgboost feature selection algorithm.
- scikit-rebate - A scikit-learn-compatible Python implementation of ReBATE, a suite of Relief-based feature selection algorithms for Machine Learning.
Visualization
General Purposes
- Matplotlib - Plotting with Python.
- seaborn - Statistical data visualization using matplotlib.
- prettyplotlib - Painlessly create beautiful matplotlib plots.
- python-ternary - Ternary plotting library for python with matplotlib.
- missingno - Missing data visualization module for Python.
- chartify - Python library that makes it easy for data scientists to create charts.
- physt - Improved histograms.
Interactive plots
- animatplot - A python package for animating plots built on matplotlib.
- plotly - A Python library that makes interactive and publication-quality graphs.
- Bokeh - Interactive Web Plotting for Python.
- Altair - Declarative statistical visualization library for Python. Many data transformations can be done within the code that creates the graph.
- bqplot - Plotting library for IPython/Jupyter notebooks
- pyecharts - Migrated from Echarts, a charting and visualization library, to Python's interactive visual drawing library.
Map
- folium - Makes it easy to visualize data on an interactive open street map
- geemap - Python package for interactive mapping with Google Earth Engine (GEE)
Automatic Plotting
- HoloViews - Stop plotting your data - annotate your data and let it visualize itself.
- AutoViz: Visualize data automatically with 1 line of code (ideal for machine learning)
- SweetViz: Visualize and compare datasets, target values and associations, with one line of code.
NLP
- pyLDAvis: Interactive topic model visualization.
Deployment
- datapane - A collection of APIs to turn scripts and notebooks into interactive reports.
- binder - Enables sharing and executing Jupyter Notebooks.
- fastapi - Modern, fast (high-performance) web framework for building APIs with Python.
- streamlit - Makes it easy to deploy machine learning models.
Model Explanation
- Shapley - A data-driven framework to quantify the value of classifiers in a machine learning ensemble.
- Alibi - Algorithms for monitoring and explaining machine learning models.
- anchor - Code for "High-Precision Model-Agnostic Explanations" paper.
- aequitas - Bias and Fairness Audit Toolkit.
- Contrastive Explanation - Contrastive Explanation (Foil Trees).
- yellowbrick - Visual analysis and diagnostic tools to facilitate machine learning model selection.
- scikit-plot - An intuitive library to add plotting functionality to scikit-learn objects.
- shap - A unified approach to explain the output of any machine learning model.
- ELI5 - A library for debugging/inspecting machine learning classifiers and explaining their predictions.
- Lime - Explaining the predictions of any machine learning classifier.
- FairML - A Python toolbox for auditing machine learning models for bias.
- L2X - Code for replicating the experiments in the paper Learning to Explain: An Information-Theoretic Perspective on Model Interpretation.
- PDPbox - Partial dependence plot toolbox.
- pyBreakDown - Python implementation of R package breakDown.
- PyCEbox - Python Individual Conditional Expectation Plot Toolbox.
- Skater - Python Library for Model Interpretation.
- model-analysis - Model analysis tools for TensorFlow.
- themis-ml - A library that implements fairness-aware machine learning algorithms.
- treeinterpreter - Interpreting scikit-learn's decision tree and random forest predictions.
- AI Explainability 360 - Interpretability and explainability of data and machine learning models.
- Auralisation - Auralisation of learned features in CNN (for audio).
- CapsNet-Visualization - A visualization of the CapsNet layers to better understand how it works.
- lucid - A collection of infrastructure and tools for research in neural network interpretability.
- Netron - Visualizer for deep learning and machine learning models (no Python code, but visualizes models from most Python Deep Learning frameworks).
- FlashLight - Visualization Tool for your NeuralNetwork.
- tensorboard-pytorch - Tensorboard for pytorch (and chainer, mxnet, numpy, ...).
- mxboard - Logging MXNet data for visualization in TensorBoard.
Reinforcement Learning
- OpenAI Gym - A toolkit for developing and comparing reinforcement learning algorithms.
- Coach - Easy experimentation with state of the art Reinforcement Learning algorithms.
- garage - A toolkit for reproducible reinforcement learning research.
- OpenAI Baselines - High-quality implementations of reinforcement learning algorithms.
- Stable Baselines - A set of improved implementations of reinforcement learning algorithms based on OpenAI Baselines.
- RLlib - Scalable Reinforcement Learning.
- Horizon - A platform for Applied Reinforcement Learning.
- TF-Agents - A library for Reinforcement Learning in TensorFlow.
- TensorForce - A TensorFlow library for applied reinforcement learning.
- TRFL - TensorFlow Reinforcement Learning.
- Dopamine - A research framework for fast prototyping of reinforcement learning algorithms.
- keras-rl - Deep Reinforcement Learning for Keras.
- ChainerRL - A deep reinforcement learning library built on top of Chainer.
Probabilistic Methods
- pyro - A flexible, scalable deep probabilistic programming library built on PyTorch.
- pomegranate - Probabilistic and graphical models for Python.
- ZhuSuan - Bayesian Deep Learning.
- PyMC - Bayesian Stochastic Modelling in Python.
- InferPy - Deep Probabilistic Modelling Made Easy.
- GPflow - Gaussian processes in TensorFlow.
- PyStan - Bayesian inference using the No-U-Turn sampler (Python interface).
- sklearn-bayes - Python package for Bayesian Machine Learning with scikit-learn API.
- pgmpy - A python library for working with Probabilistic Graphical Models.
- skpro - Supervised domain-agnostic prediction framework for probabilistic modelling by The Alan Turing Institute.
- PtStat - Probabilistic Programming and Statistical Inference in PyTorch.
- PyVarInf - Bayesian Deep Learning methods with Variational Inference for PyTorch.
- emcee - The Python ensemble sampling toolkit for affine-invariant MCMC.
- hsmmlearn - A library for hidden semi-Markov models with explicit durations.
- pyhsmm - Bayesian inference in HSMMs and HMMs.
- GPyTorch - A highly efficient and modular implementation of Gaussian Processes in PyTorch.
- MXFusion - Modular Probabilistic Programming on MXNet.
- sklearn-crfsuite - A scikit-learn inspired API for CRFsuite.
Genetic Programming
- gplearn - Genetic Programming in Python.
- DEAP - Distributed Evolutionary Algorithms in Python.
- karoo_gp - A Genetic Programming platform for Python with GPU support.
- monkeys - A strongly-typed genetic programming framework for Python.
- sklearn-genetic - Genetic feature selection module for scikit-learn.
Optimization
- Spearmint - Bayesian optimization.
- BoTorch - Bayesian optimization in PyTorch.
- scikit-opt - Heuristic Algorithms for optimization.
- SMAC3 - Sequential Model-based Algorithm Configuration.
- Optunity - A library containing various optimizers for hyperparameter tuning.
- hyperopt - Distributed Asynchronous Hyperparameter Optimization in Python.
- hyperopt-sklearn - Hyper-parameter optimization for sklearn.
- sklearn-deap - Use evolutionary algorithms instead of gridsearch in scikit-learn.
- sigopt_sklearn - SigOpt wrappers for scikit-learn methods.
- Bayesian Optimization - A Python implementation of global optimization with gaussian processes.
- SafeOpt - Safe Bayesian Optimization.
- scikit-optimize - Sequential model-based optimization with a scipy.optimize interface.
- Solid - A comprehensive gradient-free optimization framework written in Python.
- PySwarms - A research toolkit for particle swarm optimization in Python.
- Platypus - A Free and Open Source Python Library for Multiobjective Optimization.
- GPflowOpt - Bayesian Optimization using GPflow.
- POT - Python Optimal Transport library.
- Talos - Hyperparameter Optimization for Keras Models.
- nlopt - Library for nonlinear optimization (global and local, constrained or unconstrained).
Time Series
- sktime - A unified framework for machine learning with time series.
- darts - A python library for easy manipulation and forecasting of time series.
- statsforecast - Lightning fast forecasting with statistical and econometric models.
- mlforecast - Scalable machine learning based time series forecasting.
- neuralforecast - Scalable neural network based time series forecasting.
- tslearn - Machine learning toolkit dedicated to time-series data.
- tick - Module for statistical learning, with a particular emphasis on time-dependent modelling.
- greykite - A flexible, intuitive and fast forecasting library.
- Prophet - Automatic Forecasting Procedure.
- PyFlux - Open source time series library for Python.
- bayesloop - Probabilistic programming framework that facilitates objective model selection for time-varying parameter models.
- luminol - Anomaly Detection and Correlation library.
- dateutil - Powerful extensions to the standard datetime module
- maya - Makes it easy to parse date strings and change timezones.
- Chaos Genius - ML powered analytics engine for outlier/anomaly detection and root cause analysis
Natural Language Processing
- NLTK - Modules, data sets, and tutorials supporting research and development in Natural Language Processing.
- CLTK - The Classical Language Toolkit.
- gensim - Topic Modelling for Humans.
- PSI-Toolkit - A natural language processing toolkit.
- pyMorfologik - Python binding for Morfologik.
- skift - Scikit-learn wrappers for Python fastText.
- Phonemizer - Simple text to phonemes converter for multiple languages.
- flair - Very simple framework for state-of-the-art NLP.
- spaCy - Industrial-Strength Natural Language Processing.
Computer Audition
- librosa - Python library for audio and music analysis.
- Yaafe - Audio features extraction.
- aubio - A library for audio and music analysis.
- Essentia - Library for audio and music analysis, description and synthesis.
- LibXtract - A simple, portable, lightweight library of audio feature extraction functions.
- Marsyas - Music Analysis, Retrieval and Synthesis for Audio Signals.
- muda - A library for augmenting annotated audio data.
- madmom - Python audio and music signal processing library.
Computer Vision
- OpenCV - Open Source Computer Vision Library.
- scikit-image - Image Processing SciKit (Toolbox for SciPy).
- imgaug - Image augmentation for machine learning experiments.
- imgaug_extension - Additional augmentations for imgaug.
- Augmentor - Image augmentation library in Python for machine learning.
- albumentations - Fast image augmentation library and easy to use wrapper around other libraries.
Statistics
- pandas_summary - Extension to pandas dataframes describe function.
- Pandas Profiling - Create HTML profiling reports from pandas DataFrame objects.
- statsmodels - Statistical modeling and econometrics in Python.
- stockstats - Supplies a wrapper StockDataFrame based on the pandas.DataFrame with inline stock statistics/indicators support.
- weightedcalcs - A pandas-based utility to calculate weighted means, medians, distributions, standard deviations, and more.
- scikit-posthocs - Pairwise Multiple Comparisons Post-hoc Tests.
- Alphalens - Performance analysis of predictive (alpha) stock factors.
Distributed Computing
- Horovod - Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
- PySpark - Exposes the Spark programming model to Python.
- Veles - Distributed machine learning platform.
- Jubatus - Framework and Library for Distributed Online Machine Learning.
- DMTK - Microsoft Distributed Machine Learning Toolkit.
- PaddlePaddle - PArallel Distributed Deep LEarning.
- dask-ml - Distributed and parallel machine learning.
- Distributed - Distributed computation in Python.
Experimentation
- envd - 🏕️ machine learning development environment for data science and AI/ML engineering teams.
- Sacred - A tool to help you configure, organize, log and reproduce experiments.
- Xcessiv - A web-based application for quick, scalable, and automated hyperparameter tuning and stacked ensembling.
- Persimmon - A visual dataflow programming language for sklearn.
- Ax - Adaptive Experimentation Platform.
- Neptune - A lightweight ML experiment tracking, results visualization and management tool.
Evaluation
- recmetrics - Library of useful metrics and plots for evaluating recommender systems.
- Metrics - Machine learning evaluation metrics.
- sklearn-evaluation - Model evaluation made easy: plots, tables and markdown reports.
- AI Fairness 360 - Fairness metrics for datasets and ML models, explanations and algorithms to mitigate bias in datasets and models.
Computations
- numpy - The fundamental package needed for scientific computing with Python.
- Dask - Parallel computing with task scheduling.
- bottleneck - Fast NumPy array functions written in C.
- CuPy - NumPy-like API accelerated with CUDA.
- scikit-tensor - Python library for multilinear algebra and tensor factorizations.
- numdifftools - Solve automatic numerical differentiation problems in one or more variables.
- quaternion - Add built-in support for quaternions to numpy.
- adaptive - Tools for adaptive and parallel sampling of mathematical functions.
Spatial Analysis
Quantum Computing
- PennyLane - Quantum machine learning, automatic differentiation, and optimization of hybrid quantum-classical computations.
- QML - A Python Toolkit for Quantum Machine Learning.
Conversion
- sklearn-porter - Transpile trained scikit-learn estimators to C, Java, JavaScript and others.
- ONNX - Open Neural Network Exchange.
- MMdnn - A set of tools to help users inter-operate among different deep learning frameworks.
Contributing
Contributions are welcome! :sunglasses: Read the contribution guideline.
License
This work is licensed under the Creative Commons Attribution 4.0 International License - CC BY 4.0
Deprecated Libs
Waiting Room