
ML-Tech-Cheatsheet 📄

Personal "cheatsheet" repository for my ideal machine learning tech-stack. I use this repository to play around and familiarize myself with ML libraries, advanced git and GitHub features, virtualization and so on 🤓.

Table of Contents 📜

IDEs plugins 🧰

VSCode
PyCharm

Machine Learning Libraries 🤖

The classics
Pytorch, Lightning and W&Bs
xformers
transformers
DeepSpeed
Colossal-AI
spaCy
nvidia-ml-py3
albumentations
augly
einops
bitsandbytes
vLLM
SkyPilot

Scientific Libraries

Hydra
SciencePlots

Environments 🌎

libraries
conda
Docker

CLI Utilities 👨‍💻

High Performance Computing 🦾

slurm

Git 🐱

Protected Branches
Tags and Releases
LFS
Hidden Directory
CircleCI
GitHub Actions
GitHub Pages
Others

Web development 🌐

Prototyping
Frontend
Backend
APIs
Database
Devops

IDEs plugins

VSCode

Python
RainbowCSV
Remote
Copilot
GitLens
Docker
Jupyter
Gitignore
vscode-pdf

PyCharm

GitToolBox
Copilot
Docker

Machine Learning Libraries

The classics

NumPy - Math operations, manipulations, linear algebra and more.
Pandas - Tabular data management.
Matplotlib and Seaborn - All sorts of plots.
OpenCV, Pillow, and scikit-image - Image manipulation.

Pytorch, Lightning and W&Bs

PyTorch is currently the reference ML framework for Python.

Weights and Biases (W&B) makes it easy to track experiments, performance, parameters and so on in a single place.

PyTorch Lightning gets rid of most of the usual PyTorch boilerplate code, like train/val/test loops, backward and optimizer steps and so on. It also makes it easy to use powerful PyTorch features and other libraries (like W&B) by setting just a few optional parameters here and there.

xformers

xformers lets you define transformer architectures easily. It also ships optimized implementations of recent techniques, such as memory-efficient attention.

transformers

HuggingFace🤗 makes it easy to download, fine-tune and deploy pre-trained transformer models across a multitude of applications. Models and datasets can also be shared on the platform, as well as "spaces", which are interactive live demos of the capabilities of the created models.

Related libraries:

Datasets provides efficient loading of custom or common datasets (even ones hosted online).
Diffusers is HuggingFace🤗's package for diffusion models specifically. It comes with pre-trained SOTA models for vision and audio generation.
timm provides a multitude of pre-trained and vanilla image models.
Safetensors is HuggingFace🤗's package for storing tensors safely (unlike pickle files, loading them cannot execute arbitrary code).
accelerate lets the same PyTorch training code run on whatever devices are available, handling device placement automatically.
optimum provides multiple features to accelerate training and inference.
tokenizers provides fast implementations of popular tokenization algorithms.
evaluate makes it possible to evaluate and compare trained models.
peft (Parameter-Efficient Fine-Tuning) provides implementations of algorithms like LoRA, which speed up fine-tuning while reducing memory consumption.
xformers provides optimized implementations of many operations used in transformers (e.g. Memory Efficient Attention).
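
For the LoRA technique mentioned under peft, the memory saving comes from freezing the pre-trained weight and training only a low-rank update. The sketch below illustrates that idea with plain NumPy; it is not peft's API, and the sizes, the rank and the scaling value are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only, not taken from any real model.
d_out, d_in, rank, alpha = 64, 128, 4, 8

W = rng.normal(size=(d_out, d_in))             # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(rank, d_in))  # trainable, small random init
B = np.zeros((d_out, rank))                    # trainable, zero init: the update starts at 0


def lora_forward(x: np.ndarray) -> np.ndarray:
    """Apply W plus the scaled low-rank update B @ A without building the full matrix."""
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T


x = rng.normal(size=(2, d_in))
y = lora_forward(x)

full_params = W.size            # parameters touched by full fine-tuning
lora_params = A.size + B.size   # parameters trained with the low-rank update
print(f"full: {full_params}, low-rank: {lora_params}")
```

Only A and B would receive gradients, so optimizer state is kept for far fewer parameters than with full fine-tuning.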

DeepSpeed

DeepSpeed allows for distributed high-performance and efficient training. DeepSpeed is supported in PyTorch Lightning.

Colossal-AI

Colossal-AI is a framework that improves the efficiency and speed of large model training, especially for HPC clusters.

spaCy

spaCy offers a multitude of features and pre-trained pipelines for NLP tasks (like HuggingFace, but just for NLP).

nvidia-ml-py3

This library gives access to information about NVIDIA GPUs directly from Python code.

albumentations

albumentations implements all sorts of popular image augmentations, such as ColorJitter, ZoomBlur and Gaussian noise.

augly

Data augmentation library for text, sound, image and video.

einops

Manipulation of tensors (reshaping, concatenating, ...) with einops is extremely intuitive and time-saving.

bitsandbytes

bitsandbytes provides 8-bit optimizers and low-bit quantization, which reduce memory usage during training. Particularly useful to fine-tune very large models.
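
To give a feeling for what 8-bit precision means, here is a simplified sketch of absmax quantization in NumPy. It is an illustration of the general idea only: bitsandbytes uses more elaborate schemes, and the function names below are hypothetical, not part of its API.

```python
from typing import Tuple

import numpy as np


def quantize_absmax(x: np.ndarray) -> Tuple[np.ndarray, float]:
    """Map floats to int8 using one scale derived from the largest magnitude."""
    scale = float(np.max(np.abs(x))) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero input: any scale works
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float values from the int8 representation."""
    return q.astype(np.float32) * scale


weights = np.random.default_rng(0).normal(size=(256, 256)).astype(np.float32)
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)

max_error = float(np.max(np.abs(weights - restored)))
print(f"bytes: {weights.nbytes} -> {q.nbytes}, max abs error: {max_error:.4f}")
```

The int8 copy takes a quarter of the float32 memory, at the cost of a rounding error bounded by half the scale.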

vLLM

vLLM is a high-level library to efficiently run inference of LLMs.

SkyPilot

SkyPilot makes it easy to run LLM inference and other workloads on any cloud platform (Google Cloud, AWS, Azure, ...).

Scientific libraries

Hydra

Hydra lets you manage multiple configurations smoothly and override them from the command line. Similar to jsonargparse and LightningCLI.

SciencePlots

SciencePlots provides Matplotlib styles that make plots look much nicer than the classic Matplotlib and seaborn defaults.

Environments

libraries

python-dotenv reads environment variables from a file.
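
What reading a .env file amounts to can be sketched with the standard library alone. The parser below is a simplified stand-in written for illustration, not python-dotenv's implementation (in practice you would call its load_dotenv function); the variable names and values in the sample file are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def read_env_file(path: Path) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw in path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding quotes, if any.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    # Hypothetical settings, written here only so the example is self-contained.
    env_path.write_text('# experiment settings\nWANDB_PROJECT=demo\nDATA_DIR="/tmp/data"\n')
    settings = read_env_file(env_path)

# Variables already set in the environment take precedence over the file.
for key, value in settings.items():
    os.environ.setdefault(key, value)

print(settings)
```

Keeping secrets and machine-specific paths in such a file (excluded from version control) keeps them out of the code.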

yacs manages configurations such as hyperparameters for experiments.

poetry provides easy dependency management and packaging.

conda

Conda makes it easy to create and share virtual environments. The command conda env export > environment.yml creates a .yml file that can be used to recreate the same virtual environment with conda env create -f environment.yml.

Docker

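Docker packages an application together with its dependencies into a container, an isolated environment that shares the host's kernel rather than emulating a whole operating system.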

CLI Utilities

Terminals

Hyper.js, Alacritty and Kitty are among the most popular terminals on r/unixporn and are cross-platform.

iTerm2 is a macOS-only terminal emulator with lots of functionality.

Oh My Zsh is available on Unix-like machines. It provides Zsh plug-ins and themes.

tmate lets you share a terminal session over SSH with a machine that is not directly reachable from the internet. A sort of TeamViewer for SSH.

rich is a Python library to create amazing-looking CLIs.

yabai together with skhd gives a nice window manager-like experience on macOS.

Commands and utils

~/.ssh/config and ~/.ssh/authorized_keys ➡️ Files that define host aliases and authorized SSH keys.
nvidia-smi ➡️ Check the current status of NVIDIA cards.
ps, top, htop ➡️ Check currently running processes.
bpytop ➡️ Like htop, but better.
nvitop ➡️ Like nvidia-smi, but better.
tmux ➡️ Terminal multiplexer; makes it easy to detach from running jobs and reattach later.
Fig ➡️ IntelliSense-style completion (and much more) for command-line commands.
sshfs ➡️ Mounts file systems over SSH.
ranger ➡️ CLI file browser, with optional image previews on terminals like kitty or with w3m installed.

High Performance Computing 🦾

slurm

HPC clusters typically use a cluster management and job scheduling tool. Slurm schedules jobs, handles priorities, manages partitions and much more. Cheatsheet files for Slurm are under the /slurm folder. The library submitit makes it possible to switch seamlessly between executing on Slurm and locally.
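
As a rough illustration of what a Slurm job looks like, the sketch below renders a batch script from Python. The helper name, the partition name "gpu" and the train.py command are assumptions made for the example; partitions and resource limits are specific to each cluster. The resulting file would be submitted with sbatch.

```python
def render_sbatch(
    job_name: str,
    command: str,
    partition: str = "gpu",        # hypothetical partition name, cluster-specific
    gpus: int = 1,
    time_limit: str = "01:00:00",  # HH:MM:SS wall-clock limit
) -> str:
    """Build the text of a Slurm batch script for a single command."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --gres=gpu:{gpus}",
        f"#SBATCH --time={time_limit}",
        f"#SBATCH --output={job_name}_%j.out",  # %j is replaced by the job id
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


# train.py is a placeholder for your own entry point.
script = render_sbatch("train-demo", "python train.py")
print(script)
```

Generating the script from code keeps resource requests next to the experiment configuration instead of scattered across hand-edited shell files.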

Git

Taking the time to go through most of GitHub's Documentation at least once is very important. Here are a few features to keep in mind.

Protected Branches

Protected branches restrict direct pushes to selected branches, for example by requiring pull requests or passing status checks first.

Tags and Releases

Important commits can be tagged. Then, jumping to a tagged commit is as easy as:

git checkout <tag-name>

LFS

Git Large File Storage (LFS) makes it possible to push bigger files to the GitHub repository. Careful: there is a usage quota per GitHub account that applies across repositories.

Hidden Directory

The .github directory helps keep the landing page of the GitHub repository "clean" and includes:

CONTRIBUTING.md ➡️ Guidelines to contribute to the repository.
ISSUE_TEMPLATE.md ➡️ Template for issues.
PULL_REQUEST_TEMPLATE.md ➡️ Template for pull requests.
README.md ➡️ Repository's README (i.e. this) file.
workflows ➡️ Directory which contains .yaml files for GitHub Actions.

CircleCI

CircleCI hosts CI/CD pipelines and workflows, similarly to GitHub Actions.

GitHub Actions

GitHub Actions runs custom workflows automatically when triggered by events (pull requests, pushes, opened issues, ...).

GitHub Pages

GitHub Pages can host a website for each GitHub repository.

Others

GitBook makes it simple to create documentation starting from a GitHub repository.

Pre-commit lets you create customized pre-commit hooks to, e.g., run formatting or testing before committing. Some nice things to include there:

Black formats Python files in a consistent, PEP 8-compliant style.
autopep8 automatically formats files to be compliant with PEP 8.
yapf is like autopep8, but uses a search algorithm to find the best possible formatting.
isort automatically sorts import statements in Python files.
flake8 wraps other tools to check for Python errors (pyflakes), style violations (pycodestyle) and more.
pylint, similarly to flake8, analyzes the code and checks for errors without actually running it.
ruff is yet another Python linter that can replace isort, flake8 and autoflake. It is also extremely fast.
mypy is a static type checker for Python. It supports gradual typing, so type annotations can be added to regular Python code step by step.

Shields.io lets you put neat badges in README files, such as the number of stars of the repository.

Web development

I find it extremely satisfying to build an actual prototype or product out of a Machine Learning project. Here are my favourite options:

Prototyping

To quickly create interactive apps based on trained machine learning models, gradio and streamlit are among the most popular frameworks. While it is easy to prototype using these frameworks, more complex applications are better built with a more complete stack. Figma is currently the best tool I could find to design an app / website.

Frontend

On the frontend, Next.js is one of the most popular frameworks. It builds on top of the React library and provides additional functionalities and optimizations. Tailwind CSS allows for easy styling through utility classes, without the need for hand-written CSS style sheets. Chakra UI comes with pre-built and nice-looking components. It also offers support for dark mode.

Backend

Since we are interested in Machine Learning applications, it makes sense to pick a python backend.

FastAPI is a Python backend framework that is extremely simple to set up and highly optimized for speed. Django and Flask are more popular frameworks. Django is a full-stack framework meant for big projects with a clearly defined structure, whereas Flask is lightweight and meant for smaller projects.

APIs

Auth0 allows for authentication and authorization. Stripe is a popular tool to deal with payments. Testing APIs is easily done with Postman.

Database

MySQL, PostgreSQL, Redis and MongoDB are all very valid and popular databases.

PostgreSQL is preferable over MySQL for its better support for JSON data. Redis is a key-value database, which is very fast and useful for caching. MongoDB is a document-oriented database, which is very flexible and easy to use.
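
The caching use case mentioned for Redis usually follows the cache-aside pattern: look in the cache first, fall back to the slow computation, then store the result with an expiry. The sketch below shows the pattern with an in-memory class standing in for Redis so that it runs without a server; TTLCache, slow_prediction and cached_prediction are hypothetical names, and a real application would use a Redis client instead.

```python
import time


class TTLCache:
    """In-memory stand-in for a key-value store with expiring keys."""

    def __init__(self):
        self._store = {}

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired: behave as if the key was never set
            return None
        return value


calls = {"count": 0}


def slow_prediction(text: str) -> str:
    """Placeholder for an expensive model call."""
    calls["count"] += 1
    return text.upper()


cache = TTLCache()


def cached_prediction(text: str, ttl_seconds: float = 60.0) -> str:
    hit = cache.get(text)
    if hit is not None:
        return hit                       # cache hit: skip the expensive call
    result = slow_prediction(text)
    cache.set(text, result, ttl_seconds)
    return result


first = cached_prediction("hello")
second = cached_prediction("hello")      # served from the cache
print(first, second, calls["count"])
```

The expiry keeps stale results from living forever, which matters when the underlying model or data changes.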

Prisma is a Node.js database toolkit compatible with MySQL, PostgreSQL, SQLite and SQL Server. It makes it easy to create and manage databases.

Devops

Applications can be hosted on a number of services. Heroku, DigitalOcean, AWS, Google Cloud and Microsoft Azure are among the most popular solutions.

Bonus 🎁

Here are a few things that are not really ML-related but that I use in my work environment and find worth mentioning.

Gnome-look.org offers a variety of themes for Linux machines. My personal favourite is the Orchis GTK theme.

Window managers let you customize the look and feel of the desktop environment while making development more efficient (the idea is that you should never take your hands off the keyboard). I use i3, which is one of the most popular window managers for Linux.

Iriun lets you use an iPhone or iPad as a webcam for a Linux or Windows machine, while UxPlay mirrors the screen of iPhone and iPad devices. Both are super useful for presentations, meetings, recording videos and so on.

Notion is possibly the best note-taking app out there. Full stop.

Clockify allows you to track the time spent on different projects. It is useful to stay aware of your productivity.

A few very helpful Chrome extensions are: NordPass, Acrobat Reader, and Grammarly.
