ML-Tech-Cheatsheet
"Cheatsheet" repository which uses advanced libraries and features.
ML-Tech-Cheatsheet 📄
Personal "cheatsheet" repository for my ideal machine learning tech-stack. I use this repository to play around and familiarize myself with ML libraries, advanced git and GitHub features, virtualization and so on 🤓.
Table of Contents 📜
IDEs plugins 🧰
- VSCode
- PyCharm
Machine Learning Libraries 🤖
- The classics
- PyTorch, Lightning and W&B
- xformers
- transformers
- DeepSpeed
- Colossal-AI
- spaCy
- nvidia-ml-py3
- albumentations
- augly
- einops
- bitsandbytes
- vLLM
- SkyPilot
Scientific Libraries
- Hydra
- SciencePlots
Environments 🌎
- libraries
- conda
- Docker
CLI Utilities 👨💻
High Performance Computing 🦾
- slurm
Git 🐱
- Protected Branches
- Tags and Releases
- LFS
- Hidden Directory
- CircleCI
- GitHub Actions
- GitHub Pages
- Others
Web development 🌐
- Prototyping
- Frontend
- Backend
- APIs
- Database
- DevOps
IDEs plugins
VSCode
- Python
- RainbowCSV
- Remote
- CoPilot
- GitLens
- Docker
- Jupyter
- Gitignore
- vscode-pdf
PyCharm
- GitToolBox
- CoPilot
- Docker
Machine Learning Libraries
The classics
- NumPy - Math operations, manipulations, linear algebra and more.
- Pandas - Tabular data management.
- Matplotlib and Seaborn - All sorts of plots.
- OpenCV, Pillow, and scikit-image - Image manipulation.
PyTorch, Lightning and W&B
PyTorch is currently the reference ML framework for Python.
Weights and Biases (W&B) makes it easy to track experiments, performance, parameters and so on in a single place.
PyTorch Lightning gets rid of most of the usual PyTorch boilerplate code, such as train/val/test loops and backward and optimizer steps. It also makes it easy to use powerful PyTorch features and other libraries (like W&B) by passing just a few optional parameters here and there.
xformers
xformers provides composable building blocks for defining transformer architectures, along with optimized implementations of recent techniques.
transformers
HuggingFace🤗 makes it easy to download, fine-tune and deploy pre-trained transformer models across a multitude of applications. Models and datasets can also be shared on the platform, as well as "spaces", which are interactive live demos of the capabilities of the created models.
Related libraries:
- Datasets provides efficient loading of custom or common datasets (even from online sources).
- Diffusers is HuggingFace's 🤗 package for diffusion models specifically. It comes with pre-trained SOTA models for vision and audio generation.
- timm provides a multitude of pre-trained and vanilla image models.
- Safetensors is HuggingFace's 🤗 package for storing tensors safely (unlike pickle files, which can execute arbitrary code when loaded).
- accelerate automatically handles device placement for PyTorch training, so the same code runs on CPU, a single GPU or multiple GPUs.
- optimum provides multiple features to accelerate training and inference.
- tokenizers provides fast implementations of popular tokenization algorithms.
- evaluate provides metrics to evaluate and compare trained models.
- peft (Parameter-Efficient Fine-Tuning) provides implementations of algorithms like LoRA, which speed up fine-tuning while reducing memory consumption.
- xformers provides optimized implementations of the operations used in transformers (e.g. Memory Efficient Attention).
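To make the memory saving behind LoRA (mentioned for peft above) concrete, here is a short worked parameter count. The dimensions used in the example (d = k = 4096, r = 8) are illustrative assumptions, not values taken from this repository.

```latex
% LoRA keeps the pre-trained weight W_0 frozen and learns a low-rank update.
W = W_0 + \frac{\alpha}{r} B A,
\qquad W_0 \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)

% Trainable parameters per adapted weight matrix:
\underbrace{d\,k}_{\text{full fine-tuning}}
\quad \text{vs.} \quad
\underbrace{r\,(d + k)}_{\text{LoRA}}

% Example with d = k = 4096 and r = 8:
4096 \cdot 4096 = 16\,777\,216
\quad \text{vs.} \quad
8 \cdot (4096 + 4096) = 65\,536 \;\approx\; 0.39\%
```

Only B and A receive gradients and optimizer state, which is where most of the memory saving comes from.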
DeepSpeed
DeepSpeed enables distributed, high-performance and efficient training. It is supported in PyTorch Lightning.
Colossal-AI
Colossal-AI is a framework that improves the efficiency and speed of large model training, especially for HPC clusters.
spaCy
spaCy offers a multitude of features and pre-trained pipelines for NLP tasks (like HuggingFace, but just for NLP).
nvidia-ml-py3
This library gives access to information about NVIDIA GPUs directly from Python code.
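As a rough illustration of the kind of query this enables, the sketch below collects the memory usage of each visible GPU. It assumes the package exposes the pynvml module; the helper name gpu_memory_report and the shape of the returned dictionaries are made up for this example. On machines without the package or without an NVIDIA driver it simply returns an empty list.

```python
def gpu_memory_report():
    """Return one dict per visible NVIDIA GPU, or [] if NVML is unavailable."""
    try:
        import pynvml  # module assumed to be provided by nvidia-ml-py3
    except ImportError:
        return []

    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        # No NVIDIA driver (or no GPU) on this machine.
        return []

    try:
        report = []
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            memory = pynvml.nvmlDeviceGetMemoryInfo(handle)  # values in bytes
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):  # older bindings return bytes
                name = name.decode()
            report.append({
                "index": index,
                "name": name,
                "used_mib": memory.used // 2**20,
                "total_mib": memory.total // 2**20,
            })
        return report
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    for gpu in gpu_memory_report():
        print(f"GPU {gpu['index']} ({gpu['name']}): "
              f"{gpu['used_mib']} / {gpu['total_mib']} MiB")
```

This is handy for logging GPU memory next to training metrics instead of watching nvidia-smi in a separate terminal.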
albumentations
albumentations implements all sorts of popular image augmentations, such as ColorJitter, ZoomBlur and Gaussian noise.
augly
Data augmentation library for text, sound, image and video.
einops
Manipulation of tensors (reshaping, concatenating, ...) with einops is extremely intuitive and time-saving.
bitsandbytes
bitsandbytes makes it possible to train using 8-bit precision. This is particularly useful for fine-tuning very large models.
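To give an intuition for what "8-bit precision" means, here is a minimal pure-Python sketch of absmax quantization: floats are mapped to integer codes in [-127, 127] plus a single scale factor. This is not the bitsandbytes API and is much simpler than the schemes the library actually uses; the function names and the sample weights are assumptions made for illustration.

```python
def quantize_absmax(values):
    """Map floats to int8-range codes in [-127, 127] plus one float scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.02, -1.27, 0.5, 0.875]  # hypothetical weights
codes, scale = quantize_absmax(weights)
restored = dequantize(codes, scale)

for original, approx in zip(weights, restored):
    print(f"{original:+.4f} -> {approx:+.4f}")
```

Each value now needs one byte instead of four (float32), at the cost of a rounding error of at most half the scale per value.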
vLLM
vLLM is a high-level library for running LLM inference efficiently.
SkyPilot
SkyPilot makes it easy to run LLM inference and more on the major cloud platforms (Google, AWS, Azure, ...).
Scientific libraries
Hydra
Hydra makes it easy to compose multiple configurations and to override them from the command line. It is similar to jsonargparse and LightningCLI.
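The core idea of command-line overrides of the form key.subkey=value can be sketched in plain Python. This is not Hydra's API, only a simplified stand-in to show the concept; apply_overrides, parse_value and the configuration keys are hypothetical names.

```python
import copy


def parse_value(text):
    """Interpret booleans, ints and floats; leave anything else as a string."""
    if text.lower() in ("true", "false"):
        return text.lower() == "true"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def apply_overrides(config, overrides):
    """Return a copy of `config` with `a.b=value` style overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for part in parents:
            node = node.setdefault(part, {})  # create missing sections
        node[leaf] = parse_value(raw_value)
    return result


defaults = {
    "model": {"name": "resnet18", "dropout": 0.1},
    "trainer": {"max_epochs": 10},
}
config = apply_overrides(defaults, ["model.dropout=0.3", "trainer.max_epochs=50"])
print(config)
```

With Hydra itself, the defaults would typically live in YAML files and the overrides would be passed as arguments when launching the script.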
SciencePlots
SciencePlots provides Matplotlib styles that produce much nicer plots than the Matplotlib and Seaborn defaults.
Environments
libraries
python-dotenv reads environment variables from a file (typically .env).
yacs manages configurations such as hyperparameters for experiments.
poetry provides easy dependency management and packaging.
conda
Conda makes it easy to create and share virtual environments. The command
conda env export > environment.yml
creates a .yml file that can be used to create an identical virtual environment.
Docker
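Docker packages an application together with its dependencies and user-space environment into an isolated container. Unlike a virtual machine, a container shares the host's kernel rather than emulating a whole operating system.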
CLI Utilities
Terminals
Hyper.js, Alacritty and Kitty are among the most popular terminals on r/unixporn and are cross-platform.
iTerm2 is a macOS-only terminal emulator with lots of functionalities.
Oh My Zsh is available on Unix-like machines. It provides terminal plug-ins and themes.
tmate lets you connect via SSH to machines that are not directly reachable from the internet. A sort of TeamViewer for SSH.
rich is a library to create amazing looking CLIs.
yabai together with skhd provides a tiling window manager-like experience on macOS.
Commands and utils
- ~/.ssh/config and ~/.ssh/authorized_keys ➡️ Files to define known host names and authorized SSH keys.
- nvidia-smi ➡️ Check the current status of NVIDIA cards.
- ps, top, htop ➡️ Check currently running processes.
- bpytop ➡️ Like htop, but better.
- nvitop ➡️ Like nvidia-smi, but better.
- tmux ➡️ Terminal multiplexer, makes it easy to detach jobs.
- Fig ➡️ IntelliSense (and much more) for command line commands.
- sshfs ➡️ Mounts file systems over SSH.
- ranger ➡️ CLI file browser with optional image preview on terminals like kitty, or by installing w3m.
High Performance Computing 🦾
slurm
HPC clusters typically use a cluster management and job scheduling tool. Slurm makes it possible to schedule jobs, handle priorities, define partitions and much more. Cheatsheet files for Slurm are under the /slurm folder. The submitit library allows switching seamlessly between executing on Slurm and locally.
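As a minimal sketch of what a Slurm job submission looks like, the helper below generates a batch script from Python. The function name, the partition name "gpu", the script train.py and the chosen resources are assumptions for illustration; the #SBATCH options themselves are standard sbatch flags.

```python
def build_sbatch_script(job_name, command, partition, gpus=1, hours=1, log_dir="logs"):
    """Return the text of a Slurm batch script that runs `command`."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --gres=gpu:{gpus}",
        f"#SBATCH --time={hours:02d}:00:00",
        # %x expands to the job name and %j to the job id.
        f"#SBATCH --output={log_dir}/%x-%j.out",
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


# Hypothetical job: one GPU for one hour on a partition called "gpu".
script = build_sbatch_script("train-resnet", "python train.py", partition="gpu")
print(script)
```

Writing the result to a file such as train.sh and running sbatch train.sh queues the job; squeue shows its state and scancel removes it.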
Git
Taking the time to go through most of GitHub's documentation at least once is very important. Here are a few features to keep in mind.
Protected Branches
Protected branches restrict how code can be pushed to selected branches, e.g. by requiring pull request reviews before merging.
Tags and Releases
Important commits can be tagged. Then, jumping to a tagged commit is as easy as:
git checkout <tag-name>
LFS
Git Large File Storage (LFS) makes it possible to push bigger files to the GitHub repository. Careful: there is a global usage quota per GitHub account that applies across repositories.
Hidden Directory
The .github directory helps keep the landing page of the GitHub repository "clean" and includes:
- CONTRIBUTING.md ➡️ Guidelines to contribute to the repository.
- ISSUE_TEMPLATE.md ➡️ Template for issues.
- PULL_REQUEST_TEMPLATE.md ➡️ Template for pull requests.
- README.md ➡️ Repository's README (i.e. this) file.
- workflows ➡️ Directory which contains .yaml files for GitHub Actions.
CircleCI
CircleCI hosts CI/CD pipelines and workflows, similarly to GitHub Actions.
GitHub Actions
GitHub Actions executes custom actions automatically when triggered by events (pull requests, pushes, issues opened, ...).
GitHub Pages
GitHub Pages makes it possible to host a webpage for each GitHub repository.
Others
GitBook makes it simple to create documentation starting from a GitHub repository.
Pre-commit makes it possible to create customized pre-commit hooks to, e.g., run formatting or testing before committing. Some nice things to include there:
- Black formats Python files in compliance with PEP 8.
- autopep8 automatically formats files to be compliant with PEP 8.
- yapf is like autopep8, but with a search algorithm for the best possible formatting.
- isort automatically sorts the import statements in Python files.
- flake8 wraps other tools to check for Python errors (pyflakes), PEP 8 style violations (pycodestyle) and more.
- pylint, similarly to flake8, analyzes the code and checks for errors without actually running it.
- ruff is yet another Python linter that can replace isort, flake8 and autoflake. It is also extremely fast.
- mypy is a static type checker: it verifies type annotations without running the code, so static typing can be added to regular Python gradually.
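To show what a static type checker from the list above buys you, here is a small annotated example. The function names and data are hypothetical. The code runs as ordinary Python, and a checker such as mypy can additionally verify the annotations without executing anything.

```python
from typing import Optional


def find_user_age(ages: dict[str, int], name: str) -> Optional[int]:
    """Return the age for `name`, or None when the user is unknown."""
    return ages.get(name)


def years_until(ages: dict[str, int], name: str, target: int) -> int:
    """Return how many years `name` still has to go to reach `target`."""
    age = find_user_age(ages, name)
    # Without this check a type checker would flag `target - age`,
    # because `age` may be None at this point.
    if age is None:
        raise KeyError(name)
    return target - age


print(years_until({"ada": 36}, "ada", 40))  # prints 4
```

Running the checker as a pre-commit hook catches this class of mistake before the code is ever executed.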
Shields.io lets you put neat badges in README files, for example counters showing repository statistics.
Web development
I find it extremely satisfying to build an actual prototype or product out of a Machine Learning project. Here are my favourite options:
Prototyping
To quickly create interactive apps based on trained machine learning models, gradio and streamlit are among the most popular frameworks. While it is easy to prototype using these frameworks, more complex applications are better built with a more complete stack. Figma is currently the best tool I could find to design an app / website.
Frontend
On the frontend, Next.js is one of the most popular frameworks. It builds on top of React and provides additional functionalities and optimizations. Tailwind CSS allows for easy styling without the need for separate CSS style sheets. Chakra UI comes with pre-built and nice-looking components. It also offers support for dark mode.
Backend
Since we are interested in Machine Learning applications, it makes sense to pick a Python backend.
FastAPI is a Python backend framework that is extremely simple to set up and highly optimized for speed. Django and Flask are more popular frameworks. Django is a full-stack framework meant for big projects with a clearly defined structure, whereas Flask is lightweight and meant for smaller projects.
APIs
Auth0 allows for authentication and authorization. Stripe is a popular tool to deal with payments. Testing APIs is easily done with Postman.
Database
MySQL, PostgreSQL, Redis and MongoDB are all very valid and popular databases.
PostgreSQL is preferable over MySQL for its better support for JSON data. Redis is a key-value database, which is very fast and useful for caching. MongoDB is a document-oriented database, which is very flexible and easy to use.
Prisma is a Node.js database toolkit compatible with MySQL, PostgreSQL, SQLite and SQL Server. It makes it easy to create and manage databases.
DevOps
Applications can be hosted on a number of services. Heroku, DigitalOcean, AWS, Google Cloud and Microsoft Azure are among the most popular solutions.
Bonus 🎁
Here are a few things that are not really ML-related but that I use in my work environment and find worth mentioning.
Gnome-look.org offers a variety of themes for Linux machines. My personal favourite is the Orchis GTK theme.
Window managers make it possible to customize the look and feel of the desktop environment while making development more efficient (the idea is that you should never take your hands off the keyboard). I use i3, which is one of the most popular window managers for Linux.
Iriun lets you use an iPhone or iPad as a webcam for a Linux or Windows machine, while UxPlay provides screen mirroring of iPhone and iPad devices. Both are super useful for presentations, meetings, recording videos and so on.
Notion is possibly the best note-taking app out there. Full stop.
Clockify allows you to track the time spent on different projects. It is useful to stay aware of your productivity.
A few very helpful Chrome extensions are: NordPass, Acrobat Reader, and Grammarly.