vit.cpp
Inference of Vision Transformer (ViT) models in plain C/C++ with ggml
The objective of this project is to provide a C++ inference engine for Vision Transformer (ViT) models using ggml, with a focus on performance on edge devices.
The implementation is intended to be lightweight and self-contained so that it can run on different platforms.
Per-device optimizations are possible, and quantization techniques will be added soon.
[This is a work in progress]
Vision Transformer architecture
The implemented architecture is based on the original Vision Transformer described in "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (Dosovitskiy et al.).
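For example, with the vit_tiny_patch16_384 model used below, each 384x384 input image is split into (384/16)^2 = 576 non-overlapping 16x16 patches; a [CLS] token is prepended, giving a sequence of 577 tokens that is fed to the transformer encoder.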
Convert PyTorch to GGUF
# install torch and timm
pip install torch timm
# list available models if needed
# note that not all models are supported
python convert-pth-to-ggml.py --list
# convert the weights to gguf: ViT tiny with a patch size of 16 and an image size of 384,
# pre-trained on ImageNet-21k and fine-tuned on ImageNet-1k
python convert-pth-to-ggml.py --model_name vit_tiny_patch16_384.augreg_in21k_ft_in1k --ftype 1
Build
Simple build
# build ggml and vit
mkdir build && cd build
cmake .. && make -j4
# run inference
./bin/vit -t 4 -m ../ggml-model-f16.gguf -i ../assets/tench.jpg
The optimal number of threads depends on many factors, and more is not always better; setting the thread count equal to the number of available physical cores usually gives the best speed.
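For example, on Linux the core count can be queried and passed directly (note that nproc reports logical cores, so on machines with SMT you may want half that number):

./bin/vit -t $(nproc) -m ../ggml-model-f16.gguf -i ../assets/tench.jpg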
Per device optimizations
Generate per-device instructions that work best for the given machine rather than generic CPU instructions. This can be done by specifying -march=native in the compiler flags (see the example below).
- Multi-threading and vectorization
- Loop transformations (unrolling)
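For example, the flag can be passed at configure time (a sketch; it could equally be added directly in the CMakeLists file):

cmake .. -DCMAKE_C_FLAGS="-march=native" -DCMAKE_CXX_FLAGS="-march=native"
make -j4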
For AMD host processors
You can use a specialized compiler released by AMD to make full use of your specific processor's architecture. Read more here: AMD Optimizing C/C++ and Fortran Compilers (AOCC).
You can follow the given instructions to install the AOCC compiler.
Note: for my AMD Ryzen™ 7 3700U the improvements were not very significant, but more recent processors could see a gain from using a specialized compiler.
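Once AOCC is installed and its clang/clang++ binaries are on the PATH, the build can be pointed at them like this (a sketch, assuming a standard AOCC installation):

cmake .. -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++
make -j4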
Using OpenMP
Additionally, you can compile with OpenMP by passing the -fopenmp flag to the compiler in the CMakeLists file, which allows multithreaded runs. Make sure to also enable multiple threads when running, e.g.:
OMP_NUM_THREADS=4 ./bin/vit -t 4 -m ../ggml-model-f16.gguf -i ../assets/tench.jpg
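If OpenMP is not already wired into the CMakeLists file, one way to enable it is at configure time (a sketch; the linker flag ensures the OpenMP runtime gets linked as well):

cmake .. -DCMAKE_C_FLAGS="-fopenmp" -DCMAKE_CXX_FLAGS="-fopenmp" -DCMAKE_EXE_LINKER_FLAGS="-fopenmp"
make -j4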
Run
usage: ./bin/vit [options]

options:
  -h, --help               show this help message and exit
  -s SEED, --seed SEED     RNG seed (default: -1)
  -t N, --threads N        number of threads to use during computation (default: 4)
  -m FNAME, --model FNAME  model path (default: ../ggml-model-f16.bin)
  -i FNAME, --inp FNAME    input file (default: ../assets/tench.jpg)
  -e FLOAT, --epsilon      epsilon (default: 0.000001)
Benchmark against PyTorch
First experiments on an Apple M1 show inference speedups (up to 6x faster for the base model) compared to native PyTorch inference. Extensive experiments will be conducted to verify this, and a comparison with ONNX models will be added as well.
To-Do List
- [ ] Implement bicubic interpolation:
  For now the image resizing is done with bilinear interpolation, but the models were trained with bicubic interpolation; this could result in a loss of accuracy (see the sketch after this list).
- [ ] Add quantization
  - [ ] 8-bit
  - [ ] 4-bit
- [ ] Test the inference
  - [✔] Run inference on a sample image
  - [✔] Compare with PyTorch output
  - [✔] Benchmark inference speed vs. PyTorch for different model sizes
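As a reference for the bicubic item above, here is a minimal, untested C++ sketch of the cubic convolution kernel (Keys, a = -0.5) that bicubic resizing is usually built on; the function names are illustrative and not part of this codebase:

#include <algorithm>
#include <cmath>

// Cubic convolution kernel (Keys, 1981) with a = -0.5, the variant commonly
// used for bicubic image resizing.
static float cubic_weight(float x) {
    const float a = -0.5f;
    x = std::fabs(x);
    if (x <= 1.0f) return (a + 2.0f)*x*x*x - (a + 3.0f)*x*x + 1.0f;
    if (x <  2.0f) return a*x*x*x - 5.0f*a*x*x + 8.0f*a*x - 4.0f*a;
    return 0.0f;
}

// Sample one channel of an image at a fractional position (fx, fy) by mixing
// the surrounding 4x4 neighborhood with cubic weights (borders are clamped).
static float bicubic_sample(const float * img, int w, int h, float fx, float fy) {
    const int x0 = (int) std::floor(fx);
    const int y0 = (int) std::floor(fy);
    float acc = 0.0f, norm = 0.0f;
    for (int dy = -1; dy <= 2; ++dy) {
        for (int dx = -1; dx <= 2; ++dx) {
            const int x = std::clamp(x0 + dx, 0, w - 1);
            const int y = std::clamp(y0 + dy, 0, h - 1);
            const float wgt = cubic_weight(fx - (x0 + dx)) * cubic_weight(fy - (y0 + dy));
            acc  += wgt * img[y*w + x];
            norm += wgt;
        }
    }
    return acc / norm;
}

Resizing an image would then amount to evaluating bicubic_sample at the source coordinates corresponding to each destination pixel.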
Done
- [✔] Image preprocessing
  - [✔] Load the image from a file name
  - [✔] Create image patches
- [✔] Convert the PyTorch weights
  - [✔] Use ggml tensor format to load the params
  - [✔] Validate the weights
- [✔] Create a ViT object
  - [✔] Create a config to hold hparams
  - [✔] Create a ViT struct
  - [✔] ViT Encoder
  - [✔] ViT Embeddings
  - [✔] Patch Embeddings
  - [✔] [CLS] token
  - [✔] Positional Encodings
  - [✔] Transformer Encoder
  - [✔] Layer Norm
  - [✔] Self Attention
  - [✔] MLP
  - [✔] Pooling
  - [✔] ViT Embeddings
  - [✔] Classifier
  - [✔] ViT Encoder
This project was highly inspired by the following projects: