Add example which implements Vision Transformer (ViT) image classification

Open staghado opened this issue 2 years ago • 1 comments

I have been working on implementing a ViT model using ggml here: vit.cpp. It is still a work in progress, but most of the work is done. It is heavily inspired by the SAM example.

The conversion script supports models from the timm library and converts them to ggml format. It works for different model sizes: tiny, small, base and huge.

Here is an example inference (tiny model):

Input: magpie

$ time ./bin/vit -t 4 -m ../ggml-model-f16.bin -i ../assets/magpie.jpeg 
main: seed = 1700560612
main: n_threads = 4 / 8
vit_model_load: loading model from '../ggml-model-f16.bin' - please wait
vit_model_load: hidden_size            = 192
vit_model_load: num_hidden_layers      = 12
vit_model_load: num_attention_heads    = 3
vit_model_load: patch_size             = 16
vit_model_load: img_size               = 384
vit_model_load: num_classes            = 1000
vit_model_load: ftype                  = 1
vit_model_load: qntvr                  = 0
operator(): ggml ctx size =  11.41 MB
vit_model_load: ................... done
vit_model_load: model size =    11.32 MB / num tensors = 152
main: loaded image '../assets/magpie.jpeg' (500 x 470)
vit_image_preprocess: scale = 1.302083
processed, out dims : (384 x 384)
main: Initialized context = 3145728 bytes
main: 
 Prediction = 18,
 Label = magpie,
 Probability = 0.865306


main:    load time       =    15.65 ms
main:    processing time =   427.55 ms
main:    total time      =   443.20 ms

real	0m0,451s
user	0m1,506s
sys	0m0,020s
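
For readers following the log, a few derived quantities can be sketched from the printed hyperparameters. This is a minimal illustration and not code from vit.cpp: the struct and function names are hypothetical, the extra class token is assumed from the standard ViT layout, and the scale formula (longest image side divided by `img_size`) is an assumption that happens to match the logged `scale = 1.302083` for the 500 x 470 input.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical container for the values printed by vit_model_load in the log.
struct vit_hparams_sketch {
    int hidden_size         = 192;
    int num_attention_heads = 3;
    int patch_size          = 16;
    int img_size            = 384;
};

// Number of non-overlapping square patches the image is cut into.
inline int num_patches(const vit_hparams_sketch & hp) {
    assert(hp.img_size % hp.patch_size == 0);  // patches must tile the image exactly
    const int per_side = hp.img_size / hp.patch_size;
    return per_side * per_side;
}

// Transformer sequence length, ASSUMING one class token is prepended
// (the standard ViT layout; not confirmed for vit.cpp by the log).
inline int seq_len_with_cls(const vit_hparams_sketch & hp) {
    return num_patches(hp) + 1;
}

// Width of each attention head.
inline int head_dim(const vit_hparams_sketch & hp) {
    assert(hp.hidden_size % hp.num_attention_heads == 0);
    return hp.hidden_size / hp.num_attention_heads;
}

// ASSUMED preprocessing scale: longest input side divided by the target size.
inline float preprocess_scale(int width, int height, const vit_hparams_sketch & hp) {
    return static_cast<float>(std::max(width, height)) / static_cast<float>(hp.img_size);
}
```

With the tiny model's values, `num_patches` gives 576 patches (24 per side), `seq_len_with_cls` gives 577 under the class-token assumption, `head_dim` gives 64, and `preprocess_scale(500, 470, hp)` gives roughly 1.302083.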

It would be great if it could be added to the examples for the community to try!

Nov 21 '23 10:11 staghado

Thanks! Looks interesting - will give it a try tomorrow and share it around

Nov 21 '23 21:11 ggerganov