ggml
Add an example that implements Vision Transformer (ViT) image classification
I have been working on a ViT implementation using ggml here: vit.cpp. It is still a WIP, but most of the work is done. It is heavily inspired by the SAM example.
The conversion script takes models from the timm library and converts them to the ggml format. It works for different model sizes: tiny, small, base and huge.
Here is an example inference run (tiny model):
Command and output:
$ time ./bin/vit -t 4 -m ../ggml-model-f16.bin -i ../assets/magpie.jpeg
main: seed = 1700560612
main: n_threads = 4 / 8
vit_model_load: loading model from '../ggml-model-f16.bin' - please wait
vit_model_load: hidden_size = 192
vit_model_load: num_hidden_layers = 12
vit_model_load: num_attention_heads = 3
vit_model_load: patch_size = 16
vit_model_load: img_size = 384
vit_model_load: num_classes = 1000
vit_model_load: ftype = 1
vit_model_load: qntvr = 0
operator(): ggml ctx size = 11.41 MB
vit_model_load: ................... done
vit_model_load: model size = 11.32 MB / num tensors = 152
main: loaded image '../assets/magpie.jpeg' (500 x 470)
vit_image_preprocess: scale = 1.302083
processed, out dims : (384 x 384)
main: Initialized context = 3145728 bytes
main: Prediction = 18, Label = magpie, Probability = 0.865306
main: load time = 15.65 ms
main: processing time = 427.55 ms
main: total time = 443.20 ms
real 0m0,451s
user 0m1,506s
sys 0m0,020s
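For anyone reading the log, here is a small sketch of how the printed hyperparameters relate to each other. It is not code from vit.cpp: the struct and function names are hypothetical, and two things are assumed rather than taken from the log, namely that a single class token is prepended to the patch tokens (as in the standard ViT design) and that the reported scale is the longest image side divided by img_size (500 / 384 happens to match the printed 1.302083).

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical holder for the values printed by vit_model_load in the log.
struct VitHparams {
    int hidden_size         = 192;
    int num_attention_heads = 3;
    int patch_size          = 16;
    int img_size            = 384;
};

// Number of non-overlapping patch_size x patch_size patches in the square input.
int num_patches(const VitHparams & hp) {
    assert(hp.img_size % hp.patch_size == 0);
    const int per_side = hp.img_size / hp.patch_size;
    return per_side * per_side;
}

// Token count seen by the encoder, ASSUMING one class token is prepended.
int sequence_length(const VitHparams & hp) {
    return num_patches(hp) + 1;
}

// Width of each attention head.
int head_dim(const VitHparams & hp) {
    assert(hp.hidden_size % hp.num_attention_heads == 0);
    return hp.hidden_size / hp.num_attention_heads;
}

// ASSUMPTION: scale = longest input side / img_size. This only reproduces the
// number in the log; the actual preprocessing code may compute it differently.
float preprocess_scale(int width, int height, const VitHparams & hp) {
    return static_cast<float>(std::max(width, height)) /
           static_cast<float>(hp.img_size);
}
```

With the tiny model's values this gives 24 x 24 = 576 patches, 577 tokens under the class-token assumption, and a head width of 64.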
It would be great if it could be added to the examples for the community to try!
Thanks! Looks interesting - will give it a try tomorrow and share it around