
llama : second attempt to refactor vision API

Open ngxson opened this issue 11 months ago • 22 comments

Fix #8010

Supersede #9687

To test this, please refer to #9687 to convert the model to GGUF.

Then,

cmake --build build -j --target llama-vision
./build/bin/llama-vision -m ../models/llava-1.5-7b-hf/model.gguf --image ../models/bliss.png

# The image showcases a lush green field with a hill in the background. In the foreground, there is a large,
# bright, and vibrant green field with a Microsoft Windows XP desktop screen, possibly representing a
# screensaver, superimposed onto the scene. The field is expansive and covers most of

Goals of this PR:

  • [ ] Have the first version of public API for llama_vision
  • [x] Support llava, mobilevlm, minicpm-v 2.6, smolVLM
  • [ ] See how the API can adapt to encoder-decoder models like llama 3.2 vision (so we can add it soon)
  • [ ] Add an API to format the chat, equivalent to the Processor class in the HF library
  • [ ] See how quantizing affects the performance

Things that will be done in follow-up PRs:

  • Models with encoder-decoder arch like llama 3.2 vision
  • GPU support
  • Better image processing: a faster resize function, maybe even abstract out the image transformations and optimize them (example: if we run resize twice, better to detect that and only run it once)
  • Further clean up the mess in convert-hf-to-gguf python script

ngxson avatar Jan 18 '25 19:01 ngxson

Hi @ggerganov @slaren , I would like to ask for an early review from you before proceeding further.

What will be interesting to discuss here is the usage of the new API, as demonstrated in the newly added llama-vision example. The idea is:

  • Call llama_vision_encode for each image (we don't support batching for now, to simplify the implementation)
  • Then, get the output embedding ggml_tensor, add it to a llama_batch, and llama_decode it (see the sketch below).
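In user code, the sequence looks roughly like this (the types and signatures below are illustrative and may change before this is merged; llama_vision_img is a placeholder type name):

// Sketch only - exact types/signatures are not final.
// Feeds one image into the language model at position n_past on sequence seq_id.
static int32_t eval_image(llama_context * lctx, llama_vision_context * vctx,
                          const llama_vision_img * img, // placeholder type name
                          llama_pos n_past, llama_seq_id seq_id) {
    // encode a single image (batching is intentionally unsupported for now)
    if (llama_vision_encode(vctx, img) != 0) {
        return -1;
    }
    // retrieve the projected image embeddings as a ggml tensor
    struct ggml_tensor * img_embd = llama_vision_get_output_tensor(vctx);
    // wrap the tensor in a batch at the correct position/sequence, then run the decoder
    llama_batch batch = llama_batch_get_one_from_tensor(img_embd, n_past, seq_id);
    return llama_decode(lctx, batch);
}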

I'm already able to get llava and mobilevlm working with llama-vision and convert_hf_to_gguf.py (for minicpm-v, I'm still struggling because the conversion is not straightforward).

Things that are different from the initial discussion in #8010 :

  • I added a helper function llama_batch_get_one_from_tensor for creating the batch from a tensor, with appropriate n_past (for placing these tokens in the correct place in chat template), and seq_id for future usage in server.
  • llama_vision_patches actually contains slices of image, not patches, as explained in llava-uhd. The patches are actually produced in clip_image_build_graph by doing a ggml_conv_2d. I think I'll need to rename it to llama_vision_slices, but I actually prefer a more appropriate name like llama_vision_preprocessed_img since we do more than just slicing it (i.e. resize, padding, etc) - feel free to suggest if you have any ideas.

And things that are still messy and will need more work:

  1. Naming: most functions are still prefixed with clip_ and I don't know if I should prefix everything with llama_vision_clip_ or not. Please let me know your preference.
  2. Chat template support: we may need to introduce a new API that wraps llama_chat_apply_template, much like how transformers has a Processor class that wraps around the Tokenizer
  3. Not sure how this API will be adapted for encoder-decoder arch like llama 3.2 vision. In theory, llama_vision_get_output_tensor should become a no-op, but judging from this implementation, it's still needed. @danbev do you have any ideas?

I would love to hear your opinions about this. Thank you!

ngxson avatar Jan 19 '25 22:01 ngxson

  • llama_vision_patches actually contains slices of image, not patches, as explained in llava-uhd. The patches are actually produced in clip_image_build_graph by doing a ggml_conv_2d. I think I'll need to rename it to llama_vision_slices, but I actually prefer a more appropriate name like llama_vision_preprocessed_img since we do more than just slicing it (i.e. resize, padding, etc) - feel free to suggest if you have any ideas.

I am just wondering, is there any reason to expose the patches/slices to the user at all? Can the user do anything with the patches other than just immediately call llama_vision_encode and throw them away? If not, then maybe that could be hidden entirely from the user and llama_vision_encode could take directly an image.

slaren avatar Jan 20 '25 00:01 slaren

@ngxson I'll take a closer look at this today, specifically at how this could work with a cross-attention model like Llama 3.2 Vision :+1:

One thing related to this work is something we discussed about how these models should be provided. I initially thought that creating a single .gguf for Llama 3.2 which contained both the vision encoder and the language model would be the way to go, but as can be read in the linked discussion, having separate models is probably a better solution. It would be great to get some clarification regarding this and whether vision encoders should be separate .gguf models. I'm looking at updating the conversion for Llama 3.2 and making changes to convert_hf_to_gguf.py to produce 2 models (vision encoder and language model) instead of one. I'd like to try this out with this latest vision API proposal, but I'd prefer to know what the model(s) should look like before proceeding, to not waste time.

danbev avatar Jan 20 '25 06:01 danbev

@slaren In my first proposal, I made llama_vision_encode directly accept an image. But then I decided to split it into preprocess + encode because:

  • The most important reason is that the user will be able to retrieve the number of tokens that the image occupies (this can vary depending on image size, in the case of llava-uhd). This should be done before any decode/encode, so that the user can leave the appropriate places for the image after the tokenizing step. This is also similar to the Processor class in HF transformers, which returns the preprocessed image and the tokenized prompt with the correct number of "placeholder" tokens for the image embeddings.
  • The second reason is that making this a dedicated function makes it easier to manage error codes, mostly because this function works at the pixel level, not the tensor level.
  • And the third reason is that this preprocessing is thread-safe, so for example llama-server can do this step in the HTTP thread, much like how llama_tokenize is currently handled there. (A rough sketch of the intended usage follows below.)
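To make that concrete, the flow would be something like this (the function names for the preprocessing step are placeholders; the final API may look different):

// Sketch only - preprocessing function names are placeholders
llama_vision_patches * patches = llama_vision_patches_init(vctx, img); // pixel-level work: resize, pad, slice
int32_t n_img_tokens = llama_vision_patches_n_tokens(patches);         // how many positions the image will occupy
// ... tokenize the text prompt, reserving n_img_tokens placeholder positions for the image ...
int32_t ret = llama_vision_encode(vctx, patches);                      // tensor-level work, returns an error code on failure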

ngxson avatar Jan 20 '25 09:01 ngxson

Btw, I have repeatedly mentioned Processor, so I think it's better to give an example of how it works: https://gist.github.com/ngxson/ca46c72f0cc7b441c30dd85c2a24ee62

ngxson avatar Jan 20 '25 10:01 ngxson

@ngxson Sorry about the delay. I've been able to "force" support for mllama using the latest vision API, that is, get an example working. I'm now going to iterate on this and try to figure out how cross attention will work. Just wanted to let you know that some progress is being made.

There is an issue I'm having with the vocab size which I'm not exactly sure how to handle. If anyone has some thoughts around this please let me know.

danbev avatar Jan 22 '25 16:01 danbev

@danbev No worries, I was busy with minicpm-v too. It's still not fully working (inference works, but the llava-uhd preprocessor is missing). Will have a look at your implementation of mllama very soon.

ngxson avatar Jan 22 '25 21:01 ngxson

So, the minicpm-v template is more complicated because it contains both the image and all the slices. Here is what it looks like in minicpmv-cli:

<image> (if no slice, we only have one image) </image><slice><image> (first slice) </image><image> (second slice) </image> .... (n-th slice) </slice>

To get rid of this complication, my idea is to have the embeddings of these tokens (<image>, </image>, <slice> and </slice>) appended into the output tensor returned from llama_vision_encode.

This will make the formatting transparent to the text tokenizer, but it will require the embeddings of these tokens to be stored as one-hot vectors in the vision model (of course we could use ggml_get_rows to get them instead, but that would be quite messy).
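For illustration, the ggml_get_rows variant inside the vision graph would look roughly like this (a sketch, not code from this PR; ctx0, tok_embd and embeddings are assumed names, with tok_embd being the language model's token embedding matrix - which is exactly why it gets messy, since the vision graph would need access to it):

// sketch: fetch the embedding of a marker token (e.g. "<image>") and prepend it
// to the image embeddings along the token dimension
struct ggml_tensor * marker_ids = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, 1);
ggml_set_input(marker_ids); // filled with the "<image>" token id at eval time
struct ggml_tensor * marker_embd = ggml_get_rows(ctx0, tok_embd, marker_ids); // [n_embd, 1]
// embeddings is [n_embd, n_tokens], so dim 1 is the token dimension
embeddings = ggml_concat(ctx0, marker_embd, embeddings, 1);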

ngxson avatar Jan 22 '25 22:01 ngxson

Ok so I managed to get minicpm-v to kinda work out of the box with the API (no changes to user-space code are required).

Upon giving it win XP wallpaper bliss, it says: I see a serene landscape featuring a vast expanse of green grass under a clear blue sky

It currently operates on a resized version of the image (like llava), so the performance will be bad for bigger images (with more details). I'll get llava-uhd to work, which breaks the image into slices and thus allows the LLM to "see" the image at different zoom levels, preserving details.

ngxson avatar Jan 23 '25 11:01 ngxson

I also got SmolVLM (tested with the 500M model) to work with this API without any change to user-space code. The image preprocessor may not be 100% correct, but I'll discuss with the SmolVLM team to learn more about it.

For the bliss.jpg test:

The image depicts a wide, rolling green field with a clear blue sky and scattered clouds. The field is expansive, stretching horizontally across the image, and it is lush with grass, indicating a temperate climate with ample rainfall. The grassy areas are uniformly green, suggesting it is a well-maintained field, possibly cultivated for recreational [...]

ngxson avatar Jan 23 '25 14:01 ngxson

So what remains to be done before this PR can be successfully merged?

wbraswell avatar Feb 03 '25 18:02 wbraswell

I'm going back to this PR, my goals for this week are:

  • [x] Implement the public API for separate llama_context and llama_vision_context - please only expect the public API for now, since the implementation behind it currently depends on https://github.com/ggerganov/llama.cpp/pull/11213
  • [ ] Clean up the llama_vision_tokens struct, while adding functions to retrieve the image cols and rows (related to llava-uhd preprocessor). Ref this discussion: https://github.com/ggerganov/llama.cpp/pull/11513#discussion_r1938234810
  • [ ] Fix the problem where llama_batch cannot cut the input embd_tensor in half when it does not fit into the batch, ref: https://github.com/danbev/learning-ai/discussions/8#discussioncomment-11959937

ngxson avatar Feb 04 '25 17:02 ngxson

@ngxson Sounds good! So I guess this puts us back on the path to re-enabling multimodal?

wbraswell avatar Feb 04 '25 18:02 wbraswell

@ngxson looks very promising! I wanted to try out your fork locally, however perhaps there were some changes since you created the PR description?

cmake --build build -j --target llama-vision 
gmake: *** No rule to make target 'llama-vision'.  Stop.

llama.cpp main branch builds fine, following these build instructions: https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md.

AIWintermuteAI avatar Feb 06 '25 21:02 AIWintermuteAI

Sounds good! So I guess this puts us back on the path to re-enabling multimodal?

yes

I wanted to try out your fork locally, however perhaps there were some changes since you created the PR description?

reconfigure your cmake, cmake -B build ...

ngxson avatar Feb 06 '25 21:02 ngxson

@ngxson Thanks for the fast reply! Actually, I just forgot to git switch xD working late evening. It compiles fine now!

I'm interested in running SmolVLM in particular, will try digging into https://github.com/ggerganov/llama.cpp/pull/9687 to see how I can convert the model over the weekend.

AIWintermuteAI avatar Feb 06 '25 22:02 AIWintermuteAI

I'm interested in running SmolVLM in particular

SmolVLM 500M can already be run via the current PR; you should base your work on this one, not the other.

ngxson avatar Feb 06 '25 22:02 ngxson

First of all, great work! Just wanted to know: how did you convert SmolVLM to GGUF? Because when I tried, I got this error:

ubuntu@deepstream-7-base:~/llama-vision/llama.cpp$ python3 convert_hf_to_gguf.py ../SmolVLM-500M-Instruct/
INFO:hf-to-gguf:Loading model: SmolVLM-500M-Instruct
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:output.weight,               torch.bfloat16 --> F16, shape = {960, 49280}
INFO:hf-to-gguf:v.mmproj.fc.weight,          torch.bfloat16 --> F16, shape = {12288, 960}
Traceback (most recent call last):
  File "/home/ubuntu/llama-vision/llama.cpp/convert_hf_to_gguf.py", line 5421, in <module>
    main()
  File "/home/ubuntu/llama-vision/llama.cpp/convert_hf_to_gguf.py", line 5415, in main
    model_instance.write()
  File "/home/ubuntu/llama-vision/llama.cpp/convert_hf_to_gguf.py", line 479, in write
    self.prepare_tensors()
  File "/home/ubuntu/llama-vision/llama.cpp/convert_hf_to_gguf.py", line 1843, in prepare_tensors
    super().prepare_tensors()
  File "/home/ubuntu/llama-vision/llama.cpp/convert_hf_to_gguf.py", line 338, in prepare_tensors
    for new_name, data_torch in (self.modify_tensors(data_torch, name, bid)):
  File "/home/ubuntu/llama-vision/llama.cpp/convert_hf_to_gguf.py", line 1811, in modify_tensors
    return [(self.map_tensor_name(name), data_torch)]
  File "/home/ubuntu/llama-vision/llama.cpp/convert_hf_to_gguf.py", line 238, in map_tensor_name
    raise ValueError(f"Can not map tensor {name!r}")
ValueError: Can not map tensor 'model.text_model.embed_tokens.weight'

agNihit928 avatar Feb 08 '25 13:02 agNihit928

Looking forward to seeing this get merged, a couple of other PRs seem to depend on it.

liyimeng avatar Feb 10 '25 18:02 liyimeng

@agNihit928 I think something got buggy when I rebased to the latest master; you can maybe go back to https://github.com/ggerganov/llama.cpp/pull/11292/commits/c3a654c0fbad4c7eeeaf669fc708d40aef6f341c to see if it works.

ngxson avatar Feb 15 '25 14:02 ngxson

Sure @ngxson Will check it out Thanks

agNihit928 avatar Feb 15 '25 15:02 agNihit928

It looks like SmolVLM conversion is broken in quite a few places atm.

(ml) dmitrymaslov@DmitryMT15AML86 llama.cpp % python convert_hf_to_gguf.py ../SmolVLM-Instruct
INFO:hf-to-gguf:Loading model: SmolVLM-Instruct
Traceback (most recent call last):
  File "/Users/dmitrymaslov/miniconda3/envs/ml/lib/python3.10/site-packages/transformers/utils/hub.py", line 342, in cached_file
    resolved_file = hf_hub_download(
  File "/Users/dmitrymaslov/miniconda3/envs/ml/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
    validate_repo_id(arg_value)
  File "/Users/dmitrymaslov/miniconda3/envs/ml/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
    raise HFValidationError(
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/fsx/m4/experiments/local_experiment_dir/s3_async_temporary_checkpoint_folder/tr_324_opt_400/unwrapped_model'. Use `repo_type` argument if needed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/dmitrymaslov/llama.cpp/convert_hf_to_gguf.py", line 5426, in <module>
    main()
  File "/Users/dmitrymaslov/llama.cpp/convert_hf_to_gguf.py", line 5394, in main
    hparams = Model.load_hparams(dir_model)
  File "/Users/dmitrymaslov/llama.cpp/convert_hf_to_gguf.py", line 520, in load_hparams
    text_config = AutoConfig.from_pretrained(text_config["_name_or_path"]).to_dict()
  File "/Users/dmitrymaslov/miniconda3/envs/ml/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1075, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/Users/dmitrymaslov/miniconda3/envs/ml/lib/python3.10/site-packages/transformers/configuration_utils.py", line 594, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/Users/dmitrymaslov/miniconda3/envs/ml/lib/python3.10/site-packages/transformers/configuration_utils.py", line 653, in _get_config_dict
    resolved_config_file = cached_file(
  File "/Users/dmitrymaslov/miniconda3/envs/ml/lib/python3.10/site-packages/transformers/utils/hub.py", line 408, in cached_file
    raise EnvironmentError(
OSError: Incorrect path_or_model_id: '/fsx/m4/experiments/local_experiment_dir/s3_async_temporary_checkpoint_folder/tr_324_opt_400/unwrapped_model'. Please provide either the path to a local folder or the repo_id of a model on the Hub.

This is due to https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct/blob/7fb3550ea09521c12c2d17026b537f86e083e8aa/config.json#L12. Changing that line to "_name_or_path": "HuggingFaceTB/SmolVLM-Instruct" brings another error:

(ml) dmitrymaslov@DmitryMT15AML86 llama.cpp % python convert_hf_to_gguf.py ../SmolVLM-Instruct
INFO:hf-to-gguf:Loading model: SmolVLM-Instruct
image_token_id
use_cache
tie_word_embeddings
vision_config
text_config
scale_factor
return_dict
output_hidden_states
output_attentions
torchscript
torch_dtype
use_bfloat16
tf_legacy_loss
pruned_heads
chunk_size_feed_forward
is_encoder_decoder
is_decoder
cross_attention_hidden_size
add_cross_attention
tie_encoder_decoder
max_length
min_length
do_sample
early_stopping
num_beams
num_beam_groups
diversity_penalty
temperature
top_k
top_p
typical_p
repetition_penalty
length_penalty
no_repeat_ngram_size
encoder_no_repeat_ngram_size
bad_words_ids
num_return_sequences
output_scores
return_dict_in_generate
forced_bos_token_id
forced_eos_token_id
remove_invalid_values
exponential_decay_length_penalty
suppress_tokens
begin_suppress_tokens
architectures
finetuning_task
id2label
label2id
tokenizer_class
prefix
bos_token_id
pad_token_id
eos_token_id
sep_token_id
decoder_start_token_id
task_specific_params
problem_type
_name_or_path
_attn_implementation_autoset
transformers_version
image_seq_len
model_type
transformers.js_config
vocab_size
Traceback (most recent call last):
  File "/Users/dmitrymaslov/llama.cpp/convert_hf_to_gguf.py", line 5426, in <module>
    main()
  File "/Users/dmitrymaslov/llama.cpp/convert_hf_to_gguf.py", line 5406, in main
    model_instance = model_class(dir_model=dir_model, ftype=output_type, fname_out=fname_out,
  File "/Users/dmitrymaslov/llama.cpp/convert_hf_to_gguf.py", line 1656, in __init__
    super().__init__(*args, **kwargs)
  File "/Users/dmitrymaslov/llama.cpp/convert_hf_to_gguf.py", line 100, in __init__
    self.block_count = self.find_hparam(["n_layers", "num_hidden_layers", "n_layer", "num_layers"])
  File "/Users/dmitrymaslov/llama.cpp/convert_hf_to_gguf.py", line 140, in find_hparam
    raise KeyError(f"could not find any of: {keys}")
KeyError: "could not find any of: ['n_layers', 'num_hidden_layers', 'n_layer', 'num_layers']"

The debug print of the keys is mine. The error is due to num_hidden_layers being nested inside text_config or vision_config.

@ngxson Have you made any changes locally to accommodate that? Hopefully I can dig a bit into that this week

AIWintermuteAI avatar Feb 26 '25 21:02 AIWintermuteAI

Unfortunately, reverting back to https://github.com/ggml-org/llama.cpp/pull/11292/commits/c3a654c0fbad4c7eeeaf669fc708d40aef6f341c does not improve the situation :( I will attempt to fix it, but since I'm not familiar with the internals of llama.cpp, it might take time.

(ml) llama.cpp % python3 convert_hf_to_gguf.py ../SmolVLM-Instruct 
INFO:hf-to-gguf:Loading model: SmolVLM-Instruct
Traceback (most recent call last):
  File "/Users/dmitrymaslov/llama.cpp/convert_hf_to_gguf.py", line 5432, in <module>
    main()
  File "/Users/dmitrymaslov/llama.cpp/convert_hf_to_gguf.py", line 5412, in main
    model_instance = model_class(dir_model=dir_model, ftype=output_type, fname_out=fname_out,
  File "/Users/dmitrymaslov/llama.cpp/convert_hf_to_gguf.py", line 1628, in __init__
    super().__init__(*args, **kwargs)
  File "/Users/dmitrymaslov/llama.cpp/convert_hf_to_gguf.py", line 100, in __init__
    self.block_count = self.find_hparam(["n_layers", "num_hidden_layers", "n_layer", "num_layers"])
  File "/Users/dmitrymaslov/llama.cpp/convert_hf_to_gguf.py", line 135, in find_hparam
    raise KeyError(f"could not find any of: {keys}")
KeyError: "could not find any of: ['n_layers', 'num_hidden_layers', 'n_layer', 'num_layers']"
(ml) dmitrymaslov@Dmitry-Maslov-C02CT15AML86 llama.cpp % python3 convert_hf_to_gguf.py ../SmolVLM-Instruct
INFO:hf-to-gguf:Loading model: SmolVLM-Instruct
Traceback (most recent call last):
  File "/Users/dmitrymaslov/miniconda3/envs/ml/lib/python3.10/site-packages/transformers/utils/hub.py", line 342, in cached_file
    resolved_file = hf_hub_download(
  File "/Users/dmitrymaslov/miniconda3/envs/ml/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
    validate_repo_id(arg_value)
  File "/Users/dmitrymaslov/miniconda3/envs/ml/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
    raise HFValidationError(
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/fsx/m4/experiments/local_experiment_dir/s3_async_temporary_checkpoint_folder/tr_324_opt_400/unwrapped_model'. Use `repo_type` argument if needed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/dmitrymaslov/llama.cpp/convert_hf_to_gguf.py", line 5432, in <module>
    main()
  File "/Users/dmitrymaslov/llama.cpp/convert_hf_to_gguf.py", line 5400, in main
    hparams = Model.load_hparams(dir_model)
  File "/Users/dmitrymaslov/llama.cpp/convert_hf_to_gguf.py", line 515, in load_hparams
    text_config = AutoConfig.from_pretrained(text_config["_name_or_path"]).to_dict()
  File "/Users/dmitrymaslov/miniconda3/envs/ml/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1075, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/Users/dmitrymaslov/miniconda3/envs/ml/lib/python3.10/site-packages/transformers/configuration_utils.py", line 594, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/Users/dmitrymaslov/miniconda3/envs/ml/lib/python3.10/site-packages/transformers/configuration_utils.py", line 653, in _get_config_dict
    resolved_config_file = cached_file(
  File "/Users/dmitrymaslov/miniconda3/envs/ml/lib/python3.10/site-packages/transformers/utils/hub.py", line 408, in cached_file
    raise EnvironmentError(
OSError: Incorrect path_or_model_id: '/fsx/m4/experiments/local_experiment_dir/s3_async_temporary_checkpoint_folder/tr_324_opt_400/unwrapped_model'. Please provide either the path to a local folder or the repo_id of a model on the Hub.

(ml) llama.cpp % git status
HEAD detached at c3a654c0
Untracked files:
  (use "git add <file>..." to include in what will be committed)
        bliss.png

AIWintermuteAI avatar Mar 01 '25 14:03 AIWintermuteAI

This PR is only tested with SmolVLM 500M: https://huggingface.co/HuggingFaceTB/SmolVLM-500M-Instruct

If you're using another model, I don't know.

ngxson avatar Mar 01 '25 15:03 ngxson

Btw, a small reminder so I don't forget:

[!IMPORTANT]

Please do NOT upload gguf produced via this PR on the internet. People don't know how to use it and they will complain, very annoying!

ngxson avatar Mar 01 '25 15:03 ngxson

@AIWintermuteAI Based on my testing, I was able to generate the GGUF files for both the 256M and the 500M models (of the original Hugging Face repos) with the mentioned commit, i.e., c3a654c0

agNihit928 avatar Mar 01 '25 15:03 agNihit928

Ah, interesting! I was using https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct, which is "supposed to be" the same, but has a different (broken) config.

AIWintermuteAI avatar Mar 01 '25 15:03 AIWintermuteAI

Absolutely, I'm not sharing anything, since I can't even get it to work yet xD

AIWintermuteAI avatar Mar 01 '25 15:03 AIWintermuteAI

I decided to try other models. The good news is that https://huggingface.co/mtgv/MobileVLM_V2-1.7B can be converted and runs! The bad news is that it does not seem to take the image into account, using bliss.png as input:

(ml)  % ./build/bin/llama-vision --image bliss.png -m ../MobileVLM_V2-1.7B/MobileVLM_V2-1.7B-F16.gguf -p "What is in the image?"
build: 4677 (fa552817) with Apple clang version 16.0.0 (clang-1600.0.26.6) for x86_64-apple-darwin23.6.0
llama_model_loader: loaded meta data with 45 key-value pairs and 614 tensors from ../MobileVLM_V2-1.7B/MobileVLM_V2-1.7B-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = MobileVLM_V2 1.7B
llama_model_loader: - kv   3:                       general.organization str              = Mtgv
llama_model_loader: - kv   4:                           general.basename str              = MobileVLM_V2
llama_model_loader: - kv   5:                         general.size_label str              = 1.7B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                               general.tags arr[str,1]       = ["MobileVLM V2"]
llama_model_loader: - kv   8:                          llama.block_count u32              = 24
llama_model_loader: - kv   9:                       llama.context_length u32              = 2048
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 5632
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 16
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 16
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  16:                                vision.type str              = vit
llama_model_loader: - kv  17:                          vision.image_size u32              = 336
llama_model_loader: - kv  18:                          vision.patch_size u32              = 14
llama_model_loader: - kv  19:                    vision.vit.architecture str              = mobilevlm
llama_model_loader: - kv  20:                     vision.vit.block_count u32              = 24
llama_model_loader: - kv  21:                vision.vit.embedding_length u32              = 1024
llama_model_loader: - kv  22:             vision.vit.feed_forward_length u32              = 4096
llama_model_loader: - kv  23:            vision.vit.attention.head_count u32              = 16
llama_model_loader: - kv  24:                          vision.image_mean arr[f32,3]       = [0.481455, 0.457828, 0.408211]
llama_model_loader: - kv  25:                           vision.image_std arr[f32,3]       = [0.268630, 0.261303, 0.275777]
llama_model_loader: - kv  26:                    vision.vit.select_layer i32              = -2
llama_model_loader: - kv  27:                          general.file_type u32              = 1
llama_model_loader: - kv  28:                           llama.vocab_size u32              = 32000
llama_model_loader: - kv  29:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  30:                vision.vit.patch_merge_type str              = flat
llama_model_loader: - kv  31:                  vision.vit.projector_type str              = ldpv2
llama_model_loader: - kv  32:    vision.vit.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv  33:         vision.vit.max_position_embeddings u32              = 577
llama_model_loader: - kv  34:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  35:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  36:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  37:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  38:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  39:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  40:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  41:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  42:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  43:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  44:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  295 tensors
llama_model_loader: - type  f16:  319 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = F16
print_info: file size   = 3.12 GiB (16.00 BPW) 
load_hparams: loading ViT vision model
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 3
load: token to piece cache size = 0.1684 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 2048
print_info: n_embd           = 2048
print_info: n_layer          = 24
print_info: n_head           = 16
print_info: n_head_kv        = 16
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 1
print_info: n_embd_k_gqa     = 2048
print_info: n_embd_v_gqa     = 2048
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 5632
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 2048
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = ?B
print_info: model params     = 1.67 B
print_info: general.name     = MobileVLM_V2 1.7B
print_info: vocab type       = SPM
print_info: n_vocab          = 32000
print_info: n_merges         = 0
print_info: BOS token        = 1 '<s>'
print_info: EOS token        = 2 '</s>'
print_info: UNK token        = 0 '<unk>'
print_info: PAD token        = 0 '<unk>'
print_info: LF token         = 13 '<0x0A>'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors:   CPU_Mapped model buffer size =  3193.95 MiB
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 4096
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 1024
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 10000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_pre_seq (4096) > n_ctx_train (2048) -- possible training context overflow
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 24, can_shift = 1
llama_kv_cache_init:        CPU KV buffer size =   768.00 MiB
llama_init_from_model: KV self size  =  768.00 MiB, K (f16):  384.00 MiB, V (f16):  384.00 MiB
llama_init_from_model:        CPU  output buffer size =     0.12 MiB
llama_init_from_model:        CPU compute buffer size =   304.01 MiB
llama_init_from_model: graph nodes  = 774
llama_init_from_model: graph splits = 386 (with bs=1024), 1 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
loaded image bliss.png, size = 300 x 241
encoded image
eval text batch (7 tokens)
prompt processed, 7 tokens
</s>
(ml) dmitrymaslov@Dmitry-Maslov-C02CT15AML86 llama.cpp % ./build/bin/llama-vision --image bliss.png -m ../MobileVLM_V2-1.7B/MobileVLM_V2-1.7B-F16.gguf -p "What is in the image?"
build: 4677 (fa552817) with Apple clang version 16.0.0 (clang-1600.0.26.6) for x86_64-apple-darwin23.6.0
llama_model_loader: loaded meta data with 45 key-value pairs and 614 tensors from ../MobileVLM_V2-1.7B/MobileVLM_V2-1.7B-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = MobileVLM_V2 1.7B
llama_model_loader: - kv   3:                       general.organization str              = Mtgv
llama_model_loader: - kv   4:                           general.basename str              = MobileVLM_V2
llama_model_loader: - kv   5:                         general.size_label str              = 1.7B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                               general.tags arr[str,1]       = ["MobileVLM V2"]
llama_model_loader: - kv   8:                          llama.block_count u32              = 24
llama_model_loader: - kv   9:                       llama.context_length u32              = 2048
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 5632
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 16
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 16
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  16:                                vision.type str              = vit
llama_model_loader: - kv  17:                          vision.image_size u32              = 336
llama_model_loader: - kv  18:                          vision.patch_size u32              = 14
llama_model_loader: - kv  19:                    vision.vit.architecture str              = mobilevlm
llama_model_loader: - kv  20:                     vision.vit.block_count u32              = 24
llama_model_loader: - kv  21:                vision.vit.embedding_length u32              = 1024
llama_model_loader: - kv  22:             vision.vit.feed_forward_length u32              = 4096
llama_model_loader: - kv  23:            vision.vit.attention.head_count u32              = 16
llama_model_loader: - kv  24:                          vision.image_mean arr[f32,3]       = [0.481455, 0.457828, 0.408211]
llama_model_loader: - kv  25:                           vision.image_std arr[f32,3]       = [0.268630, 0.261303, 0.275777]
llama_model_loader: - kv  26:                    vision.vit.select_layer i32              = -2
llama_model_loader: - kv  27:                          general.file_type u32              = 1
llama_model_loader: - kv  28:                           llama.vocab_size u32              = 32000
llama_model_loader: - kv  29:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  30:                vision.vit.patch_merge_type str              = flat
llama_model_loader: - kv  31:                  vision.vit.projector_type str              = ldpv2
llama_model_loader: - kv  32:    vision.vit.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv  33:         vision.vit.max_position_embeddings u32              = 577
llama_model_loader: - kv  34:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  35:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  36:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  37:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  38:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  39:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  40:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  41:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  42:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  43:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  44:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  295 tensors
llama_model_loader: - type  f16:  319 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = F16
print_info: file size   = 3.12 GiB (16.00 BPW) 
load_hparams: loading ViT vision model
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 3
load: token to piece cache size = 0.1684 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 2048
print_info: n_embd           = 2048
print_info: n_layer          = 24
print_info: n_head           = 16
print_info: n_head_kv        = 16
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 1
print_info: n_embd_k_gqa     = 2048
print_info: n_embd_v_gqa     = 2048
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 5632
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 2048
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = ?B
print_info: model params     = 1.67 B
print_info: general.name     = MobileVLM_V2 1.7B
print_info: vocab type       = SPM
print_info: n_vocab          = 32000
print_info: n_merges         = 0
print_info: BOS token        = 1 '<s>'
print_info: EOS token        = 2 '</s>'
print_info: UNK token        = 0 '<unk>'
print_info: PAD token        = 0 '<unk>'
print_info: LF token         = 13 '<0x0A>'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors:   CPU_Mapped model buffer size =  3193.95 MiB
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 4096
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 1024
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 10000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_pre_seq (4096) > n_ctx_train (2048) -- possible training context overflow
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 24, can_shift = 1
llama_kv_cache_init:        CPU KV buffer size =   768.00 MiB
llama_init_from_model: KV self size  =  768.00 MiB, K (f16):  384.00 MiB, V (f16):  384.00 MiB
llama_init_from_model:        CPU  output buffer size =     0.12 MiB
llama_init_from_model:        CPU compute buffer size =   304.01 MiB
llama_init_from_model: graph nodes  = 774
llama_init_from_model: graph splits = 386 (with bs=1024), 1 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
loaded image bliss.png, size = 300 x 241
encoded image
eval text batch (7 tokens)
prompt processed, 7 tokens

- The image features an aerial view of a city street.
- There is a small car parked on the street.
- The car appears to be in a driving position, indicating that it is not stationary.
- The street is lined with trees, providing some greenery and shade.

AIWintermuteAI avatar Mar 01 '25 15:03 AIWintermuteAI

So, I can successfully convert and run SmolVLM on the commit you specified. However, the output is not relevant to what is in the picture...

(ml) dmitrymaslov@Dmitry-Maslov-C02CT15AML86 llama.cpp % ./build/bin/llama-vision --image bliss.png -m ../SmolVLM-500M-Instruct/SmolVLM-500M-Instruct-F16.gguf -p "What is in the image?"
build: 4542 (c3a654c0) with Apple clang version 16.0.0 (clang-1600.0.26.6) for x86_64-apple-darwin23.6.0
llama_model_loader: loaded meta data with 65 key-value pairs and 489 tensors from ../SmolVLM-500M-Instruct/SmolVLM-500M-Instruct-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = SmolVLM 500M Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = SmolVLM
llama_model_loader: - kv   5:                         general.size_label str              = 500M
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                   general.base_model.count u32              = 2
llama_model_loader: - kv   8:                  general.base_model.0.name str              = SmolLM2 360M Instruct
llama_model_loader: - kv   9:          general.base_model.0.organization str              = HuggingFaceTB
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/HuggingFaceTB/...
llama_model_loader: - kv  11:                  general.base_model.1.name str              = Siglip Base Patch16 512
llama_model_loader: - kv  12:               general.base_model.1.version str              = 512
llama_model_loader: - kv  13:          general.base_model.1.organization str              = Google
llama_model_loader: - kv  14:              general.base_model.1.repo_url str              = https://huggingface.co/google/siglip-...
llama_model_loader: - kv  15:                      general.dataset.count u32              = 2
llama_model_loader: - kv  16:                     general.dataset.0.name str              = The_Cauldron
llama_model_loader: - kv  17:             general.dataset.0.organization str              = HuggingFaceM4
llama_model_loader: - kv  18:                 general.dataset.0.repo_url str              = https://huggingface.co/HuggingFaceM4/...
llama_model_loader: - kv  19:                     general.dataset.1.name str              = Docmatix
llama_model_loader: - kv  20:             general.dataset.1.organization str              = HuggingFaceM4
llama_model_loader: - kv  21:                 general.dataset.1.repo_url str              = https://huggingface.co/HuggingFaceM4/...
llama_model_loader: - kv  22:                               general.tags arr[str,1]       = ["image-text-to-text"]
llama_model_loader: - kv  23:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  24:                          llama.block_count u32              = 32
llama_model_loader: - kv  25:                       llama.context_length u32              = 8192
llama_model_loader: - kv  26:                     llama.embedding_length u32              = 960
llama_model_loader: - kv  27:                  llama.feed_forward_length u32              = 2560
llama_model_loader: - kv  28:                 llama.attention.head_count u32              = 15
llama_model_loader: - kv  29:              llama.attention.head_count_kv u32              = 5
llama_model_loader: - kv  30:                       llama.rope.freq_base f32              = 100000.000000
llama_model_loader: - kv  31:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  32:                 llama.attention.key_length u32              = 64
llama_model_loader: - kv  33:               llama.attention.value_length u32              = 64
llama_model_loader: - kv  34:                                vision.type str              = vit
llama_model_loader: - kv  35:                          vision.image_size u32              = 512
llama_model_loader: - kv  36:                          vision.patch_size u32              = 16
llama_model_loader: - kv  37:                    vision.vit.architecture str              = idefics3
llama_model_loader: - kv  38:                     vision.vit.block_count u32              = 12
llama_model_loader: - kv  39:                vision.vit.embedding_length u32              = 768
llama_model_loader: - kv  40:             vision.vit.feed_forward_length u32              = 3072
llama_model_loader: - kv  41:            vision.vit.attention.head_count u32              = 12
llama_model_loader: - kv  42:                          vision.image_mean arr[f32,3]       = [0.500000, 0.500000, 0.500000]
llama_model_loader: - kv  43:                           vision.image_std arr[f32,3]       = [0.500000, 0.500000, 0.500000]
llama_model_loader: - kv  44:                    vision.vit.select_layer i32              = 0
llama_model_loader: - kv  45:                          general.file_type u32              = 1
llama_model_loader: - kv  46:                           llama.vocab_size u32              = 49280
llama_model_loader: - kv  47:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv  48:                vision.vit.patch_merge_type str              = flat
llama_model_loader: - kv  49:                  vision.vit.projector_type str              = mlp
llama_model_loader: - kv  50:                    vision.vit.scale_factor i32              = 4
llama_model_loader: - kv  51:    vision.vit.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv  52:         vision.vit.max_position_embeddings u32              = 1024
llama_model_loader: - kv  53:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  54:                         tokenizer.ggml.pre str              = smollm
llama_model_loader: - kv  55:                      tokenizer.ggml.tokens arr[str,49280]   = ["<|endoftext|>", "<|im_start|>", "<|...
llama_model_loader: - kv  56:                  tokenizer.ggml.token_type arr[i32,49280]   = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  57:                      tokenizer.ggml.merges arr[str,48900]   = ["Ġ t", "Ġ a", "i n", "h e", "Ġ Ġ...
llama_model_loader: - kv  58:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  59:                tokenizer.ggml.eos_token_id u32              = 49279
llama_model_loader: - kv  60:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  61:            tokenizer.ggml.padding_token_id u32              = 2
llama_model_loader: - kv  62:                    tokenizer.chat_template str              = <|im_start|>{% for message in message...
llama_model_loader: - kv  63:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  64:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  188 tensors
llama_model_loader: - type  f16:  301 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = F16
print_info: file size   = 968.30 MiB (16.01 BPW) 
load_hparams: loading ViT vision model
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 145
load: token to piece cache size = 0.3199 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 8192
print_info: n_embd           = 960
print_info: n_layer          = 32
print_info: n_head           = 15
print_info: n_head_kv        = 5
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 3
print_info: n_embd_k_gqa     = 320
print_info: n_embd_v_gqa     = 320
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 2560
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 100000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 8192
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 8B
print_info: model params     = 507.48 M
print_info: general.name     = SmolVLM 500M Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 49280
print_info: n_merges         = 48900
print_info: BOS token        = 1 '<|im_start|>'
print_info: EOS token        = 49279 '<end_of_utterance>'
print_info: EOT token        = 2 '<|im_end|>'
print_info: UNK token        = 0 '<|endoftext|>'
print_info: PAD token        = 2 '<|im_end|>'
print_info: LF token         = 143 'Ä'
print_info: EOG token        = 0 '<|endoftext|>'
print_info: EOG token        = 2 '<|im_end|>'
print_info: EOG token        = 49279 '<end_of_utterance>'
print_info: max token length = 162
load_tensors:   CPU_Mapped model buffer size =   968.30 MiB
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 4096
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 1024
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 100000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1
llama_kv_cache_init:        CPU KV buffer size =   160.00 MiB
llama_init_from_model: KV self size  =  160.00 MiB, K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_init_from_model:        CPU  output buffer size =     0.19 MiB
llama_init_from_model:        CPU compute buffer size =   271.01 MiB
llama_init_from_model: graph nodes  = 1030
llama_init_from_model: graph splits = 514 (with bs=1024), 1 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
loaded image bliss.png, size = 300 x 241
encoded image
eval text batch (6 tokens)
prompt processed, 6 tokens

The image contains a group of people.
The image represents a family.
The family consists of two adults and two children.
The adults are standing on either side of the children.
The children are also standing and are wearing clothes.
The adults are wearing clothes and are holding hands with the children.

(ml) dmitrymaslov@Dmitry-Maslov-C02CT15AML86 llama.cpp % git status
HEAD detached at c3a654c0
Untracked files:
  (use "git add <file>..." to include in what will be committed)
        bliss.png

nothing added to commit but untracked files present (use "git add" to track)

I ran it a few times; unfortunately, it cannot get it right even once.

AIWintermuteAI avatar Mar 01 '25 15:03 AIWintermuteAI