support GLM-4.5V and GLM-4.1V vision models

Open · ddh0 opened this issue 2 months ago · 11 comments

Add support for zai-org/GLM-4.5V and zai-org/GLM-4.1V-9B-Thinking vision models to llama.cpp. I currently only plan to support images + text, no video inputs in this PR.

The architecture is Glm4vMoeForConditionalGeneration ("model_type": "glm4v_moe") / Glm4vForConditionalGeneration ("model_type": "glm4v"). Internally, these consist of an LLM (text model) and a ViT (vision adapter / multimodal projector):
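For anyone poking at the conversion, here is a minimal sketch of how one might peek at the composite config (this assumes a transformers release that already ships the glm4v/glm4v_moe model types; attribute names follow the usual text_config/vision_config layout):

```python
# Minimal sketch: inspect the composite config (only config.json is fetched).
# Assumes a transformers release that already includes the glm4v model types.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("zai-org/GLM-4.1V-9B-Thinking")
print(cfg.model_type)                                                    # "glm4v"
print(type(cfg.text_config).__name__, cfg.text_config.hidden_size)       # LLM sub-config
print(type(cfg.vision_config).__name__, cfg.vision_config.hidden_size)   # ViT sub-config
```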

LLM

  • Based on GLM-4.5-Air / GLM-4-9B-0414
  • Tensor names start with model.language_model.
  • Uses a "multimodal 3D RoPE" - in apply_multimodal_rotary_pos_emb, it applies rotary embeddings across temporal, height, and width dimensions for visual tokens
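As a rough illustration of that 3D RoPE, here is a simplified, self-contained sketch in the style of the transformers apply_multimodal_rotary_pos_emb (not the llama.cpp code; the section sizes are placeholders, not GLM-4.5V's actual config):

```python
# Simplified sketch of a section-split ("3D") rotary embedding. cos/sin carry
# one leading slice per axis (temporal, height, width); mrope_section says how
# many rotary dims each axis owns.
import torch

def rotate_half(x):
    half = x.shape[-1] // 2
    return torch.cat((-x[..., half:], x[..., :half]), dim=-1)

def apply_multimodal_rope(q, k, cos, sin, mrope_section):
    # q, k: [batch, heads, seq, head_dim]; cos, sin: [3, batch, seq, head_dim]
    sections = mrope_section * 2  # cos/sin already duplicate the half-dims
    cos = torch.cat([c[i % 3] for i, c in enumerate(cos.split(sections, dim=-1))], dim=-1).unsqueeze(1)
    sin = torch.cat([c[i % 3] for i, c in enumerate(sin.split(sections, dim=-1))], dim=-1).unsqueeze(1)
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin

# smoke test with made-up shapes: head_dim 128, placeholder sections summing to 64
b, h, s, d = 1, 2, 5, 128
q, k = torch.randn(b, h, s, d), torch.randn(b, h, s, d)
cos, sin = torch.randn(3, b, s, d), torch.randn(3, b, s, d)
q2, k2 = apply_multimodal_rope(q, k, cos, sin, [16, 24, 24])
print(q2.shape, k2.shape)
```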

ViT

  • Adapted from apple/aimv2-huge-patch14-336:
    • Architecture Aimv2VisionModel
    • ~681M params
    • 24 layers
    • hidden_size (n_embd): 1536
    • intermediate_size (n_ff): 4096
    • image_size: 336
    • patch_size: 14
    • num_channels: 3
    • depth: 24
  • Tensor names start with model.visual.
  • Its 2D positional embeddings are dynamically adapted via bicubic interpolation within the Glm4vMoeVisionEmbeddings module to handle varied image resolutions (see the interpolation sketch after this list)
  • It also applies its own rotary position embeddings within the self-attention blocks (via apply_rotary_pos_emb_vision)
  • Same for both models
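A minimal sketch of the bicubic interpolation step mentioned above (illustrative names and shapes, not the actual Glm4vMoeVisionEmbeddings code):

```python
# Minimal sketch: resize a learned 24x24 positional-embedding grid (336/14 = 24)
# to an arbitrary patch grid with bicubic interpolation.
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, grid_h, grid_w, base_grid=24):
    # pos_embed: [base_grid * base_grid, n_embd]
    n_embd = pos_embed.shape[-1]
    grid = pos_embed.view(1, base_grid, base_grid, n_embd).permute(0, 3, 1, 2)  # [1, C, 24, 24]
    grid = F.interpolate(grid, size=(grid_h, grid_w), mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(grid_h * grid_w, n_embd)

# e.g. adapt the trained table to a 32x20 patch grid from a non-square image
print(resize_pos_embed(torch.randn(24 * 24, 1536), 32, 20).shape)  # [640, 1536]
```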

Other notes:

  • Native context length is 65,536 (as opposed to 131,072 for GLM-4.5-Air)
  • RoPE theta (θ): 10,000.0 (as opposed to 100,000.0 for GLM-4.5-Air)
  • The models support video input, but I do not plan to support it in this PR (images only)
  • Tokenizer has video-related special tokens - need to handle these during conversion
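For the conversion side, a quick way to list the added special tokens and spot the video-related ones (the "video" substring filter is only a heuristic I'm assuming here, not a guaranteed naming convention):

```python
# List the tokenizer's added special tokens so the conversion can account for
# the video-related ones. The "video" substring check is just a heuristic.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("zai-org/GLM-4.1V-9B-Thinking")
added = tok.get_added_vocab()  # {token_string: token_id}
for text, tid in sorted(added.items(), key=lambda kv: kv[1]):
    marker = "  <-- video-related?" if "video" in text.lower() else ""
    print(tid, text, marker)
```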

ddh0 · Oct 15 '25 19:10

So, it turns out that vision in this model is based on Qwen3-VL, which still needs support from llama.cpp. I am pretty familiar with llama.cpp in general but not with mtmd, so I may not be able to get this PR done on my own. I will keep trying to hack at it when I have time, and I would appreciate any help I could get. :)

Also just saw this thread (#16207) in which someone has posted a patch to get Qwen3-VL kinda-sorta-working in llama.cpp. I will take a look at that too and see if it is helpful - it might make more sense to get Qwen3-VL to a working state in llama.cpp first and only then start working on this PR on top of that. Not sure, just thinking out loud.

ddh0 · Oct 17 '25 16:10

Thanks for your work, @ddh0! Based on the commit history, the imports of qwen3vl are the result of a "use qwen data class to avoid repeat again" refactor, so it's probably not quite "based on Qwen3-VL". But anyway, I'm planning to dive into Qwen3-VL and GLM-4.5V later this month and I hope I can help.

rujialiu · Oct 18 '25 02:10

> I'm planning to dive into Qwen3-VL and GLM-4.5V later this month and I hope I can help.

Thank you @rujialiu! I suspect your understanding of the mtmd side of things is better than mine - I could use some guidance on what the next steps should be. I've paused working on this for now until I have a better understanding of what exactly needs to be done.

Also cc @ngxson (llama vision expert :))

ddh0 · Oct 18 '25 20:10

> Thank you @rujialiu! I suspect your understanding of the mtmd side of things is better than mine - I could use some guidance on what the next steps should be. I've paused working on this for now until I have a better understanding of what exactly needs to be done.

I had zero understanding of mtmd before tackling the "inaccurate bbox" issue 😄 and then many people helped me along the way. So let's do/learn things together!

rujialiu · Oct 19 '25 07:10

@ddh0 I asked Claude Sonnet 4.5 to carefully inspect the transformers implementation and tell me the differences between Qwen2.5-VL, Qwen3-VL, and GLM4V (not as a single question; I asked several more specific questions and inspected the code it gave me). I'm not knowledgeable enough to check every detail, but it looks good to me. In short, GLM4V is very close to Qwen2.5-VL:

  • Same chunked RoPE (though with different names: MRoPE vs 2D-RoPE/3D-RoPE); glm4v's apply_multimodal_rotary_pos_emb even refers to the Qwen-VL paper
  • Same max(t,h,w) logic (see the position-id sketch after this list)
  • Same window attention/patch merging (because it re-uses Qwen2_5_VLVisionAttention and Glm4vVisionPatchMerger, but I haven't checked this carefully)
  • (new) Learnable embeddings + bicubic interpolation (search for the "Perform bicubic interpolation" comment)
  • A few different constants, like hidden_size, rms_norm_eps, etc.
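To illustrate the max(t,h,w) item above, here is a tiny sketch of how I read the position-id assignment (based on Qwen2-VL's get_rope_index; treat it as a sketch, not something verified line-by-line against GLM4V):

```python
# Sketch of the shared position-id rule: text tokens advance all three axes
# together, an image block spans a (t, h, w) grid, and the token after the
# block resumes at start + max(t, h, w).
def build_mrope_positions(segments):
    """segments: list of ("text", n_tokens) or ("image", (t, h, w)) entries."""
    pos_t, pos_h, pos_w = [], [], []
    nxt = 0  # next free position id
    for kind, info in segments:
        if kind == "text":
            for i in range(info):
                pos_t.append(nxt + i); pos_h.append(nxt + i); pos_w.append(nxt + i)
            nxt += info
        else:  # image block over a (t, h, w) grid of vision tokens
            t, h, w = info
            for ti in range(t):
                for hi in range(h):
                    for wi in range(w):
                        pos_t.append(nxt + ti); pos_h.append(nxt + hi); pos_w.append(nxt + wi)
            nxt += max(t, h, w)  # the max(t, h, w) rule
    return pos_t, pos_h, pos_w

# 3 text tokens, a 1x4x4 image, then 2 more text tokens
print(build_mrope_positions([("text", 3), ("image", (1, 4, 4)), ("text", 2)]))
```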

If it's so similar to Qwen2.5-VL, why does the code re-use qwen3_vl_moe? Because Qwen2.5-VL doesn't have an MoE version 😄 Maybe we only need to wire up GLM-4.5V's LLM with its vision encoder in the most obvious way and we're done.

So I guess it's ok to resume the work directly, based on https://github.com/FMayran/llama.cpp/tree/QwenVL-causal-fix

It should be easy to adapt to whatever "llama_batch improvement" is merged into master later.

BTW: Can we make sure the dense version (GLM-4.1V-9B-Thinking, #14495) is working first? It's much smaller and easier to compare against transformers, and it looks like GLM-4.5V is no different apart from the LLM part.

rujialiu · Oct 20 '25 02:10

Thank you @rujialiu, that's all very helpful. I will take a look at GLM-4.1V-9B-Thinking and see if it can be incorporated into this PR.

Is there a PR associated with the branch you linked (FMayran:QwenVL-causal-fix)? If not, maybe we should start a separate PR to fix the vision implementation of these families of models in general and then come back to this PR to finish the model-specific parts.

ddh0 · Oct 20 '25 04:10

> Thank you @rujialiu, that's all very helpful. I will take a look at GLM-4.1V-9B-Thinking and see if it can be incorporated into this PR.
>
> Is there a PR associated with the branch you linked (FMayran:QwenVL-causal-fix)? If not, maybe we should start a separate PR to fix the vision implementation of these families of models in general and then come back to this PR to finish the model-specific parts.

Of course! Hopefully @ngxson will find some time to fix the general problem (adding an internal token index for the causality check). Since you're familiar with the LLM part, you can take a look at our discussion in #15474 (the quickest way is to read it in bottom-up order until you understand). The issue and its solution are conceptually very simple, but I'm not brave/skillful enough to touch llama-batch.h 😄

rujialiu · Oct 20 '25 04:10

> Thank you @rujialiu, that's all very helpful. I will take a look at GLM-4.1V-9B-Thinking and see if it can be incorporated into this PR.
>
> Is there a PR associated with the branch you linked (FMayran:QwenVL-causal-fix)? If not, maybe we should start a separate PR to fix the vision implementation of these families of models in general and then come back to this PR to finish the model-specific parts.

now there is: https://github.com/ggml-org/llama.cpp/pull/16745

FMayran · Oct 23 '25 16:10

I've just merged the latest from master and resolved some minor conflicts by hand. Since #16780, #16825, and #16848 are all merged now (yay!!), I will resume working on this. I'll need some time to wrap my head around how Qwen3VL vision works and how the implementation of GLM-4.5V may differ.

ddh0 · Oct 30 '25 19:10

I've been getting busy recently, but I see that bicubic interpolation (which we need) is almost ready in #16891 😄 @ddh0

rujialiu · Nov 01 '25 10:11

> • Its 2D positional embeddings are dynamically adapted via bicubic interpolation within the Glm4vMoeVisionEmbeddings module to handle varied image resolutions
>
> • It also applies its own rotary position embeddings within the self-attention blocks (via apply_rotary_pos_emb_vision)

This is essentially the same thing that LFM2 uses; you can copy most of the code from that model (already supported by mtmd).

The key difference is in the projection stage. GLM4V uses the following (see the sketch after this list):

  • one conv2d to merge output tokens (2x2 square kernel), see: https://github.com/huggingface/transformers/blob/8012f80f722044fd0dda45b4034f89fffc2ff344/src/transformers/models/glm4v/modeling_glm4v.py#L731
  • then finally project to text embeddings using a simple FFN, see: https://github.com/huggingface/transformers/blob/8012f80f722044fd0dda45b4034f89fffc2ff344/src/transformers/models/glm4v/modeling_glm4v.py#L116
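A rough PyTorch sketch of that projection path (module and parameter names are made up for illustration; the real shapes come from the linked modeling_glm4v.py):

```python
# Rough sketch of the two-step projection: a 2x2 conv merges each 2x2 block of
# vision tokens, then an FFN maps the merged token into the text embedding
# space. Dims here are illustrative, not the transformers code.
import torch
import torch.nn as nn

class Glm4vStyleProjector(nn.Module):
    def __init__(self, vit_dim=1536, text_dim=4096, merge=2):
        super().__init__()
        self.downsample = nn.Conv2d(vit_dim, text_dim, kernel_size=merge, stride=merge)
        self.mlp = nn.Sequential(nn.Linear(text_dim, text_dim), nn.GELU(),
                                 nn.Linear(text_dim, text_dim))

    def forward(self, tokens, grid_h, grid_w):
        # tokens: [num_tokens, vit_dim], row-major over the patch grid
        x = tokens.view(1, grid_h, grid_w, -1).permute(0, 3, 1, 2)  # [1, C, H, W]
        x = self.downsample(x)                                      # [1, text_dim, H/2, W/2]
        x = x.flatten(2).transpose(1, 2).squeeze(0)                 # [(H/2)*(W/2), text_dim]
        return self.mlp(x)

# a 336x336 image at patch_size 14 gives a 24x24 grid of 1536-d vision tokens
print(Glm4vStyleProjector()(torch.randn(24 * 24, 1536), 24, 24).shape)  # [144, 4096]
```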

ngxson · Nov 06 '25 21:11