mtmd: Add JinaCLIP v2 vision projector + GGUF support for jina-bert-v3 (merged-LoRA or adapter)
Update Notes (2025‑11‑6)
- CLI Merge
- Fold the standalone Jina CLI into mtmd-cli’s projector‑only flow; remove the extra binary.
- Conversion Script (set_gguf_parameters)
- Emit vision keys using the standard naming: `clip.has_vision_encoder`, `clip.vision.image_size/patch_size/embedding_length/block_count/projection_dim/feed_forward_length/attention.head_count`.
- Write only `projector_type` (set to 'jinaclip2'); do not introduce `projector_version`. (A metadata-emission sketch follows these notes.)
- Inference (mtmd)
- Use ggml_rope_ext to implement 2D RoPE; reuse bicubic for image preprocessing.
- Minimal Validation
- Conversion succeeds; gguf_dump shows clip.projector_type='jinaclip2'.
- Minimal inference passes for both text and image; C++ vs Python cosine/RMSE are within the expected range.
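For reference, a standalone sketch of the metadata emission described in the notes above, written against gguf-py directly rather than the converter's actual set_gguf_parameters; the image size, patch size, depth, hidden size, rope_theta, and projector_type values come from this PR, while the head count, FFN width, and projection_dim are illustrative placeholders:

```python
# sketch only -- not the PR's converter code
from gguf import GGUFWriter

w = GGUFWriter("mmproj-jina-vision-sketch.gguf", "clip")
w.add_bool("clip.has_vision_encoder", True)
w.add_string("clip.projector_type", "jinaclip2")        # only projector_type, no projector_version
w.add_uint32("clip.vision.image_size", 512)             # from this PR (image_size=512)
w.add_uint32("clip.vision.patch_size", 14)              # from this PR (patch=14)
w.add_uint32("clip.vision.embedding_length", 1024)      # from this PR (hidden=1024)
w.add_uint32("clip.vision.block_count", 24)             # from this PR (depth=24)
w.add_uint32("clip.vision.projection_dim", 1024)        # placeholder value
w.add_uint32("clip.vision.feed_forward_length", 4096)   # placeholder value
w.add_uint32("clip.vision.attention.head_count", 16)    # placeholder value
w.add_float32("clip.vision.rope_theta", 10000.0)        # default per the notes above
w.write_header_to_file()
w.write_kv_data_to_file()
w.write_tensors_to_file()   # metadata-only sketch, no tensors added
w.close()
```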
Reproduction
Minimal commands & data (CPU)
- Produce GGUF (with ST pooling metadata)
  - Text: `jina-bert-v3.pooling_type = MEAN/CLS/LAST`
  - Vision: `clip.projector_type = jinaclip2`, `clip.vision.rope_theta = 10000` (default)
- Text parity
  - C++: `CUDA_VISIBLE_DEVICES= ./build/bin/llama-embedding -m /path/jina-text-converted.gguf -p "hello world" --n-gpu-layers 0 --pooling mean --embd-normalize 2 --embd-output-format array`
  - Python: `python3 <ref>/debug.py --mode text --input "hello world" --out-dir <dir> --fa off`
  - Metric: read both 512-d outputs and compute cosine / RMSE (see the parity sketch after this list)
- Image parity
  - C++: `CUDA_VISIBLE_DEVICES= ./build/bin/llama-mtmd-cli --mmproj /path/mmproj-jina-vision-converted.gguf --image /path/img.jpg --n-gpu-layers 0 --embd-normalize 2 --embd-output-format array`
  - Python: `python3 <ref>/debug.py --mode image --input /path/img.jpg --out-dir <dir> --fa off`
  - Metric: read both 512-d outputs and compute cosine / RMSE
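For the Metric steps above, a small numpy sketch of the comparison; the file names and the whitespace-separated dump format are assumptions, so adapt the loader to however the two 512-d vectors are saved:

```python
import numpy as np

def load_vec(path: str) -> np.ndarray:
    # assumes one whitespace-separated vector per file
    return np.loadtxt(path).ravel()

cpp = load_vec("cpp_embd.txt")     # placeholder: embedding dumped by the C++ CLI
ref = load_vec("python_embd.txt")  # placeholder: embedding dumped by debug.py
assert cpp.shape == ref.shape, (cpp.shape, ref.shape)

cos  = float(np.dot(cpp, ref) / (np.linalg.norm(cpp) * np.linalg.norm(ref)))
rmse = float(np.sqrt(np.mean((cpp - ref) ** 2)))
print(f"cosine={cos:.6f} rmse={rmse:.6f}")
```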
mtmd: Add JinaCLIP v2 vision projector + GGUF support for jina-bert-v3 (merged-LoRA or adapter)
Overview
- Converter: write jina-bert-v3 text tower params into GGUF (supports both merged-LoRA checkpoints and adapter-based inputs), and export vision metadata (projector_type=jinaclip, vision.rope_theta, image_size, patch_size, projection_dim, etc.).
- Runtime: introduce `PROJECTOR_TYPE_JINACLIP` in the MTMD path (JinaCLIP v2 vision tower: 2D RoPE with a shared frequency cache, attention/FFN internal LayerNorm, single-token output), and normalize with `common_embd_normalize(..., 2)` (a numpy equivalent is sketched after this list).
- CLI (core): add a minimal validation tool, `llama-jinaclip-cli` (built by default), for text/image embedding numerical/performance checks; it depends only on common + mtmd + Threads, builds cross-platform, and has no third-party dependencies.
- Compatibility: only activates when the related GGUF metadata exists; does not affect other projectors (e.g., LLaVA/Qwen2VL); no ggml op changes; no external dependencies.
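For the Python side of the parity check, the final normalization is plain L2 normalization; a numpy equivalent of what `common_embd_normalize(..., 2)` produces (a sketch, not the C++ implementation):

```python
import numpy as np

def l2_normalize(v: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    # divide by the Euclidean norm, guarding against an all-zero vector
    return v / max(float(np.linalg.norm(v)), eps)

print(l2_normalize(np.array([3.0, 4.0])))  # -> [0.6 0.8]
```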
Scope of changes
- convert_hf_to_gguf.py
- Text: support both merged-LoRA single checkpoints and adapter-based export.
- Vision (JinaCLIP v2): export `clip.projector_type=jinaclip`, `clip.vision.rope_theta` (configurable), image_size/patch_size/projection_dim, and map tensors for fused/non-fused QKV (a split sketch follows this list).
- tools/mtmd/clip.cpp, tools/mtmd/clip-impl.h
- Add `PROJECTOR_TYPE_JINACLIP`: JinaCLIP v2 vision tower (2D RoPE with shared freq cache), attention internal LN, FFN sub-layer LN (enabled when both weight and bias are present), single-token output (CLS-equivalent), unified L2 normalization.
- `clip_n_output_tokens()` returns 1 for JinaCLIP; `clip_n_mmproj_embd()` returns projection_dim.
- tools/mtmd/jinaclip-cli.cpp, tools/mtmd/CMakeLists.txt
- Add the `llama-jinaclip-cli` target (built by default); one command covers minimal text/image validation, thread scaling, encode_ms reporting, and saves embeddings for Python parity.
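To illustrate the fused/non-fused QKV mapping mentioned in the converter item above, a minimal sketch; the tensor names and the "split into equal thirds" assumption are illustrative, not the converter's exact mapping table:

```python
import numpy as np

def split_fused_qkv(name: str, tensor: np.ndarray) -> dict[str, np.ndarray]:
    # non-fused checkpoints already ship separate q/k/v tensors: pass them through
    if "qkv" not in name:
        return {name: tensor}
    # fused checkpoints stack q, k, v along the output dimension: split into thirds
    assert tensor.shape[0] % 3 == 0, "fused QKV first dim must be divisible by 3"
    q, k, v = np.split(tensor, 3, axis=0)
    return {
        name.replace("qkv", "q"): q,
        name.replace("qkv", "k"): k,
        name.replace("qkv", "v"): v,
    }

# toy example: a fused (3*8, 8) weight becomes three (8, 8) weights
print({k: t.shape for k, t in split_fused_qkv("blocks.0.attn.qkv.weight",
                                              np.zeros((24, 8))).items()})
```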
Validation summary
- CI: CPU-only `ci/run.sh` passes locally; no ggml op changes in this PR.
- Correctness: embedding models have no perplexity; we verify via C++ vs Python parity.
- TEXT (CPU, minimal sample): cosine=0.999996, RMSE=0.000125
- IMAGE (CPU, minimal sample): cosine=0.990261, RMSE=0.006168
- Performance: checked with the CLI's `encode_ms` and thread scaling; no regression observed. More data can be added if requested.
- Compatibility: activated only when GGUF metadata (`projector_type=jinaclip`, etc.) is present; other projectors unaffected.
- Reference: ModelScope uniontech-yourong/split_jina (used for Python-side parity).
Performance (absolute metrics, CPU-only minimal samples)
- Environment
- OS: Ubuntu 22.04.5 LTS
- CPU: Intel Xeon Platinum 8352V (dual-socket, 2×32C/64T, SMT on), 128 threads total
- Build: Release, GGML_CUDA=OFF (CPU-only), GCC 11.4, CMake 3.22
- Model: JinaCLIP v2 vision tower (image_size=512, patch=14, depth=24, hidden=1024; official: https://huggingface.co/jinaai/jina-clip-v2); text tower (Jina Embeddings v3, output truncated to 512 dims)
- Threads: primarily 8 threads for both text/image (with 1-thread comparison)
- Metric definitions
- Text: use CLI-reported JINACLIP_ENCODE_MS (pure inference, excludes load)
- Image: use CLI line “image … done in … ms” (pure inference, excludes load)
- Results (single sample, minimal)
- Text (“hello world”, ≈5 tokens)
- 1 thread: encode_ms ≈ 180.48 ms
- 8 threads: encode_ms ≈ 34.08 ms
- Image (512×512, single)
- 8 threads: image done in ≈ 6154 ms (stabilizes ~6.1–6.4 s after warm-up)
- Notes
- Above numbers are CPU-only pure inference; end-to-end (including model load) is higher and not included.
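A rough harness for the thread-scaling check reported above; the binary path, flags, and the JINACLIP_ENCODE_MS log line are assumptions based on this PR's CLI as described here, so adjust them to your build:

```python
import re
import subprocess

BASE_CMD = ["./build/bin/llama-jinaclip-cli",          # placeholder invocation
            "-m", "/path/jina-text-converted.gguf",
            "-p", "hello world"]

for n_threads in (1, 2, 4, 8):
    proc = subprocess.run(BASE_CMD + ["-t", str(n_threads)],
                          capture_output=True, text=True)
    log = proc.stdout + proc.stderr
    m = re.search(r"JINACLIP_ENCODE_MS\D*([0-9.]+)", log)   # pure-inference time, excludes load
    print(f"threads={n_threads:2d} encode_ms={m.group(1) if m else 'not found'}")
```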
GPU group (absolute metrics, minimal samples)
- Environment
- GPU: NVIDIA vGPU-32GB (cc=8.9, 32 GB), Driver 550.107, CUDA 12.4
- Build: Release, GGML_CUDA=ON (CUDA backend), CUDA arch=89
- Threads: -t 8 (host-side preprocessing threads)
- Results (pure inference, excludes load)
- Text (“hello world”, ≈5 tokens): encode_ms ≈ 84.88 ms
- Image (512×512, single): image done in ≈ 827 ms
@pockers21 What's up?
I’m currently adjusting the code and fixing issues. I originally planned to answer your questions when moving the PR from draft to a formal PR, but let me explain now. The link you shared (https://huggingface.co/jinaai/jina-clip-v2/blob/main/config.json#L15-L38) points to the official Jina config that includes LoRA. In our work, we modified the official Jina model to fuse the text-side LoRA into the base model and then exported it to GGUF. Under the Jina loading logic, those fields don’t take effect when loading jina-clip-v2; they are only triggered when loading the jina-embeddings-v3 model.
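For readers unfamiliar with the merge step mentioned here: "fusing the text-side LoRA into the base model" means folding the low-rank update into each base weight before export. A minimal numpy sketch of that standard operation (shapes and scaling are shown for illustration; this is not the exact script used):

```python
import numpy as np

def merge_lora(w_base: np.ndarray, lora_a: np.ndarray, lora_b: np.ndarray,
               alpha: float, rank: int) -> np.ndarray:
    # standard LoRA merge: W' = W + (alpha / r) * B @ A
    # w_base: (out, in), lora_a: (r, in), lora_b: (out, r)
    return w_base + (alpha / rank) * (lora_b @ lora_a)

# toy shapes just to show the contract
w = np.zeros((8, 4), dtype=np.float32)
a = np.random.randn(2, 4).astype(np.float32)  # LoRA A
b = np.random.randn(8, 2).astype(np.float32)  # LoRA B
print(merge_lora(w, a, b, alpha=4.0, rank=2).shape)  # (8, 4)
```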
@pockers21 You need to address the tensor mappings, as pointed out by @ngxson, use tensor_mapping.py where possible.
Done, please review again.
Hmmm, there's a major issue with conversion: the text_config is normally applied on top of the remote jina-embeddings-v3 config.json by transformers, however convert_hf_to_gguf.py has no concept of this when reading the jina-clip-v2 config.json (because trust_remote_code=False).
This means that:
- It thinks that the text model architecture is JinaCLIPModel and fails
- The vision model conversion fails: assert self.n_embd_text > 0, "n_embd not found in hparams"
I tried kludging it by copying in values, but I got several other failures, so it's just not working...
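To make the gap concrete, here is a rough sketch of the overlay that transformers performs (with trust_remote_code=True) and that convert_hf_to_gguf.py never sees; the paths are placeholders and this is meant to explain the problem, not to propose a fix:

```python
import json

# jina-clip-v2's config.json embeds a text_config that is meant to be applied on top
# of the remote jina-embeddings-v3 config.json; without that overlay the converter
# cannot see text-tower hparams such as hidden_size.
with open("jina-clip-v2/config.json") as f:         # placeholder path
    clip_cfg = json.load(f)
with open("jina-embeddings-v3/config.json") as f:   # placeholder path (remote base config)
    text_base = json.load(f)

# rough equivalent of the overlay: base config first, then the embedded overrides
text_hparams = {**text_base, **clip_cfg.get("text_config", {})}
print(text_hparams.get("hidden_size"))
```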
The original Jina model is a single multi-modal checkpoint that contains both text and vision components, and the text side includes a LoRA head. In our workflow, we did two things:
- We split this original model into separate text and vision parts.
- We merged the text LoRA head back into the text encoder weights.
If you want to run conversion, you should follow the layout used here: https://www.modelscope.cn/models/uniontech-yourong/split_jina/files
Concretely, our implementation assumes that you:
- git clone that model repo locally (after opening the page and clicking the “Download model” button, ModelScope will show you the exact clone/download command).
- After the download, run:
  python3 convert_hf_to_gguf.py ORIG_IMAGE_PATH --outfile out.gguf --mmproj
Here, ORIG_IMAGE_PATH must point to the split_jina/image directory. In that directory you will see a vision_model_weights.bin file, which is what the converter expects to load for the vision encoder.
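A quick sanity check of that layout before running the conversion (the split_jina/image path and vision_model_weights.bin name come from the repo above; the script itself is just an illustration):

```python
from pathlib import Path

orig_image_path = Path("split_jina/image")            # what ORIG_IMAGE_PATH should point to
weights = orig_image_path / "vision_model_weights.bin"

if not weights.is_file():
    raise SystemExit(f"missing {weights}: clone the split_jina repo and point "
                     f"ORIG_IMAGE_PATH at its image/ subdirectory")
print(f"found {weights} ({weights.stat().st_size / 1e6:.1f} MB); ready to convert")
```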
Looking forward to your feedback.
TBH, I'm not sure this is acceptable; I would expect to be able to convert the original model. Granted, it's a little tricky due to the way it's constructed, but it should be doable.
It might be acceptable to have a preprocessing script for it, but that's not ideal, @ngxson any opinions?
Some thoughts about this PR:
- The problem with LoRA should be addressed in a dedicated PR, as it is completely unrelated to the multimodal system. Merging LoRA into the model seems counter-intuitive because the LoRA is only used for query; merging will completely cut off its capability to act as an indexer.
- I don't get why we need to implement a llama_context-less version of mtmd. The original model contains both text and image encoder, we can just use the text model to initialize the llama_context
- The array of mtmd_mmproj_* APIs that you implemented in this PR is too model-specific. I don't think we need to add any additional APIs; the existing mtmd API is already enough.
So unfortunately, unless (1) is resolved, I don't think we can proceed to merge this PR.