
mtmd: Add JinaCLIP v2 vision projector + GGUF support for jina-bert-v3 (merged-LoRA or adapter)


Update Notes (2025-11-06)

  • CLI Merge
    • Fold the standalone Jina CLI into mtmd-cli’s projector‑only flow; remove the extra binary.
  • Conversion Script (set_gguf_parameters)
    • Emit vision keys using the standard naming: clip.has_vision_encoder, clip.vision.image_size/patch_size/embedding_length/block_count/projection_dim/feed_forward_length/attention.head_count (see the metadata sketch at the end of these notes).
    • Write only projector_type (set to 'jinaclip2'); do not introduce a separate projector_version key.
  • Inference (mtmd)
    • Use ggml_rope_ext to implement the 2D RoPE; reuse the existing bicubic resize for image preprocessing.
  • Minimal Validation
    • Conversion succeeds; gguf_dump shows clip.projector_type='jinaclip2'.
    • Minimal inference passes for both text and image; C++ vs Python cosine/RMSE are within the expected range.
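
As a companion to the metadata list above, here is a minimal, hypothetical sketch of emitting these keys with the gguf Python package's generic writer methods. The real change lives in convert_hf_to_gguf.py's set_gguf_parameters; the head count and feed-forward length below are assumptions that should be read from the model config, not values taken from this PR.

    import gguf

    # Standalone sketch: write only the vision metadata (no tensors).
    # image_size/patch_size/depth/hidden match the numbers quoted later in this
    # thread; head_count and feed_forward_length are assumed placeholders.
    writer = gguf.GGUFWriter("mmproj-jina-vision.gguf", arch="clip")
    writer.add_bool("clip.has_vision_encoder", True)
    writer.add_string("clip.projector_type", "jinaclip2")
    writer.add_uint32("clip.vision.image_size", 512)
    writer.add_uint32("clip.vision.patch_size", 14)
    writer.add_uint32("clip.vision.embedding_length", 1024)
    writer.add_uint32("clip.vision.block_count", 24)
    writer.add_uint32("clip.vision.projection_dim", 512)
    writer.add_uint32("clip.vision.feed_forward_length", 4096)   # assumed
    writer.add_uint32("clip.vision.attention.head_count", 16)    # assumed
    writer.add_float32("clip.vision.rope_theta", 10000.0)

    writer.write_header_to_file()
    writer.write_kv_data_to_file()
    writer.write_tensors_to_file()
    writer.close()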

Reproduction

Minimal commands & data (CPU)
  • Produce GGUF (with ST pooling metadata)
    • Text: jina-bert-v3.pooling_type = MEAN/CLS/LAST
    • Vision: clip.projector_type = jinaclip2, clip.vision.rope_theta = 10000 (default)
  • Text parity
    • C++: CUDA_VISIBLE_DEVICES= ./build/bin/llama-embedding -m /path/jina-text-converted.gguf -p "hello world" --n-gpu-layers 0 --pooling mean --embd-normalize 2 --embd-output-format array
    • Python: python3 <ref>/debug.py --mode text --input "hello world" --out-dir <dir> --fa off
    • Metric: read both 512-d outputs and compute cosine / RMSE
  • Image parity
    • C++: CUDA_VISIBLE_DEVICES= ./build/bin/llama-mtmd-cli --mmproj /path/mmproj-jina-vision-converted.gguf --image /path/img.jpg --n-gpu-layers 0 --embd-normalize 2 --embd-output-format array
    • Python: python3 <ref>/debug.py --mode image --input /path/img.jpg --out-dir <dir> --fa off
    • Metric: read both 512-d outputs and compute cosine / RMSE (see the parity sketch after this list)
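
For reference, a short sketch of the parity metric (this is not the debug.py script); it assumes both 512-d embeddings have already been dumped to plain-text files, which is an assumed format rather than something specified above.

    import numpy as np

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def rmse(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.sqrt(np.mean((a - b) ** 2)))

    # Hypothetical file names; load the vectors however the C++ CLI and the
    # Python reference actually saved them.
    cpp_emb = np.loadtxt("cpp_embedding.txt")
    ref_emb = np.loadtxt("python_embedding.txt")
    print(f"cosine={cosine(cpp_emb, ref_emb):.6f}  rmse={rmse(cpp_emb, ref_emb):.6f}")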

mtmd: Add JinaCLIP v2 vision projector + GGUF support for jina-bert-v3 (merged-LoRA or adapter)

Overview

  • Converter: write jina-bert-v3 text tower params into GGUF (supports both merged-LoRA checkpoints and adapter-based inputs), and export vision metadata (projector_type=jinaclip, vision.rope_theta, image_size, patch_size, projection_dim, etc.).
  • Runtime: introduce PROJECTOR_TYPE_JINACLIP in the MTMD path (JinaCLIP v2 vision tower: 2D RoPE with shared frequency cache, attention/FFN internal LayerNorm, single-token output), and normalize with common_embd_normalize(..., 2).
  • CLI (core): add a minimal validation tool llama-jinaclip-cli (built by default) for text/image embedding numerical and performance checks; it depends only on common+mtmd+Threads, builds cross-platform, and has no third-party dependencies.
  • Compatibility: only activates when related GGUF metadata exists; doesn’t affect other projectors (e.g., LLaVA/Qwen2VL); no ggml op changes; no external dependencies.

Scope of changes

  • convert_hf_to_gguf.py
    • Text: support both merged-LoRA single checkpoints and adapter-based export.
    • Vision (JinaCLIP v2): export clip.projector_type=jinaclip, clip.vision.rope_theta (configurable), image_size/patch_size/projection_dim, and map tensors for fused/non-fused QKV.
  • tools/mtmd/clip.cpp, tools/mtmd/clip-impl.h
    • Add PROJECTOR_TYPE_JINACLIP: JinaCLIP v2 vision tower (2D RoPE with a shared frequency cache; a conceptual sketch follows this list), attention internal LN, FFN sub-layer LN (enabled only when both weight and bias are present), single-token output (CLS-equivalent), unified L2 normalization.
    • clip_n_output_tokens() returns 1 for JinaCLIP; clip_n_mmproj_embd() returns projection_dim.
  • tools/mtmd/jinaclip-cli.cpp, tools/mtmd/CMakeLists.txt
    • Add the llama-jinaclip-cli target (built by default); a single command covers minimal text/image validation, thread scaling, and encode_ms reporting, and saves embeddings for Python parity checks.
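
To make the 2D RoPE description concrete, here is a conceptual NumPy sketch, not the clip.cpp implementation (which goes through ggml_rope_ext): half of each head's channels are rotated by the patch's row index and the other half by its column index, sharing a single frequency table. The shapes and the channel layout are illustrative assumptions.

    import numpy as np

    def rope_2d(x: np.ndarray, grid: int, theta: float = 10000.0) -> np.ndarray:
        # x: [n_patches, head_dim] for one attention head, patches in row-major
        # order on a grid x grid layout; head_dim must be divisible by 4.
        n, d = x.shape
        half = d // 2                                      # channels per axis
        freqs = theta ** (-np.arange(0, half, 2) / half)   # shared frequency table
        rows = np.arange(n) // grid
        cols = np.arange(n) % grid

        def rotate(v: np.ndarray, pos: np.ndarray) -> np.ndarray:
            ang = np.outer(pos, freqs)                     # [n, half/2]
            cos, sin = np.cos(ang), np.sin(ang)
            v1, v2 = v[:, 0::2], v[:, 1::2]
            out = np.empty_like(v)
            out[:, 0::2] = v1 * cos - v2 * sin
            out[:, 1::2] = v1 * sin + v2 * cos
            return out

        # first half of the channels encodes the row position, second half the column
        return np.concatenate([rotate(x[:, :half], rows),
                               rotate(x[:, half:], cols)], axis=-1)

    # Example: a 64-dim head over a 4x4 patch grid (16 patches)
    out = rope_2d(np.random.randn(16, 64), grid=4)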

Validation summary

  • CI: CPU-only ci/run.sh passes locally; no ggml op changes in this PR.
  • Correctness: embedding models have no perplexity metric, so correctness is verified via C++ vs Python parity.
    • TEXT (CPU, minimal sample): cosine=0.999996, RMSE=0.000125
    • IMAGE (CPU, minimal sample): cosine=0.990261, RMSE=0.006168
  • Performance: checked with CLI encode_ms and thread scaling; no regression observed. More data can be added if requested.
  • Compatibility: activated only when GGUF metadata (projector_type=jinaclip, etc.) is present; other projectors unaffected.
  • Reference: ModelScope uniontech-yourong/split_jina (used for Python-side parity).

Performance (absolute metrics, CPU-only minimal samples)

  • Environment
    • OS: Ubuntu 22.04.5 LTS
    • CPU: Intel Xeon Platinum 8352V (dual-socket, 2×32C/64T, SMT on), 128 threads total
    • Build: Release, GGML_CUDA=OFF (CPU-only), GCC 11.4, CMake 3.22
    • Model: JinaCLIP v2 vision tower (image_size=512, patch=14, depth=24, hidden=1024; official: https://huggingface.co/jinaai/jina-clip-v2); text tower (Jina Embeddings v3, output truncated to 512 dims)
    • Threads: primarily 8 threads for both text/image (with 1-thread comparison)
  • Metric definitions
    • Text: use CLI-reported JINACLIP_ENCODE_MS (pure inference, excludes load)
    • Image: use CLI line “image … done in … ms” (pure inference, excludes load)
  • Results (single sample, minimal)
    • Text (“hello world”, ≈5 tokens)
      • 1 thread: encode_ms ≈ 180.48 ms
      • 8 threads: encode_ms ≈ 34.08 ms
    • Image (512×512, single)
      • 8 threads: image done in ≈ 6154 ms (stabilizes ~6.1–6.4 s after warm-up)
  • Notes
    • Above numbers are CPU-only pure inference; end-to-end (including model load) is higher and not included.

GPU group (absolute metrics, minimal samples)

  • Environment
    • GPU: NVIDIA vGPU-32GB (cc=8.9, 32 GB), Driver 550.107, CUDA 12.4
    • Build: Release, GGML_CUDA=ON (CUDA backend), CUDA arch=89
    • Threads: -t 8 (host-side preprocessing threads)
  • Results (pure inference, excludes load)
    • Text (“hello world”, ≈5 tokens): encode_ms ≈ 84.88 ms
    • Image (512×512, single): image done in ≈ 827 ms

pockers21 · Oct 14, 2025

@pockers21 What's up?

CISC · Oct 28, 2025

> @pockers21 What's up?

I’m currently adjusting the code and fixing issues. I originally planned to answer your questions all at once when moving the PR from draft to a formal PR, but let me explain now. The link you shared (https://huggingface.co/jinaai/jina-clip-v2/blob/main/config.json#L15-L38) points to the official Jina config that includes LoRA. In our work, we modified the official Jina model to fuse the text-side LoRA into the base model and then exported it to GGUF. Under Jina's loading logic, those fields won't take effect when loading Jina v2; they are only triggered when loading the embeddings v3 model.

pockers21 · Oct 29, 2025

@pockers21 You need to address the tensor mappings, as pointed out by @ngxson, use tensor_mapping.py where possible.

CISC · Nov 15, 2025

> @pockers21 You need to address the tensor mappings, as pointed out by @ngxson, use tensor_mapping.py where possible.

Done, please review again.

pockers21 · Nov 20, 2025

Hmmm, there's a major issue with conversion, the text_config is normally applied on top of the remote jina-embeddings-v3 config.json by transformers, however convert_hf_to_gguf.py has no concept of this when reading the jina-clip-v2 config.json (because trust_remote_code=False).

This means that:

  1. It thinks that the text model architecture is JinaCLIPModel and fails
  2. The vision model conversion fails assert self.n_embd_text > 0, "n_embd not found in hparams"

I tried kludging it by copying in values, but I got several other failures, so it's just not working...

CISC · Nov 20, 2025

> Hmmm, there's a major issue with conversion, the text_config is normally applied on top of the remote jina-embeddings-v3 config.json by transformers, however convert_hf_to_gguf.py has no concept of this when reading the jina-clip-v2 config.json (because trust_remote_code=False).
>
> This means that:
>
>   1. It thinks that the text model architecture is JinaCLIPModel and fails
>   2. The vision model conversion fails assert self.n_embd_text > 0, "n_embd not found in hparams"
>
> I tried kludging it by copying in values, but I got several other failures, so it's just not working...

The original Jina model is a single multi-modal checkpoint that contains both text and vision components, and the text side includes a LoRA head. In our workflow, we did two things:

  1. We split this original model into separate text and vision parts.
  2. We merged the text LoRA head back into the text encoder weights.

If you want to run conversion, you should follow the layout used here: https://www.modelscope.cn/models/uniontech-yourong/split_jina/files

Concretely, our implementation assumes that you:

  1. git clone that model repo locally (after opening the page and clicking the “Download model” button, ModelScope will show you the exact clone/download command).

  2. After the download, run:

    python3 convert_hf_to_gguf.py ORIG_IMAGE_PATH --outfile out.gguf --mmproj

Here, ORIG_IMAGE_PATH must point to the split_jina/image directory. In that directory you will see a vision_model_weights.bin file, which is what the converter expects to load for the vision encoder.

pockers21 · Nov 21, 2025

> Hmmm, there's a major issue with conversion, the text_config is normally applied on top of the remote jina-embeddings-v3 config.json by transformers, however convert_hf_to_gguf.py has no concept of this when reading the jina-clip-v2 config.json (because trust_remote_code=False).
>
> This means that:
>
>   1. It thinks that the text model architecture is JinaCLIPModel and fails
>   2. The vision model conversion fails assert self.n_embd_text > 0, "n_embd not found in hparams"
>
> I tried kludging it by copying in values, but I got several other failures, so it's just not working...

Looking forward to your feedback.

pockers21 · Nov 27, 2025

TBH, I'm not sure this is acceptable; I would expect to be able to convert the original model. Granted, it's a little tricky due to the way it's constructed, but it should be doable.

It might be acceptable to have a preprocessing script for it, but that's not ideal. @ngxson, any opinions?

CISC · Nov 28, 2025

Some thoughts about this PR:

  1. The problem with LoRA should be addressed in a dedicated PR, as it is completely unrelated to the multimodal system. Merging the LoRA into the model seems counter-intuitive because the LoRA is only used for the query task; merging will completely cut off its capability to act as an indexer.
  2. I don't get why we need to implement a llama_context-less version of mtmd. The original model contains both a text and an image encoder; we can just use the text model to initialize the llama_context.
  3. The array of mtmd_mmproj_* APIs that you implemented in this PR is too model-specific. I don't think we need to add any additional APIs; the existing mtmd API is already enough.

So unfortunately, unless (1) is resolved, I don't think we can proceed to merge this PR.

ngxson · Nov 30, 2025