support MiniCPM-V-2
I created a folder called "minicpmv" in the examples folder of llama.cpp.
More details can be found in llama.cpp/examples/minicpmv/README.md.
The code is based on examples/llava, but the vision part is quite different.
llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 555 iterations
Expand details for performance related PR only
- Concurrent users: 8, duration: 10m
- HTTP request : avg=8400.3ms p(95)=19734.55ms fails=, finish reason: stop=511 truncated=44
- Prompt processing (pp): avg=103.65tk/s p(95)=452.32tk/s
- Token generation (tg): avg=34.33tk/s p(95)=46.58tk/s
- ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=feat-minicpmv commit=70a23863dcff7458839960b304ea166a401d4d8e
[Chart: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 555 iterations; y-axis: llamacpp:prompt_tokens_seconds; x-axis: time 1716473680 to 1716474300]
[Chart: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 555 iterations; y-axis: llamacpp:predicted_tokens_seconds; x-axis: time 1716473680 to 1716474300]
[Chart: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 555 iterations; y-axis: llamacpp:kv_cache_usage_ratio; x-axis: time 1716473680 to 1716474300]
[Chart: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 555 iterations; y-axis: llamacpp:requests_processing; x-axis: time 1716473680 to 1716474300]
Encountered some bugs when building this PR on Windows.
Log here: https://github.com/MZWNET/actions/actions/runs/8896316696/job/24430759871
Can 6c1c4b4 fix this?
Got another bug when converting the image encoder to GGUF.
Log:
python3 ./examples/minicpmv/convert-image-encoder-to-gguf.py -m ../MiniCPM-V-2 --llava-projector ../MiniCPM-V-2/llava.projector --output-dir ../MiniCPM-V-2-GGUF --image-mean 0.5 0.5 0.5 --image-std 0.5 0.5 0.5
gguf: This GGUF file is for Little Endian only
Traceback (most recent call last):
File "/content/llama.cpp/./examples/minicpmv/convert-image-encoder-to-gguf.py", line 295, in <module>
data = data.squeeze().numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
BTW it may be better if you can add device_map="auto" in minicpm-surgery.py#L12&L42 :)
It can take full advantage of a GPU :)
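A minimal sketch of that suggestion (the model path and the AutoModel class are assumptions, not the exact surgery-script code): loading the checkpoint with device_map="auto" lets transformers/accelerate place the layers on the GPU when one is available.

from transformers import AutoModel

# Hypothetical loading call for illustration; requires the `accelerate` package.
model = AutoModel.from_pretrained(
    "../MiniCPM-V-2",         # same -m path used elsewhere in this thread
    trust_remote_code=True,   # MiniCPM-V-2 ships custom modelling code
    device_map="auto",        # spread weights across GPU/CPU automatically
)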
I followed examples/llava/convert-image-encoder-to-gguf.py. It seems that they also don't use .cpu() there, and in my environment the model is loaded to CPU by default.
We should set it so the device defaults to CPU and can optionally be set to GPU if desired.
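One simple way to express that default (a sketch only; the --device flag is an assumption, not an existing option of the conversion script):

import argparse
import torch

ap = argparse.ArgumentParser()
ap.add_argument("--device", default="cpu", choices=["cpu", "cuda"],
                help="device to load the checkpoint on; CPU keeps the .numpy() conversion simple")
args = ap.parse_args()
device = torch.device(args.device)  # default stays on CPU; GPU only when explicitly requested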
I fixed my problem by editing minicpmv-surgery.py.
Edited version:
# store these tensors in a new dictionary and torch.save them
projector = {name: checkpoint[name].float().cpu() for name in mm_tensors}
torch.save(projector, f"{args.model}/llava.projector")

clip_tensors = [k for k, v in checkpoint.items() if k.startswith("vpm")]
if len(clip_tensors) > 0:
    clip = {name.replace("vpm.", ""): checkpoint[name].float().cpu() for name in clip_tensors}
    torch.save(clip, f"{args.model}/llava.clip")
I think it would be better to add the .cpu(), since it has no impact (maybe?) in a CPU-only environment and, with device_map="auto", we can make good use of the GPU.
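A small self-contained illustration of why the .cpu() call matters (not the actual script): Tensor.numpy() only works on CPU tensors, so a CUDA-resident tensor has to be moved first, while .cpu() is effectively a no-op for tensors that already live on the CPU.

import torch

data = torch.randn(1, 64, device="cuda" if torch.cuda.is_available() else "cpu")
# data.squeeze().numpy()             # raises TypeError when data is on cuda:0
arr = data.squeeze().cpu().numpy()   # works on both CPU-only and GPU machines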
Got another bug.
Log:
> python3 ./examples/minicpmv/minicpm-surgery.py -m ../MiniCPM-V-2
Loading checkpoint shards: 100% 2/2 [00:34<00:00, 17.07s/it]
Done!
Now you can convert ../MiniCPM-V-2 to a regular LLaMA GGUF file.
Also, use ../MiniCPM-V-2/llava.projector to prepare a llava-encoder.gguf file.
> python3 ./examples/minicpmv/convert-image-encoder-to-gguf.py -m ../MiniCPM-V-2 --llava-projector ../MiniCPM-V-2/llava.projector --output-dir ../MiniCPM-V-2-GGUF --image-mean 0.5 0.5 0.5 --image-std 0.5 0.5 0.5
gguf: This GGUF file is for Little Endian only
Converting to float32
resampler.pos_embed - f32 - shape = (64, 2304)
Converting to float32
...(too long, ignore)
v.post_ln.weight - f32 - shape = (1152,)
Converting to float32
v.post_ln.bias - f32 - shape = (1152,)
Done. Output file: ../MiniCPM-V-2-GGUF/mmproj-model-f16.gguf
> python3 ./convert-hf-to-gguf.py --outtype f16 --outfile ../MiniCPM-V-2-GGUF/MiniCPM-V-2.F16.gguf ../MiniCPM-V-2/MiniCPM
Loading model: MiniCPM
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
Traceback (most recent call last):
File "/content/llama.cpp/./convert-hf-to-gguf.py", line 2809, in <module>
main()
File "/content/llama.cpp/./convert-hf-to-gguf.py", line 2796, in main
model_instance.set_vocab()
File "/content/llama.cpp/./convert-hf-to-gguf.py", line 1645, in set_vocab
self._set_vocab_llama_hf()
File "/content/llama.cpp/./convert-hf-to-gguf.py", line 377, in _set_vocab_llama_hf
vocab = LlamaHfVocab(self.dir_model)
File "/content/llama.cpp/convert.py", line 523, in __init__
with open(fname_tokenizer, encoding='utf-8') as f:
FileNotFoundError: [Errno 2] No such file or directory: '../MiniCPM-V-2/MiniCPM/tokenizer.json'
> cp ../MiniCPM-V-2/tokenizer.json ../MiniCPM-V-2/MiniCPM/
> python3 ./convert-hf-to-gguf.py --outtype f16 --outfile ../MiniCPM-V-2-GGUF/MiniCPM-V-2.F16.gguf ../MiniCPM-V-2/MiniCPM
Loading model: MiniCPM
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
The repository for ../MiniCPM-V-2/MiniCPM contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co/../MiniCPM-V-2/MiniCPM.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.
Do you wish to run the custom code? [y/N] y
Traceback (most recent call last):
File "/content/llama.cpp/./convert-hf-to-gguf.py", line 2809, in <module>
main()
File "/content/llama.cpp/./convert-hf-to-gguf.py", line 2796, in main
model_instance.set_vocab()
File "/content/llama.cpp/./convert-hf-to-gguf.py", line 1645, in set_vocab
self._set_vocab_llama_hf()
File "/content/llama.cpp/./convert-hf-to-gguf.py", line 377, in _set_vocab_llama_hf
vocab = LlamaHfVocab(self.dir_model)
File "/content/llama.cpp/convert.py", line 556, in __init__
assert self.tokenizer.is_fast # assume tokenizer.json is used
AssertionError
Does your MiniCPM-V-2 folder have tokenizer.json? It is a newly uploaded file in https://huggingface.co/openbmb/MiniCPM-V-2/tree/main.
Yes, I confirm that.
Log:
> ls ../MiniCPM-V-2/ -alh
total 8.0G
drwxr-xr-x 4 root root 4.0K May 2 06:17 .
drwxr-xr-x 1 root root 4.0K May 2 06:16 ..
drwxr-xr-x 2 root root 4.0K May 2 06:13 assets
-rw-r--r-- 1 root root 1.2K May 2 06:13 config.json
-rw-r--r-- 1 root root 11K May 2 06:13 configuration_minicpm.py
-rw-r--r-- 1 root root 111 May 2 06:13 generation_config.json
-rw-r--r-- 1 root root 1.7K May 2 06:13 .gitattributes
-rw-r--r-- 1 root root 1.5G May 2 06:17 llava.clip
-rw-r--r-- 1 root root 113M May 2 06:17 llava.projector
drwxr-xr-x 2 root root 4.0K May 2 06:21 MiniCPM
-rw-r--r-- 1 root root 4.7G May 2 06:14 model-00001-of-00002.safetensors
-rw-r--r-- 1 root root 1.8G May 2 06:13 model-00002-of-00002.safetensors
-rw-r--r-- 1 root root 70K May 2 06:13 modeling_minicpm.py
-rw-r--r-- 1 root root 20K May 2 06:13 modeling_minicpmv.py
-rw-r--r-- 1 root root 54K May 2 06:13 model.safetensors.index.json
-rw-r--r-- 1 root root 9.2K May 2 06:13 README.md
-rw-r--r-- 1 root root 5.5K May 2 06:13 resampler.py
-rw-r--r-- 1 root root 651 May 2 06:13 special_tokens_map.json
-rw-r--r-- 1 root root 3.3K May 2 06:13 tokenizer_config.json
-rw-r--r-- 1 root root 6.0M May 2 06:13 tokenizer.json
-rw-r--r-- 1 root root 2.0M May 2 06:13 tokenizer.model
> ls ../MiniCPM-V-2/MiniCPM -alh
total 12G
drwxr-xr-x 2 root root 4.0K May 2 06:21 .
drwxr-xr-x 4 root root 4.0K May 2 06:17 ..
-rw-r--r-- 1 root root 1.5K May 2 06:17 config.json
-rw-r--r-- 1 root root 11K May 2 06:21 configuration_minicpm.py
-rw-r--r-- 1 root root 111 May 2 06:17 generation_config.json
-rw-r--r-- 1 root root 4.7G May 2 06:18 model-00001-of-00003.safetensors
-rw-r--r-- 1 root root 4.7G May 2 06:20 model-00002-of-00003.safetensors
-rw-r--r-- 1 root root 2.0G May 2 06:21 model-00003-of-00003.safetensors
-rw-r--r-- 1 root root 70K May 2 06:21 modeling_minicpm.py
-rw-r--r-- 1 root root 20K May 2 06:21 modeling_minicpmv.py
-rw-r--r-- 1 root root 30K May 2 06:21 model.safetensors.index.json
-rw-r--r-- 1 root root 5.5K May 2 06:21 resampler.py
-rw-r--r-- 1 root root 765 May 2 06:21 special_tokens_map.json
-rw-r--r-- 1 root root 3.4K May 2 06:21 tokenizer_config.json
-rw-r--r-- 1 root root 2.0M May 2 06:21 tokenizer.model
So it seems that the save_pretrained method in surgery.py does not save the tokenizer.json file. I manually copied tokenizer.json into the MiniCPM sub-folder before tokenizer.json was uploaded, and I wondered whether the manual copy was no longer needed now that it is uploaded. Seems I was wrong.
However, even if I copy it into the sub-folder, the conversion script still doesn't work :(
See here: https://github.com/ggerganov/llama.cpp/pull/6919#issuecomment-2089645626
cp ../MiniCPM-V-2/tokenizer.json ../MiniCPM-V-2/MiniCPM/
fixed.
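If the surgery script copied tokenizer.json itself, the manual cp above would not be needed. A minimal sketch of that idea (paths assumed to match this thread, not actual code from minicpmv-surgery.py):

import shutil
from pathlib import Path

model_dir = Path("../MiniCPM-V-2")                  # the -m path used above
src = model_dir / "tokenizer.json"                  # shipped in the HF repo
dst = model_dir / "MiniCPM" / "tokenizer.json"      # where convert-hf-to-gguf.py looks for it
if src.exists() and not dst.exists():
    shutil.copy2(src, dst)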
Converted successfully, thx!
However, I got a bad test result...
Log here:
> ./minicpmv-cli -ngl 1000000 -m ./MiniCPM-V-2-GGUF/MiniCPM-V-2.F16.gguf --mmproj ./MiniCPM-V-2-GGUF/mmproj-model-f16.gguf -c 4096 --temp 0.6 --top-p 0.8 --top-k 100 --repeat-penalty 1.0 --image ./mzwing.jpg -p "这张图里有什么?"
Log start
clip_model_load: description: image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment: 32
clip_model_load: n_tensors: 440
clip_model_load: n_kv: 18
clip_model_load: ftype: f16
clip_model_load: loaded meta data with 18 key-value pairs and 440 tensors from ./MiniCPM-V-2-GGUF/mmproj-model-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: clip.has_text_encoder bool = false
clip_model_load: - kv 2: clip.has_vision_encoder bool = true
clip_model_load: - kv 3: clip.has_llava_projector bool = true
clip_model_load: - kv 4: general.file_type u32 = 1
clip_model_load: - kv 5: general.description str = image encoder for LLaVA
clip_model_load: - kv 6: clip.projector_type str = resampler
clip_model_load: - kv 7: clip.vision.image_size u32 = 448
clip_model_load: - kv 8: clip.vision.patch_size u32 = 14
clip_model_load: - kv 9: clip.vision.embedding_length u32 = 1152
clip_model_load: - kv 10: clip.vision.feed_forward_length u32 = 4304
clip_model_load: - kv 11: clip.vision.projection_dim u32 = 0
clip_model_load: - kv 12: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 13: clip.vision.attention.layer_norm_epsilon f32 = 0.000001
clip_model_load: - kv 14: clip.vision.block_count u32 = 26
clip_model_load: - kv 15: clip.vision.image_mean arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 16: clip.vision.image_std arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 17: clip.use_gelu bool = true
clip_model_load: - type f32: 277 tensors
clip_model_load: - type f16: 163 tensors
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: Tesla T4, compute capability 7.5, VMM: yes
clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 1
clip_model_load: model size: 828.18 MB
clip_model_load: metadata size: 0.17 MB
clip_model_load: params backend buffer size = 828.18 MB (440 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 88.80 MB
llama_model_loader: loaded meta data with 22 key-value pairs and 363 tensors from ./MiniCPM-V-2-GGUF/MiniCPM-V-2.F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = minicpm
llama_model_loader: - kv 1: general.name str = MiniCPM
llama_model_loader: - kv 2: minicpm.context_length u32 = 4096
llama_model_loader: - kv 3: minicpm.embedding_length u32 = 2304
llama_model_loader: - kv 4: minicpm.block_count u32 = 40
llama_model_loader: - kv 5: minicpm.feed_forward_length u32 = 5760
llama_model_loader: - kv 6: minicpm.rope.dimension_count u32 = 64
llama_model_loader: - kv 7: minicpm.attention.head_count u32 = 36
llama_model_loader: - kv 8: minicpm.attention.head_count_kv u32 = 36
llama_model_loader: - kv 9: minicpm.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 1
llama_model_loader: - kv 11: minicpm.tie_lm_head bool = false
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,122753] = ["<unk>", "<s>", "</s>", "<SEP>", "<C...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,122753] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,122753] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type f16: 282 tensors
llm_load_vocab: mismatch in special tokens definition ( 3528/122753 vs 271/122753 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = minicpm
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 122753
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 2304
llm_load_print_meta: n_head = 36
llm_load_print_meta: n_head_kv = 36
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 2304
llm_load_print_meta: n_embd_v_gqa = 2304
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 5760
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 2B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 3.01 B
llm_load_print_meta: model size = 5.60 GiB (16.00 BPW)
llm_load_print_meta: general.name = MiniCPM
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 1099 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.37 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: CPU buffer size = 539.44 MiB
llm_load_tensors: CUDA0 buffer size = 5197.65 MiB
....................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 1440.00 MiB
llama_new_context_with_model: KV self size = 1440.00 MiB, K (f16): 720.00 MiB, V (f16): 720.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.47 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 314.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 12.51 MiB
llama_new_context_with_model: graph nodes = 1368
llama_new_context_with_model: graph splits = 2
encode_image_with_clip: image embedding created: 64 tokens
encode_image_with_clip: image encoded in 455.73 ms by CLIP ( 7.12 ms per image patch)
slice_image: multiple 1
<用户>这张图里有什么?
<AI>
。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。
llama_print_timings: load time = 27998.25 ms
llama_print_timings: sample time = 35.94 ms / 256 runs ( 0.14 ms per token, 7122.19 tokens per second)
llama_print_timings: prompt eval time = 196.38 ms / 80 tokens ( 2.45 ms per token, 407.36 tokens per second)
llama_print_timings: eval time = 7742.32 ms / 255 runs ( 30.36 ms per token, 32.94 tokens per second)
llama_print_timings: total time = 36056.03 ms / 335 tokens
The image I used for testing is my GitHub avatar.
My log is here; I can not reproduce your result:
./minicpmv-cli -m ../MiniCPM-V-2/MiniCPM/ggml-model-f16.gguf --mmproj ../MiniCPM-V-2/mmproj-model-f16.gguf -c 4096 --temp 0.6 --top-p 0.8 --top-k 100 --repeat-penalty 1.0 --image ../mzwing.jpg -p "这张图里有什么?"
Log start
clip_model_load: description: image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment: 32
clip_model_load: n_tensors: 440
clip_model_load: n_kv: 18
clip_model_load: ftype: f16
clip_model_load: loaded meta data with 18 key-value pairs and 440 tensors from ../MiniCPM-V-2/mmproj-model-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: clip.has_text_encoder bool = false
clip_model_load: - kv 2: clip.has_vision_encoder bool = true
clip_model_load: - kv 3: clip.has_llava_projector bool = true
clip_model_load: - kv 4: general.file_type u32 = 1
clip_model_load: - kv 5: general.description str = image encoder for LLaVA
clip_model_load: - kv 6: clip.projector_type str = resampler
clip_model_load: - kv 7: clip.vision.image_size u32 = 448
clip_model_load: - kv 8: clip.vision.patch_size u32 = 14
clip_model_load: - kv 9: clip.vision.embedding_length u32 = 1152
clip_model_load: - kv 10: clip.vision.feed_forward_length u32 = 4304
clip_model_load: - kv 11: clip.vision.projection_dim u32 = 0
clip_model_load: - kv 12: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 13: clip.vision.attention.layer_norm_epsilon f32 = 0.000001
clip_model_load: - kv 14: clip.vision.block_count u32 = 26
clip_model_load: - kv 15: clip.vision.image_mean arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 16: clip.vision.image_std arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 17: clip.use_gelu bool = true
clip_model_load: - type f32: 277 tensors
clip_model_load: - type f16: 163 tensors
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/acha/Desktop/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M1 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 22906.50 MB
clip_model_load: CLIP using Metal backend
clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 1
clip_model_load: model size: 828.18 MB
clip_model_load: metadata size: 0.17 MB
clip_model_load: params backend buffer size = 828.18 MB (440 tensors)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 828.19 MiB, ( 829.19 / 21845.34)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 88.81 MiB, ( 918.00 / 21845.34)
clip_model_load: compute allocated memory: 88.80 MB
llama_model_loader: loaded meta data with 22 key-value pairs and 363 tensors from ../MiniCPM-V-2/MiniCPM/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = minicpm
llama_model_loader: - kv 1: general.name str = MiniCPM
llama_model_loader: - kv 2: minicpm.context_length u32 = 4096
llama_model_loader: - kv 3: minicpm.embedding_length u32 = 2304
llama_model_loader: - kv 4: minicpm.block_count u32 = 40
llama_model_loader: - kv 5: minicpm.feed_forward_length u32 = 5760
llama_model_loader: - kv 6: minicpm.rope.dimension_count u32 = 64
llama_model_loader: - kv 7: minicpm.attention.head_count u32 = 36
llama_model_loader: - kv 8: minicpm.attention.head_count_kv u32 = 36
llama_model_loader: - kv 9: minicpm.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 1
llama_model_loader: - kv 11: minicpm.tie_lm_head bool = false
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,122753] = ["<unk>", "<s>", "</s>", "<SEP>", "<C...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,122753] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,122753] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type f16: 282 tensors
llm_load_vocab: mismatch in special tokens definition ( 3528/122753 vs 271/122753 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = minicpm
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 122753
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 2304
llm_load_print_meta: n_head = 36
llm_load_print_meta: n_head_kv = 36
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 2304
llm_load_print_meta: n_embd_v_gqa = 2304
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 5760
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 2B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 3.01 B
llm_load_print_meta: model size = 5.60 GiB (16.00 BPW)
llm_load_print_meta: general.name = MiniCPM
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 1099 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.37 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 5197.67 MiB, ( 6115.67 / 21845.34)
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: Metal buffer size = 5197.66 MiB
llm_load_tensors: CPU buffer size = 539.44 MiB
...................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/acha/Desktop/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M1 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 22906.50 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 1440.00 MiB, ( 7556.42 / 21845.34)
llama_kv_cache_init: Metal KV buffer size = 1440.00 MiB
llama_new_context_with_model: KV self size = 1440.00 MiB, K (f16): 720.00 MiB, V (f16): 720.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.47 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 314.02 MiB, ( 7870.44 / 21845.34)
llama_new_context_with_model: Metal compute buffer size = 314.00 MiB
llama_new_context_with_model: CPU compute buffer size = 12.51 MiB
llama_new_context_with_model: graph nodes = 1368
llama_new_context_with_model: graph splits = 2
encode_image_with_clip: image embedding created: 64 tokens
encode_image_with_clip: image encoded in 2532.64 ms by CLIP ( 39.57 ms per image patch)
slice_image: multiple 1
<用户>这张图里有什么?
<AI>
这幅图片展示了一个人站在一个看起来[…]的地方，周围是蓝色的海洋。这个人正在伸手去触碰天空中的鸟群，这些鸟群以一种抽象的方式排列成一条线。这幅画的风格是水彩，给人一种梦幻、宁静的感觉。颜色以蓝色和白色为主，蓝色象征着海洋和天空，白色则代表云彩和鸟群。
[Translation: The picture shows a person standing in a place surrounded by a blue ocean, reaching out to touch a flock of birds that forms an abstract line across the sky. The painting is watercolor in style, dreamy and tranquil, mainly blue and white: blue for the sea and sky, white for the clouds and the birds.]
llama_print_timings: load time = 11091.30 ms
llama_print_timings: sample time = 5.67 ms / 74 runs ( 0.08 ms per token, 13053.45 tokens per second)
llama_print_timings: prompt eval time = 8420.87 ms / 80 tokens ( 105.26 ms per token, 9.50 tokens per second)
llama_print_timings: eval time = 2901.03 ms / 73 runs ( 39.74 ms per token, 25.16 tokens per second)
llama_print_timings: total time = 14061.83 ms / 153 tokens
ggml_metal_free: deallocating
ggml_metal_free: deallocating
@Achazwl Can you help test the model I quantized? Link here: https://huggingface.co/mzwing/MiniCPM-V-2-GGUF
The link you provided only contains fp16 models
The mmproj gguf model is actually there, I just renamed it :)
Link to the mmproj gguf model: https://huggingface.co/mzwing/MiniCPM-V-2-GGUF/blob/main/MiniCPM-V-2-mmproj.F16.gguf
Also correct:
./minicpmv-cli -m ../MiniCPM-V-2-GGUF/MiniCPM-V-2.F16.gguf --mmproj ../MiniCPM-V-2-GGUF/MiniCPM-V-2-mmproj.F16.gguf -c 4096 --temp 0.6 --top-p 0.8 --top-k 100 --repeat-penalty 1.0 --image ../mzwing.jpg -p "这张图里有什么?"
Log start
clip_model_load: description: image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment: 32
clip_model_load: n_tensors: 440
clip_model_load: n_kv: 18
clip_model_load: ftype: f16
clip_model_load: loaded meta data with 18 key-value pairs and 440 tensors from ../MiniCPM-V-2-GGUF/MiniCPM-V-2-mmproj.F16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: clip.has_text_encoder bool = false
clip_model_load: - kv 2: clip.has_vision_encoder bool = true
clip_model_load: - kv 3: clip.has_llava_projector bool = true
clip_model_load: - kv 4: general.file_type u32 = 1
clip_model_load: - kv 5: general.description str = image encoder for LLaVA
clip_model_load: - kv 6: clip.projector_type str = resampler
clip_model_load: - kv 7: clip.vision.image_size u32 = 448
clip_model_load: - kv 8: clip.vision.patch_size u32 = 14
clip_model_load: - kv 9: clip.vision.embedding_length u32 = 1152
clip_model_load: - kv 10: clip.vision.feed_forward_length u32 = 4304
clip_model_load: - kv 11: clip.vision.projection_dim u32 = 0
clip_model_load: - kv 12: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 13: clip.vision.attention.layer_norm_epsilon f32 = 0.000001
clip_model_load: - kv 14: clip.vision.block_count u32 = 26
clip_model_load: - kv 15: clip.vision.image_mean arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 16: clip.vision.image_std arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 17: clip.use_gelu bool = true
clip_model_load: - type f32: 277 tensors
clip_model_load: - type f16: 163 tensors
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/acha/Desktop/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M1 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 22906.50 MB
clip_model_load: CLIP using Metal backend
clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 1
clip_model_load: model size: 828.18 MB
clip_model_load: metadata size: 0.17 MB
clip_model_load: params backend buffer size = 828.18 MB (440 tensors)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 828.19 MiB, ( 829.19 / 21845.34)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 88.81 MiB, ( 918.00 / 21845.34)
clip_model_load: compute allocated memory: 88.80 MB
llama_model_loader: loaded meta data with 22 key-value pairs and 363 tensors from ../MiniCPM-V-2-GGUF/MiniCPM-V-2.F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = minicpm
llama_model_loader: - kv 1: general.name str = MiniCPM
llama_model_loader: - kv 2: minicpm.context_length u32 = 4096
llama_model_loader: - kv 3: minicpm.embedding_length u32 = 2304
llama_model_loader: - kv 4: minicpm.block_count u32 = 40
llama_model_loader: - kv 5: minicpm.feed_forward_length u32 = 5760
llama_model_loader: - kv 6: minicpm.rope.dimension_count u32 = 64
llama_model_loader: - kv 7: minicpm.attention.head_count u32 = 36
llama_model_loader: - kv 8: minicpm.attention.head_count_kv u32 = 36
llama_model_loader: - kv 9: minicpm.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 1
llama_model_loader: - kv 11: minicpm.tie_lm_head bool = false
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,122753] = ["<unk>", "<s>", "</s>", "<SEP>", "<C...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,122753] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,122753] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type f16: 282 tensors
llm_load_vocab: mismatch in special tokens definition ( 3528/122753 vs 271/122753 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = minicpm
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 122753
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 2304
llm_load_print_meta: n_head = 36
llm_load_print_meta: n_head_kv = 36
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 2304
llm_load_print_meta: n_embd_v_gqa = 2304
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 5760
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 2B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 3.01 B
llm_load_print_meta: model size = 5.60 GiB (16.00 BPW)
llm_load_print_meta: general.name = MiniCPM
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 1099 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.37 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 5197.67 MiB, ( 6115.67 / 21845.34)
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: Metal buffer size = 5197.66 MiB
llm_load_tensors: CPU buffer size = 539.44 MiB
...................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/acha/Desktop/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M1 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 22906.50 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 1440.00 MiB, ( 7556.42 / 21845.34)
llama_kv_cache_init: Metal KV buffer size = 1440.00 MiB
llama_new_context_with_model: KV self size = 1440.00 MiB, K (f16): 720.00 MiB, V (f16): 720.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.47 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 314.02 MiB, ( 7870.44 / 21845.34)
llama_new_context_with_model: Metal compute buffer size = 314.00 MiB
llama_new_context_with_model: CPU compute buffer size = 12.51 MiB
llama_new_context_with_model: graph nodes = 1368
llama_new_context_with_model: graph splits = 2
encode_image_with_clip: image embedding created: 64 tokens
encode_image_with_clip: image encoded in 2460.77 ms by CLIP ( 38.45 ms per image patch)
slice_image: multiple 1
<用户>这张图里有什么?
<AI>
这张图片描绘了一个人站在一个看起来像是[…]的地方，朝向天空。这个人似乎正在伸手向天空，可能是在试图捕捉或触摸星星或鸟儿。天空是深蓝色的，点缀着许多星星和散落的鸟群，给人一种浩瀚和宁静的感觉。这幅画采用了水彩画风格，柔和的水彩笔触营造出一种梦幻般、略带忧郁的氛围。
[Translation: The picture depicts a person standing somewhere, facing the sky, seemingly reaching up, perhaps trying to catch or touch the stars or birds. The sky is deep blue, dotted with many stars and scattered birds, giving a vast, tranquil feeling. The painting uses a watercolor style; the soft brushwork creates a dreamy, slightly melancholy atmosphere.]
llama_print_timings: load time = 9274.04 ms
llama_print_timings: sample time = 5.80 ms / 76 runs ( 0.08 ms per token, 13103.45 tokens per second)
llama_print_timings: prompt eval time = 6685.98 ms / 80 tokens ( 83.57 ms per token, 11.97 tokens per second)
llama_print_timings: eval time = 2971.87 ms / 75 runs ( 39.62 ms per token, 25.24 tokens per second)
llama_print_timings: total time = 12316.09 ms / 155 tokens
ggml_metal_free: deallocating
ggml_metal_free: deallocating
I did some further tests.
When I use only the CPU, the model's output is very, very normal. However, when I switch to the GPU, the model seems... mad.
Tested on Google Colab (T4 GPU).
Log:
> ./minicpmv-cli -ngl 35 -m ./MiniCPM-V-2-GGUF/MiniCPM-V-2.Q2_K.gguf --mmproj ./MiniCPM-V-2-GGUF/MiniCPM-V-2-mmproj.F16.gguf -c 4096 --temp 0.6 --top-p 0.8 --top-k 100 --repeat-penalty 1.0 --image ./mzwing.jpg -p "这张图里有什么?"
Log start
clip_model_load: description: image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment: 32
clip_model_load: n_tensors: 440
clip_model_load: n_kv: 18
clip_model_load: ftype: f16
clip_model_load: loaded meta data with 18 key-value pairs and 440 tensors from ./MiniCPM-V-2-GGUF/MiniCPM-V-2-mmproj.F16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: clip.has_text_encoder bool = false
clip_model_load: - kv 2: clip.has_vision_encoder bool = true
clip_model_load: - kv 3: clip.has_llava_projector bool = true
clip_model_load: - kv 4: general.file_type u32 = 1
clip_model_load: - kv 5: general.description str = image encoder for LLaVA
clip_model_load: - kv 6: clip.projector_type str = resampler
clip_model_load: - kv 7: clip.vision.image_size u32 = 448
clip_model_load: - kv 8: clip.vision.patch_size u32 = 14
clip_model_load: - kv 9: clip.vision.embedding_length u32 = 1152
clip_model_load: - kv 10: clip.vision.feed_forward_length u32 = 4304
clip_model_load: - kv 11: clip.vision.projection_dim u32 = 0
clip_model_load: - kv 12: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 13: clip.vision.attention.layer_norm_epsilon f32 = 0.000001
clip_model_load: - kv 14: clip.vision.block_count u32 = 26
clip_model_load: - kv 15: clip.vision.image_mean arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 16: clip.vision.image_std arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 17: clip.use_gelu bool = true
clip_model_load: - type f32: 277 tensors
clip_model_load: - type f16: 163 tensors
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: Tesla T4, compute capability 7.5, VMM: yes
clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 1
clip_model_load: model size: 828.18 MB
clip_model_load: metadata size: 0.17 MB
clip_model_load: params backend buffer size = 828.18 MB (440 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 88.80 MB
llama_model_loader: loaded meta data with 23 key-value pairs and 363 tensors from ./MiniCPM-V-2-GGUF/MiniCPM-V-2.Q2_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = minicpm
llama_model_loader: - kv 1: general.name str = MiniCPM
llama_model_loader: - kv 2: minicpm.context_length u32 = 4096
llama_model_loader: - kv 3: minicpm.embedding_length u32 = 2304
llama_model_loader: - kv 4: minicpm.block_count u32 = 40
llama_model_loader: - kv 5: minicpm.feed_forward_length u32 = 5760
llama_model_loader: - kv 6: minicpm.rope.dimension_count u32 = 64
llama_model_loader: - kv 7: minicpm.attention.head_count u32 = 36
llama_model_loader: - kv 8: minicpm.attention.head_count_kv u32 = 36
llama_model_loader: - kv 9: minicpm.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 10
llama_model_loader: - kv 11: minicpm.tie_lm_head bool = false
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,122753] = ["<unk>", "<s>", "</s>", "<SEP>", "<C...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,122753] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,122753] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q2_K: 161 tensors
llama_model_loader: - type q3_K: 80 tensors
llama_model_loader: - type q6_K: 1 tensors
llama_model_loader: - type iq4_nl: 40 tensors
llm_load_vocab: mismatch in special tokens definition ( 3528/122753 vs 271/122753 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = minicpm
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 122753
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 2304
llm_load_print_meta: n_head = 36
llm_load_print_meta: n_head_kv = 36
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 2304
llm_load_print_meta: n_embd_v_gqa = 2304
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 5760
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 2B
llm_load_print_meta: model ftype = Q2_K - Medium
llm_load_print_meta: model params = 3.01 B
llm_load_print_meta: model size = 1.21 GiB (3.44 BPW)
llm_load_print_meta: general.name = MiniCPM
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 1099 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.37 MiB
llm_load_tensors: offloading 35 repeating layers to GPU
llm_load_tensors: offloaded 35/41 layers to GPU
llm_load_tensors: CPU buffer size = 1234.38 MiB
llm_load_tensors: CUDA0 buffer size = 809.07 MiB
.............................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 180.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 1260.00 MiB
llama_new_context_with_model: KV self size = 1440.00 MiB, K (f16): 720.00 MiB, V (f16): 720.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.47 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 465.51 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 17.01 MiB
llama_new_context_with_model: graph nodes = 1368
llama_new_context_with_model: graph splits = 59
encode_image_with_clip: image embedding created: 64 tokens
encode_image_with_clip: image encoded in 394.83 ms by CLIP ( 6.17 ms per image patch)
slice_image: multiple 1
<用户>这张图里有什么?
<AI>
๏ผ
</br></h3><strong></h2></tr><br/><SEP>๏ผใ<li><SEP></h2><h3/>๏ผ--></h3>ใ</h1>๏ผ๏ผ๏ผ</br></strong>๏ผ๏ผ<h4/>=๏ผ</h3>๏ผ๏ผ</h2><tbody/>๏ผ<h3></img><h5>ใ๏ผ</h2><img/></h3><tr>๏ผ๏ฟฅ๏ผ<h4>ใ๏ผ๏ผ</h5><CLS><li></img></h3>ใใใ๏ผใ๏ผ
๏ผ<li>๏ผ<h2/>๏ผ๏ผ๏ผ<tr><h5>?</img><SEP></h1><h4/><CLS>=<h4>```.=<h3/>๏ผ๏ผ<!--๏ผ
๏ผ-๏ผ<td>๏ผใฃ<p/><p/><SEP>ใ<!--๏ฟฅ๏ผใใใ</td>ใใ...<li>
<h5></h5><h2/>ใฃ๏ผ</img>๏ผ๏ผ</h5><h3>๏ผ?.<strong></tr><tr></strong></tbody>๏ผ<h1>๏ผ-->?-ใ</li></tr><h3>๏ผ๏ผ<li/>.</h2><SEP>๏ผ</h2>?<table>๏ผ<br></tbody><h2/><!DOCTYPE>๏ฟฅ๏ผ``````=</img></h5><b><h5/>.<li>๏ฟฅใใ</li>-๏ผ?<li>๏ผ๏ผ
ใ<img/><br/></h1>๏ผใ<tr>.๏ผ<table></br></h1>๏ผ๏ผ!ใ</h2>ใฃ</h4></tbody>๏ฟฅ</li>ใใใใใใใใใใใใ<table/></br>-<li/>๏ผ๏ผ๏ผ๏ผ๏ผใ</h2>ใ<h5>๏ผ๏ผ<h4/>๏ผ
</li>๏ผ</strong>ใ</strong><br><h4/>-->ใ</h4>...<strong/>.<b/>--><tbody/><h4/>๏ผ๏ผ๏ผใ๏ผ<img/>ใ</strong>
๏ผใ<tr>-๏ผ
๏ผ</h5>๏ผ
<p/><h4/><h5><!DOCTYPE><table/>ใ๏ผ</h5></tr>
llama_print_timings: load time = 2538.88 ms
llama_print_timings: sample time = 32.09 ms / 256 runs ( 0.13 ms per token, 7978.56 tokens per second)
llama_print_timings: prompt eval time = 1714.47 ms / 80 tokens ( 21.43 ms per token, 46.66 tokens per second)
llama_print_timings: eval time = 19998.40 ms / 255 runs ( 78.43 ms per token, 12.75 tokens per second)
llama_print_timings: total time = 22909.54 ms / 335 tokens
The binary I compiled is here: https://github.com/MZWNET/actions/releases/tag/llama_cpp-minicpm-v-6c1c4b4
Link to models I quantized: https://huggingface.co/mzwing/MiniCPM-V-2-GGUF
If you need it, my Jupyter Notebook is here: https://github.com/mzwing/AI-related/blob/master/notebooks/MiniCPM_V_2_GGUF.ipynb
So, it seems to be a GPU-related bug :(
So this may not be related to my PR? The correctness on CPU indicates that the conversion is correct.
I'm afraid not... This bug only appears when chatting with MiniCPM-V-2 using GPU...
Is the bug happening on LLaVA?
Oh, now I find that the llava-cli built from this PR cannot even load the model. It gives an "unable to load model" error.
~~For now only tested on GPU env.~~ See comment below.
So maybe that's the root cause? But the two errors seem too different.
Log:
> ./llava-cli -ngl 35 -m ./LLaVA-Llama-3-8B-Instruct-GGUF/llava-llama3-8b-Q4_K_M.gguf --mmproj ./LLaVA-Llama-3-8B-Instruct-GGUF/llava-llama3-mmproj-f16.gguf --temp 0.6 --top-p 0.8 --top-k 100 --repeat-penalty 1.0 --image ./mzwing.jpg -p "这张图里有什么?"
Log start
clip_model_load: model name: openai/clip-vit-large-patch14-336
clip_model_load: description: image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment: 32
clip_model_load: n_tensors: 377
clip_model_load: n_kv: 19
clip_model_load: ftype: f16
clip_model_load: loaded meta data with 19 key-value pairs and 377 tensors from ./LLaVA-Llama-3-8B-Instruct-GGUF/llava-llama3-mmproj-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: clip.has_text_encoder bool = false
clip_model_load: - kv 2: clip.has_vision_encoder bool = true
clip_model_load: - kv 3: clip.has_llava_projector bool = true
clip_model_load: - kv 4: general.file_type u32 = 1
clip_model_load: - kv 5: general.name str = openai/clip-vit-large-patch14-336
clip_model_load: - kv 6: general.description str = image encoder for LLaVA
clip_model_load: - kv 7: clip.projector_type str = mlp
clip_model_load: - kv 8: clip.vision.image_size u32 = 336
clip_model_load: - kv 9: clip.vision.patch_size u32 = 14
clip_model_load: - kv 10: clip.vision.embedding_length u32 = 1024
clip_model_load: - kv 11: clip.vision.feed_forward_length u32 = 4096
clip_model_load: - kv 12: clip.vision.projection_dim u32 = 768
clip_model_load: - kv 13: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 14: clip.vision.attention.layer_norm_epsilon f32 = 0.000010
clip_model_load: - kv 15: clip.vision.block_count u32 = 23
clip_model_load: - kv 16: clip.vision.image_mean arr[f32,3] = [0.481455, 0.457828, 0.408211]
clip_model_load: - kv 17: clip.vision.image_std arr[f32,3] = [0.268630, 0.261303, 0.275777]
clip_model_load: - kv 18: clip.use_gelu bool = false
clip_model_load: - type f32: 235 tensors
clip_model_load: - type f16: 142 tensors
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: Tesla T4, compute capability 7.5, VMM: yes
clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 1
clip_model_load: model size: 595.49 MB
clip_model_load: metadata size: 0.14 MB
clip_model_load: params backend buffer size = 595.49 MB (377 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 32.89 MB
llama_model_loader: loaded meta data with 23 key-value pairs and 291 tensors from ./LLaVA-Llama-3-8B-Instruct-GGUF/llava-llama3-8b-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = tmp
llama_model_loader: - kv 2: llama.vocab_size u32 = 128257
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.block_count u32 = 32
llama_model_loader: - kv 6: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 7: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 8: llama.attention.head_count u32 = 32
llama_model_loader: - kv 9: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 11: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 15: tokenizer.ggml.scores arr[f32,128257] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["ฤ ฤ ", "ฤ ฤ ฤ ฤ ", "ฤ ฤ ฤ ฤ ", "...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128001
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 128256
llama_model_loader: - kv 21: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens definition check successful ( 257/128257 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128257
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 4.58 GiB (4.89 BPW)
llm_load_print_meta: general.name = tmp
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128001 '<|end_of_text|>'
llm_load_print_meta: PAD token = 128256 '<pad>'
llm_load_print_meta: LF token = 128 'ร'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_tensors: ggml ctx size = 0.30 MiB
llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 291, got 290
llama_load_model_from_file: failed to load model
llava_init: error: unable to load model
main: error: failed to init llava
For now only tested on GPU env.
It's the same on CPU.
So it's quite confusing. Maybe you should update your branch?
llava-cli in the original llama.cpp repo (master branch) works as expected, in both CPU and GPU environments.
LLaVA is fixed; it was a side effect of my code. The new version of my PR has far fewer modifications outside the minicpm-v folder, so it will no longer affect other models.
The MiniCPM-V bug on GPU is rather hard to track down. I can reproduce the NaN issue on GPU, and here are my observations:
- The output of the ViT when processing images is aligned with the CPU version (which means the ViT part is correct).
- The output of the LLM when processing prompt text is aligned with the CPU version (which means LLM's computation is correct on GPU).
- However, when the output of the ViT is fed into the LLM as its input, NaN is produced.
- I finally found that once the output of the ViT is fed into the text model, it immediately becomes NaN; it has already turned into NaN at the input embedding stage (input_embed), before any TransformerBlock is computed.
- In the function where the ViT output (the LLM's input embedding) is copied from the CPU to the GPU (the `ggml_backend_cuda_buffer_set_tensor` function in `ggml-cuda.cu`), I added debug code that copies the input_embed back to the CPU. The result copied back is identical to the output of the ViT, with no NaN appearing. However, I can't figure out what else happens in the code between the "ViT output" and the "LLM input"; I can only find the CPU->GPU copy. If the NaN does not come from this stage, where does it come from? (A rough sketch of this round-trip check is shown after this list.)
- I also attempted to allocate a new buffer and copy the output of the ViT into it, to rule out an "access out of bounds" issue, but the result was still NaN.
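For anyone who wants to reproduce this kind of check outside of llama.cpp, below is a minimal, self-contained sketch of the debug idea from the last two bullets: upload the image embedding with plain CUDA runtime calls, copy it straight back, and verify that nothing became NaN. This is only an illustration of the round-trip check, not the actual `ggml_backend_cuda_buffer_set_tensor` code; the helper name `debug_check_embd_upload` and the tensor shape are made up for the example.

```cpp
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>
#include <vector>

// Round-trip the embedding through the GPU and check for NaN / corruption.
static bool debug_check_embd_upload(const float * host_embd, size_t n_elems) {
    float * dev_embd = nullptr;
    cudaMalloc((void **) &dev_embd, n_elems * sizeof(float));

    // Host -> device copy, analogous to what the backend set_tensor path does.
    cudaMemcpy(dev_embd, host_embd, n_elems * sizeof(float), cudaMemcpyHostToDevice);

    // Copy straight back and compare against the original ViT output.
    std::vector<float> roundtrip(n_elems);
    cudaMemcpy(roundtrip.data(), dev_embd, n_elems * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev_embd);

    for (size_t i = 0; i < n_elems; ++i) {
        if (std::isnan(roundtrip[i]) || roundtrip[i] != host_embd[i]) {
            std::fprintf(stderr, "NaN/mismatch at element %zu: host=%f device=%f\n",
                         i, host_embd[i], roundtrip[i]);
            return false;
        }
    }
    return true;
}

int main() {
    // Dummy "ViT output": 64 image tokens x 2304-dim embedding (sizes taken from the log above).
    std::vector<float> vit_out(64 * 2304, 0.5f);
    const bool ok = debug_check_embd_upload(vit_out.data(), vit_out.size());
    std::printf("upload check %s\n", ok ? "passed" : "failed");
    return 0;
}
```

In my case this check passes, which is why I suspect the NaN appears somewhere after the copy rather than in the copy itself.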
@cmp-nct Hey cmp-nct, could you please help us resolve this confusing issue? Thanks a lot!
I apologize if this has caused you confusion.
Hi, I gave it a try: after I quantized the model, the quality of the answers is much worse. Is there any way to fix this?