support MiniCPM-V-2
I created a folder called "minicpmv" in the examples folder of llama.cpp.
More details can be found in llama.cpp/examples/minicpmv/README.md.
The code is based on examples/llava, but the vision part is quite different.
llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 555 iterations
Expand details for performance related PR only
- Concurrent users: 8, duration: 10m
- HTTP request : avg=8400.3ms p(95)=19734.55ms fails=, finish reason: stop=511 truncated=44
- Prompt processing (pp): avg=103.65tk/s p(95)=452.32tk/s
- Token generation (tg): avg=34.33tk/s p(95)=46.58tk/s
- ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=feat-minicpmv commit=70a23863dcff7458839960b304ea166a401d4d8e
[Chart: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 555 iterations; y-axis: llamacpp:prompt_tokens_seconds; x-axis: time 1716473680 to 1716474300]
[Chart: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 555 iterations; y-axis: llamacpp:predicted_tokens_seconds; x-axis: time 1716473680 to 1716474300]
[Chart: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 555 iterations; y-axis: llamacpp:kv_cache_usage_ratio; x-axis: time 1716473680 to 1716474300]
[Chart: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 555 iterations; y-axis: llamacpp:requests_processing; x-axis: time 1716473680 to 1716474300]
Encountered some bugs when building this PR on Windows.
Log here: https://github.com/MZWNET/actions/actions/runs/8896316696/job/24430759871
Can 6c1c4b4 fix this?
Got another bug when converting the image encoder to GGUF.
Log:
python3 ./examples/minicpmv/convert-image-encoder-to-gguf.py -m ../MiniCPM-V-2 --llava-projector ../MiniCPM-V-2/llava.projector --output-dir ../MiniCPM-V-2-GGUF --image-mean 0.5 0.5 0.5 --image-std 0.5 0.5 0.5
gguf: This GGUF file is for Little Endian only
Traceback (most recent call last):
File "/content/llama.cpp/./examples/minicpmv/convert-image-encoder-to-gguf.py", line 295, in <module>
data = data.squeeze().numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
BTW it may be better if you can add device_map="auto" in minicpm-surgery.py#L12&L42 :)
It can take full advantage of a GPU :)
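A minimal sketch of that suggestion (the model path and the AutoModel class are assumptions, not the exact surgery-script code): loading the checkpoint with device_map="auto" lets transformers/accelerate place the layers on the GPU when one is available.

from transformers import AutoModel

# Hypothetical loading call for illustration; requires the `accelerate` package.
model = AutoModel.from_pretrained(
    "../MiniCPM-V-2",         # same -m path used elsewhere in this thread
    trust_remote_code=True,   # MiniCPM-V-2 ships custom modelling code
    device_map="auto",        # spread weights across GPU/CPU automatically
)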
I followed examples/llava/convert-image-encoder-to-gguf.py. It seems that they also don't use .cpu() there, and in my environment the model is loaded to CPU by default.
We should set it so the device defaults to CPU and can optionally be set to GPU if desired.
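One simple way to express that default (a sketch only; the --device flag is an assumption, not an existing option of the conversion script):

import argparse
import torch

ap = argparse.ArgumentParser()
ap.add_argument("--device", default="cpu", choices=["cpu", "cuda"],
                help="device to load the checkpoint on; CPU keeps the .numpy() conversion simple")
args = ap.parse_args()
device = torch.device(args.device)  # default stays on CPU; GPU only when explicitly requested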
I fixed my problem by editing minicpmv-surgery.py.
Edited version:
# store these tensors in a new dictionary and torch.save them
projector = {name: checkpoint[name].float().cpu() for name in mm_tensors}
torch.save(projector, f"{args.model}/llava.projector")

clip_tensors = [k for k, v in checkpoint.items() if k.startswith("vpm")]
if len(clip_tensors) > 0:
    clip = {name.replace("vpm.", ""): checkpoint[name].float().cpu() for name in clip_tensors}
    torch.save(clip, f"{args.model}/llava.clip")
I think it would be better to add the .cpu(), since it has no impact (maybe?) in a CPU-only environment and, with device_map="auto", we can make good use of the GPU.
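A small self-contained illustration of why the .cpu() call matters (not the actual script): Tensor.numpy() only works on CPU tensors, so a CUDA-resident tensor has to be moved first, while .cpu() is effectively a no-op for tensors that already live on the CPU.

import torch

data = torch.randn(1, 64, device="cuda" if torch.cuda.is_available() else "cpu")
# data.squeeze().numpy()             # raises TypeError when data is on cuda:0
arr = data.squeeze().cpu().numpy()   # works on both CPU-only and GPU machines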
Got another bug.
Log:
> python3 ./examples/minicpmv/minicpm-surgery.py -m ../MiniCPM-V-2
Loading checkpoint shards: 100% 2/2 [00:34<00:00, 17.07s/it]
Done!
Now you can convert ../MiniCPM-V-2 to a regular LLaMA GGUF file.
Also, use ../MiniCPM-V-2/llava.projector to prepare a llava-encoder.gguf file.
> python3 ./examples/minicpmv/convert-image-encoder-to-gguf.py -m ../MiniCPM-V-2 --llava-projector ../MiniCPM-V-2/llava.projector --output-dir ../MiniCPM-V-2-GGUF --image-mean 0.5 0.5 0.5 --image-std 0.5 0.5 0.5
gguf: This GGUF file is for Little Endian only
Converting to float32
resampler.pos_embed - f32 - shape = (64, 2304)
Converting to float32
...(too long, ignore)
v.post_ln.weight - f32 - shape = (1152,)
Converting to float32
v.post_ln.bias - f32 - shape = (1152,)
Done. Output file: ../MiniCPM-V-2-GGUF/mmproj-model-f16.gguf
> python3 ./convert-hf-to-gguf.py --outtype f16 --outfile ../MiniCPM-V-2-GGUF/MiniCPM-V-2.F16.gguf ../MiniCPM-V-2/MiniCPM
Loading model: MiniCPM
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
Traceback (most recent call last):
File "/content/llama.cpp/./convert-hf-to-gguf.py", line 2809, in <module>
main()
File "/content/llama.cpp/./convert-hf-to-gguf.py", line 2796, in main
model_instance.set_vocab()
File "/content/llama.cpp/./convert-hf-to-gguf.py", line 1645, in set_vocab
self._set_vocab_llama_hf()
File "/content/llama.cpp/./convert-hf-to-gguf.py", line 377, in _set_vocab_llama_hf
vocab = LlamaHfVocab(self.dir_model)
File "/content/llama.cpp/convert.py", line 523, in __init__
with open(fname_tokenizer, encoding='utf-8') as f:
FileNotFoundError: [Errno 2] No such file or directory: '../MiniCPM-V-2/MiniCPM/tokenizer.json'
> cp ../MiniCPM-V-2/tokenizer.json ../MiniCPM-V-2/MiniCPM/
> python3 ./convert-hf-to-gguf.py --outtype f16 --outfile ../MiniCPM-V-2-GGUF/MiniCPM-V-2.F16.gguf ../MiniCPM-V-2/MiniCPM
Loading model: MiniCPM
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
The repository for ../MiniCPM-V-2/MiniCPM contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co/../MiniCPM-V-2/MiniCPM.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.
Do you wish to run the custom code? [y/N] y
Traceback (most recent call last):
File "/content/llama.cpp/./convert-hf-to-gguf.py", line 2809, in <module>
main()
File "/content/llama.cpp/./convert-hf-to-gguf.py", line 2796, in main
model_instance.set_vocab()
File "/content/llama.cpp/./convert-hf-to-gguf.py", line 1645, in set_vocab
self._set_vocab_llama_hf()
File "/content/llama.cpp/./convert-hf-to-gguf.py", line 377, in _set_vocab_llama_hf
vocab = LlamaHfVocab(self.dir_model)
File "/content/llama.cpp/convert.py", line 556, in __init__
assert self.tokenizer.is_fast # assume tokenizer.json is used
AssertionError
Does your MiniCPM-V-2 folder have tokenizer.json? It is a newly uploaded file in https://huggingface.co/openbmb/MiniCPM-V-2/tree/main.
Yes, I confirm that.
Log:
> ls ../MiniCPM-V-2/ -alh
total 8.0G
drwxr-xr-x 4 root root 4.0K May 2 06:17 .
drwxr-xr-x 1 root root 4.0K May 2 06:16 ..
drwxr-xr-x 2 root root 4.0K May 2 06:13 assets
-rw-r--r-- 1 root root 1.2K May 2 06:13 config.json
-rw-r--r-- 1 root root 11K May 2 06:13 configuration_minicpm.py
-rw-r--r-- 1 root root 111 May 2 06:13 generation_config.json
-rw-r--r-- 1 root root 1.7K May 2 06:13 .gitattributes
-rw-r--r-- 1 root root 1.5G May 2 06:17 llava.clip
-rw-r--r-- 1 root root 113M May 2 06:17 llava.projector
drwxr-xr-x 2 root root 4.0K May 2 06:21 MiniCPM
-rw-r--r-- 1 root root 4.7G May 2 06:14 model-00001-of-00002.safetensors
-rw-r--r-- 1 root root 1.8G May 2 06:13 model-00002-of-00002.safetensors
-rw-r--r-- 1 root root 70K May 2 06:13 modeling_minicpm.py
-rw-r--r-- 1 root root 20K May 2 06:13 modeling_minicpmv.py
-rw-r--r-- 1 root root 54K May 2 06:13 model.safetensors.index.json
-rw-r--r-- 1 root root 9.2K May 2 06:13 README.md
-rw-r--r-- 1 root root 5.5K May 2 06:13 resampler.py
-rw-r--r-- 1 root root 651 May 2 06:13 special_tokens_map.json
-rw-r--r-- 1 root root 3.3K May 2 06:13 tokenizer_config.json
-rw-r--r-- 1 root root 6.0M May 2 06:13 tokenizer.json
-rw-r--r-- 1 root root 2.0M May 2 06:13 tokenizer.model
> ls ../MiniCPM-V-2/MiniCPM -alh
total 12G
drwxr-xr-x 2 root root 4.0K May 2 06:21 .
drwxr-xr-x 4 root root 4.0K May 2 06:17 ..
-rw-r--r-- 1 root root 1.5K May 2 06:17 config.json
-rw-r--r-- 1 root root 11K May 2 06:21 configuration_minicpm.py
-rw-r--r-- 1 root root 111 May 2 06:17 generation_config.json
-rw-r--r-- 1 root root 4.7G May 2 06:18 model-00001-of-00003.safetensors
-rw-r--r-- 1 root root 4.7G May 2 06:20 model-00002-of-00003.safetensors
-rw-r--r-- 1 root root 2.0G May 2 06:21 model-00003-of-00003.safetensors
-rw-r--r-- 1 root root 70K May 2 06:21 modeling_minicpm.py
-rw-r--r-- 1 root root 20K May 2 06:21 modeling_minicpmv.py
-rw-r--r-- 1 root root 30K May 2 06:21 model.safetensors.index.json
-rw-r--r-- 1 root root 5.5K May 2 06:21 resampler.py
-rw-r--r-- 1 root root 765 May 2 06:21 special_tokens_map.json
-rw-r--r-- 1 root root 3.4K May 2 06:21 tokenizer_config.json
-rw-r--r-- 1 root root 2.0M May 2 06:21 tokenizer.model
So it seems that the save_pretrained method in surgery.py does not save the tokenizer.json file. I manually copied tokenizer.json into the MiniCPM sub-folder before tokenizer.json was uploaded, and I wondered whether the manual copy was no longer needed now that it is uploaded. Seems I was wrong.
However, even if I copy it into the sub-folder, the conversion script still doesn't work :(
See here: https://github.com/ggerganov/llama.cpp/pull/6919#issuecomment-2089645626
cp ../MiniCPM-V-2/tokenizer.json ../MiniCPM-V-2/MiniCPM/
fixed.
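If the surgery script copied tokenizer.json itself, the manual cp above would not be needed. A minimal sketch of that idea (paths assumed to match this thread, not actual code from minicpmv-surgery.py):

import shutil
from pathlib import Path

model_dir = Path("../MiniCPM-V-2")                  # the -m path used above
src = model_dir / "tokenizer.json"                  # shipped in the HF repo
dst = model_dir / "MiniCPM" / "tokenizer.json"      # where convert-hf-to-gguf.py looks for it
if src.exists() and not dst.exists():
    shutil.copy2(src, dst)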
Converted successfully, thx!
However, I got a bad test result...
Log here:
> ./minicpmv-cli -ngl 1000000 -m ./MiniCPM-V-2-GGUF/MiniCPM-V-2.F16.gguf --mmproj ./MiniCPM-V-2-GGUF/mmproj-model-f16.gguf -c 4096 --temp 0.6 --top-p 0.8 --top-k 100 --repeat-penalty 1.0 --image ./mzwing.jpg -p "这张图里有什么?"
Log start
clip_model_load: description: image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment: 32
clip_model_load: n_tensors: 440
clip_model_load: n_kv: 18
clip_model_load: ftype: f16
clip_model_load: loaded meta data with 18 key-value pairs and 440 tensors from ./MiniCPM-V-2-GGUF/mmproj-model-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: clip.has_text_encoder bool = false
clip_model_load: - kv 2: clip.has_vision_encoder bool = true
clip_model_load: - kv 3: clip.has_llava_projector bool = true
clip_model_load: - kv 4: general.file_type u32 = 1
clip_model_load: - kv 5: general.description str = image encoder for LLaVA
clip_model_load: - kv 6: clip.projector_type str = resampler
clip_model_load: - kv 7: clip.vision.image_size u32 = 448
clip_model_load: - kv 8: clip.vision.patch_size u32 = 14
clip_model_load: - kv 9: clip.vision.embedding_length u32 = 1152
clip_model_load: - kv 10: clip.vision.feed_forward_length u32 = 4304
clip_model_load: - kv 11: clip.vision.projection_dim u32 = 0
clip_model_load: - kv 12: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 13: clip.vision.attention.layer_norm_epsilon f32 = 0.000001
clip_model_load: - kv 14: clip.vision.block_count u32 = 26
clip_model_load: - kv 15: clip.vision.image_mean arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 16: clip.vision.image_std arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 17: clip.use_gelu bool = true
clip_model_load: - type f32: 277 tensors
clip_model_load: - type f16: 163 tensors
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: Tesla T4, compute capability 7.5, VMM: yes
clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 1
clip_model_load: model size: 828.18 MB
clip_model_load: metadata size: 0.17 MB
clip_model_load: params backend buffer size = 828.18 MB (440 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 88.80 MB
llama_model_loader: loaded meta data with 22 key-value pairs and 363 tensors from ./MiniCPM-V-2-GGUF/MiniCPM-V-2.F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = minicpm
llama_model_loader: - kv 1: general.name str = MiniCPM
llama_model_loader: - kv 2: minicpm.context_length u32 = 4096
llama_model_loader: - kv 3: minicpm.embedding_length u32 = 2304
llama_model_loader: - kv 4: minicpm.block_count u32 = 40
llama_model_loader: - kv 5: minicpm.feed_forward_length u32 = 5760
llama_model_loader: - kv 6: minicpm.rope.dimension_count u32 = 64
llama_model_loader: - kv 7: minicpm.attention.head_count u32 = 36
llama_model_loader: - kv 8: minicpm.attention.head_count_kv u32 = 36
llama_model_loader: - kv 9: minicpm.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 1
llama_model_loader: - kv 11: minicpm.tie_lm_head bool = false
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,122753] = ["<unk>", "<s>", "</s>", "<SEP>", "<C...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,122753] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,122753] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type f16: 282 tensors
llm_load_vocab: mismatch in special tokens definition ( 3528/122753 vs 271/122753 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = minicpm
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 122753
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 2304
llm_load_print_meta: n_head = 36
llm_load_print_meta: n_head_kv = 36
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 2304
llm_load_print_meta: n_embd_v_gqa = 2304
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 5760
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 2B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 3.01 B
llm_load_print_meta: model size = 5.60 GiB (16.00 BPW)
llm_load_print_meta: general.name = MiniCPM
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 1099 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.37 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: CPU buffer size = 539.44 MiB
llm_load_tensors: CUDA0 buffer size = 5197.65 MiB
....................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 1440.00 MiB
llama_new_context_with_model: KV self size = 1440.00 MiB, K (f16): 720.00 MiB, V (f16): 720.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.47 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 314.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 12.51 MiB
llama_new_context_with_model: graph nodes = 1368
llama_new_context_with_model: graph splits = 2
encode_image_with_clip: image embedding created: 64 tokens
encode_image_with_clip: image encoded in 455.73 ms by CLIP ( 7.12 ms per image patch)
slice_image: multiple 1
<用户>这张图里有什么?
<AI>
。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。
llama_print_timings: load time = 27998.25 ms
llama_print_timings: sample time = 35.94 ms / 256 runs ( 0.14 ms per token, 7122.19 tokens per second)
llama_print_timings: prompt eval time = 196.38 ms / 80 tokens ( 2.45 ms per token, 407.36 tokens per second)
llama_print_timings: eval time = 7742.32 ms / 255 runs ( 30.36 ms per token, 32.94 tokens per second)
llama_print_timings: total time = 36056.03 ms / 335 tokens
The image I used for testing is my GitHub avatar.
My log is here; I can not reproduce your result:
./minicpmv-cli -m ../MiniCPM-V-2/MiniCPM/ggml-model-f16.gguf --mmproj ../MiniCPM-V-2/mmproj-model-f16.gguf -c 4096 --temp 0.6 --top-p 0.8 --top-k 100 --repeat-penalty 1.0 --image ../mzwing.jpg -p "这张图里有什么?"
Log start
clip_model_load: description: image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment: 32
clip_model_load: n_tensors: 440
clip_model_load: n_kv: 18
clip_model_load: ftype: f16
clip_model_load: loaded meta data with 18 key-value pairs and 440 tensors from ../MiniCPM-V-2/mmproj-model-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: clip.has_text_encoder bool = false
clip_model_load: - kv 2: clip.has_vision_encoder bool = true
clip_model_load: - kv 3: clip.has_llava_projector bool = true
clip_model_load: - kv 4: general.file_type u32 = 1
clip_model_load: - kv 5: general.description str = image encoder for LLaVA
clip_model_load: - kv 6: clip.projector_type str = resampler
clip_model_load: - kv 7: clip.vision.image_size u32 = 448
clip_model_load: - kv 8: clip.vision.patch_size u32 = 14
clip_model_load: - kv 9: clip.vision.embedding_length u32 = 1152
clip_model_load: - kv 10: clip.vision.feed_forward_length u32 = 4304
clip_model_load: - kv 11: clip.vision.projection_dim u32 = 0
clip_model_load: - kv 12: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 13: clip.vision.attention.layer_norm_epsilon f32 = 0.000001
clip_model_load: - kv 14: clip.vision.block_count u32 = 26
clip_model_load: - kv 15: clip.vision.image_mean arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 16: clip.vision.image_std arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 17: clip.use_gelu bool = true
clip_model_load: - type f32: 277 tensors
clip_model_load: - type f16: 163 tensors
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/acha/Desktop/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M1 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 22906.50 MB
clip_model_load: CLIP using Metal backend
clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 1
clip_model_load: model size: 828.18 MB
clip_model_load: metadata size: 0.17 MB
clip_model_load: params backend buffer size = 828.18 MB (440 tensors)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 828.19 MiB, ( 829.19 / 21845.34)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 88.81 MiB, ( 918.00 / 21845.34)
clip_model_load: compute allocated memory: 88.80 MB
llama_model_loader: loaded meta data with 22 key-value pairs and 363 tensors from ../MiniCPM-V-2/MiniCPM/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = minicpm
llama_model_loader: - kv 1: general.name str = MiniCPM
llama_model_loader: - kv 2: minicpm.context_length u32 = 4096
llama_model_loader: - kv 3: minicpm.embedding_length u32 = 2304
llama_model_loader: - kv 4: minicpm.block_count u32 = 40
llama_model_loader: - kv 5: minicpm.feed_forward_length u32 = 5760
llama_model_loader: - kv 6: minicpm.rope.dimension_count u32 = 64
llama_model_loader: - kv 7: minicpm.attention.head_count u32 = 36
llama_model_loader: - kv 8: minicpm.attention.head_count_kv u32 = 36
llama_model_loader: - kv 9: minicpm.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 1
llama_model_loader: - kv 11: minicpm.tie_lm_head bool = false
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,122753] = ["<unk>", "<s>", "</s>", "<SEP>", "<C...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,122753] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,122753] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type f16: 282 tensors
llm_load_vocab: mismatch in special tokens definition ( 3528/122753 vs 271/122753 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = minicpm
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 122753
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 2304
llm_load_print_meta: n_head = 36
llm_load_print_meta: n_head_kv = 36
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 2304
llm_load_print_meta: n_embd_v_gqa = 2304
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 5760
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 2B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 3.01 B
llm_load_print_meta: model size = 5.60 GiB (16.00 BPW)
llm_load_print_meta: general.name = MiniCPM
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 1099 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.37 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 5197.67 MiB, ( 6115.67 / 21845.34)
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: Metal buffer size = 5197.66 MiB
llm_load_tensors: CPU buffer size = 539.44 MiB
...................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/acha/Desktop/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M1 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 22906.50 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 1440.00 MiB, ( 7556.42 / 21845.34)
llama_kv_cache_init: Metal KV buffer size = 1440.00 MiB
llama_new_context_with_model: KV self size = 1440.00 MiB, K (f16): 720.00 MiB, V (f16): 720.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.47 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 314.02 MiB, ( 7870.44 / 21845.34)
llama_new_context_with_model: Metal compute buffer size = 314.00 MiB
llama_new_context_with_model: CPU compute buffer size = 12.51 MiB
llama_new_context_with_model: graph nodes = 1368
llama_new_context_with_model: graph splits = 2
encode_image_with_clip: image embedding created: 64 tokens
encode_image_with_clip: image encoded in 2532.64 ms by CLIP ( 39.57 ms per image patch)
slice_image: multiple 1
<用户>这张图里有什么?
<AI>
这幅图片展示了一个人站在一个看起来[…]的地方，周围是蓝色的海洋。这个人正在伸手去触碰天空中的鸟群，这些鸟群以一种抽象的方式排列成一条线。这幅画的风格是水彩，给人一种梦幻、宁静的感觉。颜色以蓝色和白色为主，蓝色象征着海洋和天空，白色则代表云彩和鸟群。
[Translation: The picture shows a person standing in a place surrounded by a blue ocean, reaching out to touch a flock of birds that forms an abstract line across the sky. The painting is watercolor in style, dreamy and tranquil, mainly blue and white: blue for the sea and sky, white for the clouds and the birds.]
llama_print_timings: load time = 11091.30 ms
llama_print_timings: sample time = 5.67 ms / 74 runs ( 0.08 ms per token, 13053.45 tokens per second)
llama_print_timings: prompt eval time = 8420.87 ms / 80 tokens ( 105.26 ms per token, 9.50 tokens per second)
llama_print_timings: eval time = 2901.03 ms / 73 runs ( 39.74 ms per token, 25.16 tokens per second)
llama_print_timings: total time = 14061.83 ms / 153 tokens
ggml_metal_free: deallocating
ggml_metal_free: deallocating
@Achazwl Can you help test the model I quantized? Link here: https://huggingface.co/mzwing/MiniCPM-V-2-GGUF
The link you provided only contains fp16 models
The mmproj gguf model is actually there, I just renamed it :)
Link to the mmproj gguf model: https://huggingface.co/mzwing/MiniCPM-V-2-GGUF/blob/main/MiniCPM-V-2-mmproj.F16.gguf
Also correct:
./minicpmv-cli -m ../MiniCPM-V-2-GGUF/MiniCPM-V-2.F16.gguf --mmproj ../MiniCPM-V-2-GGUF/MiniCPM-V-2-mmproj.F16.gguf -c 4096 --temp 0.6 --top-p 0.8 --top-k 100 --repeat-penalty 1.0 --image ../mzwing.jpg -p "这张图里有什么?"
Log start
clip_model_load: description: image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment: 32
clip_model_load: n_tensors: 440
clip_model_load: n_kv: 18
clip_model_load: ftype: f16
clip_model_load: loaded meta data with 18 key-value pairs and 440 tensors from ../MiniCPM-V-2-GGUF/MiniCPM-V-2-mmproj.F16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: clip.has_text_encoder bool = false
clip_model_load: - kv 2: clip.has_vision_encoder bool = true
clip_model_load: - kv 3: clip.has_llava_projector bool = true
clip_model_load: - kv 4: general.file_type u32 = 1
clip_model_load: - kv 5: general.description str = image encoder for LLaVA
clip_model_load: - kv 6: clip.projector_type str = resampler
clip_model_load: - kv 7: clip.vision.image_size u32 = 448
clip_model_load: - kv 8: clip.vision.patch_size u32 = 14
clip_model_load: - kv 9: clip.vision.embedding_length u32 = 1152
clip_model_load: - kv 10: clip.vision.feed_forward_length u32 = 4304
clip_model_load: - kv 11: clip.vision.projection_dim u32 = 0
clip_model_load: - kv 12: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 13: clip.vision.attention.layer_norm_epsilon f32 = 0.000001
clip_model_load: - kv 14: clip.vision.block_count u32 = 26
clip_model_load: - kv 15: clip.vision.image_mean arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 16: clip.vision.image_std arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 17: clip.use_gelu bool = true
clip_model_load: - type f32: 277 tensors
clip_model_load: - type f16: 163 tensors
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/acha/Desktop/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M1 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 22906.50 MB
clip_model_load: CLIP using Metal backend
clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 1
clip_model_load: model size: 828.18 MB
clip_model_load: metadata size: 0.17 MB
clip_model_load: params backend buffer size = 828.18 MB (440 tensors)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 828.19 MiB, ( 829.19 / 21845.34)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 88.81 MiB, ( 918.00 / 21845.34)
clip_model_load: compute allocated memory: 88.80 MB
llama_model_loader: loaded meta data with 22 key-value pairs and 363 tensors from ../MiniCPM-V-2-GGUF/MiniCPM-V-2.F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = minicpm
llama_model_loader: - kv 1: general.name str = MiniCPM
llama_model_loader: - kv 2: minicpm.context_length u32 = 4096
llama_model_loader: - kv 3: minicpm.embedding_length u32 = 2304
llama_model_loader: - kv 4: minicpm.block_count u32 = 40
llama_model_loader: - kv 5: minicpm.feed_forward_length u32 = 5760
llama_model_loader: - kv 6: minicpm.rope.dimension_count u32 = 64
llama_model_loader: - kv 7: minicpm.attention.head_count u32 = 36
llama_model_loader: - kv 8: minicpm.attention.head_count_kv u32 = 36
llama_model_loader: - kv 9: minicpm.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 1
llama_model_loader: - kv 11: minicpm.tie_lm_head bool = false
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,122753] = ["<unk>", "<s>", "</s>", "<SEP>", "<C...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,122753] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,122753] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type f16: 282 tensors
llm_load_vocab: mismatch in special tokens definition ( 3528/122753 vs 271/122753 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = minicpm
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 122753
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 2304
llm_load_print_meta: n_head = 36
llm_load_print_meta: n_head_kv = 36
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 2304
llm_load_print_meta: n_embd_v_gqa = 2304
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 5760
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 2B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 3.01 B
llm_load_print_meta: model size = 5.60 GiB (16.00 BPW)
llm_load_print_meta: general.name = MiniCPM
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 1099 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.37 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 5197.67 MiB, ( 6115.67 / 21845.34)
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: Metal buffer size = 5197.66 MiB
llm_load_tensors: CPU buffer size = 539.44 MiB
...................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/acha/Desktop/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M1 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 22906.50 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 1440.00 MiB, ( 7556.42 / 21845.34)
llama_kv_cache_init: Metal KV buffer size = 1440.00 MiB
llama_new_context_with_model: KV self size = 1440.00 MiB, K (f16): 720.00 MiB, V (f16): 720.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.47 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 314.02 MiB, ( 7870.44 / 21845.34)
llama_new_context_with_model: Metal compute buffer size = 314.00 MiB
llama_new_context_with_model: CPU compute buffer size = 12.51 MiB
llama_new_context_with_model: graph nodes = 1368
llama_new_context_with_model: graph splits = 2
encode_image_with_clip: image embedding created: 64 tokens
encode_image_with_clip: image encoded in 2460.77 ms by CLIP ( 38.45 ms per image patch)
slice_image: multiple 1
<用户>这张图里有什么?
<AI>
这张图片描绘了一个人站在一个看起来像是[…]的地方，朝向天空。这个人似乎正在伸手向天空，可能是在试图捕捉或触摸星星或鸟儿。天空是深蓝色的，点缀着许多星星和散落的鸟群，给人一种浩瀚和宁静的感觉。这幅画采用了水彩画风格，柔和的水彩笔触营造出一种梦幻般、略带忧郁的氛围。
[Translation: The picture depicts a person standing somewhere, facing the sky, seemingly reaching up, perhaps trying to catch or touch the stars or birds. The sky is deep blue, dotted with many stars and scattered birds, giving a vast, tranquil feeling. The painting uses a watercolor style; the soft brushwork creates a dreamy, slightly melancholy atmosphere.]
llama_print_timings: load time = 9274.04 ms
llama_print_timings: sample time = 5.80 ms / 76 runs ( 0.08 ms per token, 13103.45 tokens per second)
llama_print_timings: prompt eval time = 6685.98 ms / 80 tokens ( 83.57 ms per token, 11.97 tokens per second)
llama_print_timings: eval time = 2971.87 ms / 75 runs ( 39.62 ms per token, 25.24 tokens per second)
llama_print_timings: total time = 12316.09 ms / 155 tokens
ggml_metal_free: deallocating
ggml_metal_free: deallocating
I did some further tests.
When I use only the CPU, the model's output is very, very normal. However, when I switch to the GPU, the model seems... mad.
Tested on Google Colab (T4 GPU).
Log:
> ./minicpmv-cli -ngl 35 -m ./MiniCPM-V-2-GGUF/MiniCPM-V-2.Q2_K.gguf --mmproj ./MiniCPM-V-2-GGUF/MiniCPM-V-2-mmproj.F16.gguf -c 4096 --temp 0.6 --top-p 0.8 --top-k 100 --repeat-penalty 1.0 --image ./mzwing.jpg -p "这张图里有什么?"
Log start
clip_model_load: description: image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment: 32
clip_model_load: n_tensors: 440
clip_model_load: n_kv: 18
clip_model_load: ftype: f16
clip_model_load: loaded meta data with 18 key-value pairs and 440 tensors from ./MiniCPM-V-2-GGUF/MiniCPM-V-2-mmproj.F16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: clip.has_text_encoder bool = false
clip_model_load: - kv 2: clip.has_vision_encoder bool = true
clip_model_load: - kv 3: clip.has_llava_projector bool = true
clip_model_load: - kv 4: general.file_type u32 = 1
clip_model_load: - kv 5: general.description str = image encoder for LLaVA
clip_model_load: - kv 6: clip.projector_type str = resampler
clip_model_load: - kv 7: clip.vision.image_size u32 = 448
clip_model_load: - kv 8: clip.vision.patch_size u32 = 14
clip_model_load: - kv 9: clip.vision.embedding_length u32 = 1152
clip_model_load: - kv 10: clip.vision.feed_forward_length u32 = 4304
clip_model_load: - kv 11: clip.vision.projection_dim u32 = 0
clip_model_load: - kv 12: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 13: clip.vision.attention.layer_norm_epsilon f32 = 0.000001
clip_model_load: - kv 14: clip.vision.block_count u32 = 26
clip_model_load: - kv 15: clip.vision.image_mean arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 16: clip.vision.image_std arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 17: clip.use_gelu bool = true
clip_model_load: - type f32: 277 tensors
clip_model_load: - type f16: 163 tensors
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: Tesla T4, compute capability 7.5, VMM: yes
clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 1
clip_model_load: model size: 828.18 MB
clip_model_load: metadata size: 0.17 MB
clip_model_load: params backend buffer size = 828.18 MB (440 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 88.80 MB
llama_model_loader: loaded meta data with 23 key-value pairs and 363 tensors from ./MiniCPM-V-2-GGUF/MiniCPM-V-2.Q2_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = minicpm
llama_model_loader: - kv 1: general.name str = MiniCPM
llama_model_loader: - kv 2: minicpm.context_length u32 = 4096
llama_model_loader: - kv 3: minicpm.embedding_length u32 = 2304
llama_model_loader: - kv 4: minicpm.block_count u32 = 40
llama_model_loader: - kv 5: minicpm.feed_forward_length u32 = 5760
llama_model_loader: - kv 6: minicpm.rope.dimension_count u32 = 64
llama_model_loader: - kv 7: minicpm.attention.head_count u32 = 36
llama_model_loader: - kv 8: minicpm.attention.head_count_kv u32 = 36
llama_model_loader: - kv 9: minicpm.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 10
llama_model_loader: - kv 11: minicpm.tie_lm_head bool = false
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,122753] = ["<unk>", "<s>", "</s>", "<SEP>", "<C...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,122753] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,122753] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q2_K: 161 tensors
llama_model_loader: - type q3_K: 80 tensors
llama_model_loader: - type q6_K: 1 tensors
llama_model_loader: - type iq4_nl: 40 tensors
llm_load_vocab: mismatch in special tokens definition ( 3528/122753 vs 271/122753 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = minicpm
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 122753
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 2304
llm_load_print_meta: n_head = 36
llm_load_print_meta: n_head_kv = 36
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 2304
llm_load_print_meta: n_embd_v_gqa = 2304
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 5760
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 2B
llm_load_print_meta: model ftype = Q2_K - Medium
llm_load_print_meta: model params = 3.01 B
llm_load_print_meta: model size = 1.21 GiB (3.44 BPW)
llm_load_print_meta: general.name = MiniCPM
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 1099 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.37 MiB
llm_load_tensors: offloading 35 repeating layers to GPU
llm_load_tensors: offloaded 35/41 layers to GPU
llm_load_tensors: CPU buffer size = 1234.38 MiB
llm_load_tensors: CUDA0 buffer size = 809.07 MiB
.............................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 180.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 1260.00 MiB
llama_new_context_with_model: KV self size = 1440.00 MiB, K (f16): 720.00 MiB, V (f16): 720.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.47 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 465.51 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 17.01 MiB
llama_new_context_with_model: graph nodes = 1368
llama_new_context_with_model: graph splits = 59
encode_image_with_clip: image embedding created: 64 tokens
encode_image_with_clip: image encoded in 394.83 ms by CLIP ( 6.17 ms per image patch)
slice_image: multiple 1
<用户>这张图里有什么?
<AI>
๏ผ
</br></h3><strong></h2></tr><br/><SEP>๏ผใ<li><SEP></h2><h3/>๏ผ--></h3>ใ</h1>๏ผ๏ผ๏ผ</br></strong>๏ผ๏ผ<h4/>=๏ผ</h3>๏ผ๏ผ</h2><tbody/>๏ผ<h3></img><h5>ใ๏ผ</h2><img/></h3><tr>๏ผ๏ฟฅ๏ผ<h4>ใ๏ผ๏ผ</h5><CLS><li></img></h3>ใใใ๏ผใ๏ผ
๏ผ<li>๏ผ<h2/>๏ผ๏ผ๏ผ<tr><h5>?</img><SEP></h1><h4/><CLS>=<h4>```.=<h3/>๏ผ๏ผ<!--๏ผ
๏ผ-๏ผ<td>๏ผใฃ<p/><p/><SEP>ใ<!--๏ฟฅ๏ผใใใ</td>ใใ...<li>
<h5></h5><h2/>ใฃ๏ผ</img>๏ผ๏ผ</h5><h3>๏ผ?.<strong></tr><tr></strong></tbody>๏ผ<h1>๏ผ-->?-ใ</li></tr><h3>๏ผ๏ผ<li/>.</h2><SEP>๏ผ</h2>?<table>๏ผ<br></tbody><h2/><!DOCTYPE>๏ฟฅ๏ผ``````=</img></h5><b><h5/>.<li>๏ฟฅใใ</li>-๏ผ?<li>๏ผ๏ผ
ใ<img/><br/></h1>๏ผใ<tr>.๏ผ<table></br></h1>๏ผ๏ผ!ใ</h2>ใฃ</h4></tbody>๏ฟฅ</li>ใใใใใใใใใใใใ<table/></br>-<li/>๏ผ๏ผ๏ผ๏ผ๏ผใ</h2>ใ<h5>๏ผ๏ผ<h4/>๏ผ
</li>๏ผ</strong>ใ</strong><br><h4/>-->ใ</h4>...<strong/>.<b/>--><tbody/><h4/>๏ผ๏ผ๏ผใ๏ผ<img/>ใ</strong>
๏ผใ<tr>-๏ผ
๏ผ</h5>๏ผ
<p/><h4/><h5><!DOCTYPE><table/>ใ๏ผ</h5></tr>
llama_print_timings: load time = 2538.88 ms
llama_print_timings: sample time = 32.09 ms / 256 runs ( 0.13 ms per token, 7978.56 tokens per second)
llama_print_timings: prompt eval time = 1714.47 ms / 80 tokens ( 21.43 ms per token, 46.66 tokens per second)
llama_print_timings: eval time = 19998.40 ms / 255 runs ( 78.43 ms per token, 12.75 tokens per second)
llama_print_timings: total time = 22909.54 ms / 335 tokens
The binary I compiled is here: https://github.com/MZWNET/actions/releases/tag/llama_cpp-minicpm-v-6c1c4b4
Link to models I quantized: https://huggingface.co/mzwing/MiniCPM-V-2-GGUF
If you need it, my Jupyter Notebook is here: https://github.com/mzwing/AI-related/blob/master/notebooks/MiniCPM_V_2_GGUF.ipynb
So, it seems to be a GPU-related bug :(
So this may not be related to my PR? The correctness on CPU indicates that the conversion is correct.
I'm afraid not... This bug only appears when chatting with MiniCPM-V-2 using GPU...
Is the bug happening on LLaVA?
Oh, now I find that the llava-cli built from this PR cannot even load the model. It gives an "unable to load model" error.
~~For now only tested on GPU env.~~ See comment below.
So maybe that's the root cause? But the two errors seem too different.
Log:
> ./llava-cli -ngl 35 -m ./LLaVA-Llama-3-8B-Instruct-GGUF/llava-llama3-8b-Q4_K_M.gguf --mmproj ./LLaVA-Llama-3-8B-Instruct-GGUF/llava-llama3-mmproj-f16.gguf --temp 0.6 --top-p 0.8 --top-k 100 --repeat-penalty 1.0 --image ./mzwing.jpg -p "这张图里有什么?"
Log start
clip_model_load: model name: openai/clip-vit-large-patch14-336
clip_model_load: description: image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment: 32
clip_model_load: n_tensors: 377
clip_model_load: n_kv: 19
clip_model_load: ftype: f16
clip_model_load: loaded meta data with 19 key-value pairs and 377 tensors from ./LLaVA-Llama-3-8B-Instruct-GGUF/llava-llama3-mmproj-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: clip.has_text_encoder bool = false
clip_model_load: - kv 2: clip.has_vision_encoder bool = true
clip_model_load: - kv 3: clip.has_llava_projector bool = true
clip_model_load: - kv 4: general.file_type u32 = 1
clip_model_load: - kv 5: general.name str = openai/clip-vit-large-patch14-336
clip_model_load: - kv 6: general.description str = image encoder for LLaVA
clip_model_load: - kv 7: clip.projector_type str = mlp
clip_model_load: - kv 8: clip.vision.image_size u32 = 336
clip_model_load: - kv 9: clip.vision.patch_size u32 = 14
clip_model_load: - kv 10: clip.vision.embedding_length u32 = 1024
clip_model_load: - kv 11: clip.vision.feed_forward_length u32 = 4096
clip_model_load: - kv 12: clip.vision.projection_dim u32 = 768
clip_model_load: - kv 13: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 14: clip.vision.attention.layer_norm_epsilon f32 = 0.000010
clip_model_load: - kv 15: clip.vision.block_count u32 = 23
clip_model_load: - kv 16: clip.vision.image_mean arr[f32,3] = [0.481455, 0.457828, 0.408211]
clip_model_load: - kv 17: clip.vision.image_std arr[f32,3] = [0.268630, 0.261303, 0.275777]
clip_model_load: - kv 18: clip.use_gelu bool = false
clip_model_load: - type f32: 235 tensors
clip_model_load: - type f16: 142 tensors
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: Tesla T4, compute capability 7.5, VMM: yes
clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 1
clip_model_load: model size: 595.49 MB
clip_model_load: metadata size: 0.14 MB
clip_model_load: params backend buffer size = 595.49 MB (377 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 32.89 MB
llama_model_loader: loaded meta data with 23 key-value pairs and 291 tensors from ./LLaVA-Llama-3-8B-Instruct-GGUF/llava-llama3-8b-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = tmp
llama_model_loader: - kv 2: llama.vocab_size u32 = 128257
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.block_count u32 = 32
llama_model_loader: - kv 6: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 7: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 8: llama.attention.head_count u32 = 32
llama_model_loader: - kv 9: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 11: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 15: tokenizer.ggml.scores arr[f32,128257] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["ฤ ฤ ", "ฤ ฤ ฤ ฤ ", "ฤ ฤ ฤ ฤ ", "...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128001
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 128256
llama_model_loader: - kv 21: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens definition check successful ( 257/128257 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128257
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 4.58 GiB (4.89 BPW)
llm_load_print_meta: general.name = tmp
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128001 '<|end_of_text|>'
llm_load_print_meta: PAD token = 128256 '<pad>'
llm_load_print_meta: LF token = 128 'ร'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_tensors: ggml ctx size = 0.30 MiB
llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 291, got 290
llama_load_model_from_file: failed to load model
llava_init: error: unable to load model
main: error: failed to init llava
For now only tested on GPU env.
It's the same on CPU.
So it's quite confusing. Maybe you should update your branch?
llava-cli in the original llama.cpp repo (master branch) works as expected, in both CPU and GPU environments.
LLaVA is fixed; it was a side effect of my code. The new version of my PR has far fewer modifications outside the minicpm-v folder, so it will no longer affect other models.
The MiniCPM-V bug on GPU is rather hard to track down. I can reproduce the NaN issue on GPU, and here are my observations:
- The output of the ViT when processing images is aligned with the CPU version (which means the ViT part is correct).
- The output of the LLM when processing prompt text is aligned with the CPU version (which means LLM's computation is correct on GPU).
- However, when the output of the ViT is fed into the LLM as its input, NaN is produced.
- I finally found that once the output of the ViT is fed into the text model, it immediately becomes NaN; it has already turned into NaN at the input embedding stage (input_embed), before any TransformerBlock is computed.
- In the function where the ViT output (the LLM's input embedding) is copied from the CPU to the GPU (the `ggml_backend_cuda_buffer_set_tensor` function in `ggml-cuda.cu`), I added debug code that copies the input_embed back to the CPU. The result copied back is identical to the output of the ViT, with no NaN appearing. However, I can't figure out what else happens in the code between the "ViT output" and the "LLM input"; I can only find the CPU->GPU copy. If the NaN does not come from this stage, where does it come from? (A rough sketch of this round-trip check is shown after this list.)
- I also attempted to allocate a new buffer and copy the output of the ViT into it, to rule out an "access out of bounds" issue, but the result was still NaN.
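For anyone who wants to reproduce this kind of check outside of llama.cpp, below is a minimal, self-contained sketch of the debug idea from the last two bullets: upload the image embedding with plain CUDA runtime calls, copy it straight back, and verify that nothing became NaN. This is only an illustration of the round-trip check, not the actual `ggml_backend_cuda_buffer_set_tensor` code; the helper name `debug_check_embd_upload` and the tensor shape are made up for the example.

```cpp
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>
#include <vector>

// Round-trip the embedding through the GPU and check for NaN / corruption.
static bool debug_check_embd_upload(const float * host_embd, size_t n_elems) {
    float * dev_embd = nullptr;
    cudaMalloc((void **) &dev_embd, n_elems * sizeof(float));

    // Host -> device copy, analogous to what the backend set_tensor path does.
    cudaMemcpy(dev_embd, host_embd, n_elems * sizeof(float), cudaMemcpyHostToDevice);

    // Copy straight back and compare against the original ViT output.
    std::vector<float> roundtrip(n_elems);
    cudaMemcpy(roundtrip.data(), dev_embd, n_elems * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev_embd);

    for (size_t i = 0; i < n_elems; ++i) {
        if (std::isnan(roundtrip[i]) || roundtrip[i] != host_embd[i]) {
            std::fprintf(stderr, "NaN/mismatch at element %zu: host=%f device=%f\n",
                         i, host_embd[i], roundtrip[i]);
            return false;
        }
    }
    return true;
}

int main() {
    // Dummy "ViT output": 64 image tokens x 2304-dim embedding (sizes taken from the log above).
    std::vector<float> vit_out(64 * 2304, 0.5f);
    const bool ok = debug_check_embd_upload(vit_out.data(), vit_out.size());
    std::printf("upload check %s\n", ok ? "passed" : "failed");
    return 0;
}
```

In my case this check passes, which is why I suspect the NaN appears somewhere after the copy rather than in the copy itself.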
@cmp-nct Hey cmp-nct, could you please help us resolve this confusing issue? Thanks a lot!
I apologize if this has caused you confusion.
Hi, I gave it a try: after I quantized the model, the quality of the answers is much worse. Is there any way to fix this?