
Where to Get MiniCPM-V 4.5 to run with Ollama?

Open chigkim opened this issue 4 months ago • 7 comments

Readme: "MiniCPM-V 4.5 can be easily used in various ways: (1) llama.cpp and ollama support for efficient CPU inference on local devices."

Where can I get the model to run with Ollama? I only see MiniCPM-V 2.6 on Ollama library.

Thanks!

chigkim avatar Aug 25 '25 22:08 chigkim

MiniCPM-V 4.5 is now available on Ollama. You can access the model via this link. You can also run MiniCPM-V 4.5 directly on Ollama using this command: ollama run openbmb/minicpm-v4.5
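For a quick single-image test from the CLI (the file name below is just a placeholder), you can include a local image path directly in the prompt and it will be attached to the request:

ollama run openbmb/minicpm-v4.5 "What is in this image? ./example.jpg"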

ZMXJJ avatar Aug 26 '25 03:08 ZMXJJ

After running ollama run openbmb/minicpm-v4.5, it shows 500 Internal Server Error: llama runner process has terminated: exit status 0xc0000409.

fatton avatar Aug 26 '25 03:08 fatton

@chigkim @fatton https://github.com/tc-mb/ollama/tree/MIniCPM-V https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/ollama/minicpm-v4_5_ollama.md

We've just submitted a pull request, and it hasn't been merged into ollama yet. You can use the environment in this branch. For instructions, refer to this document.
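Roughly, the steps look like this (a sketch assuming a working Go toolchain; the linked document is the authoritative guide):

# build the branch from source
git clone -b MIniCPM-V https://github.com/tc-mb/ollama.git
cd ollama
go build .

# start the server, then pull and run the model from a second terminal
./ollama serve
./ollama run openbmb/minicpm-v4.5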

tc-mb avatar Aug 26 '25 06:08 tc-mb

How should I send video to Ollama? Can I break a video up into multiple images and send them via the Ollama API? If so, how many images per second?

chigkim avatar Aug 26 '25 22:08 chigkim

@chigkim

  1. Ollama doesn't have a video tag yet.
  2. You can split the video into frames (images).
  3. You can choose how many frames per second to extract.

tc-mb avatar Aug 27 '25 01:08 tc-mb

Thanks for letting me know that Ollama cannot process video directly. However, could I split a video into multiple images, send them to Ollama, and have it analyze them as a video clip? If so, how many images per second would need to be sent? Thanks!

chigkim avatar Aug 27 '25 13:08 chigkim

I think the video can be split into frames. How many frames to send per second depends on your task; if you need a high refresh rate, capture more frames and feed them to the model.
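As a rough sketch (file names, frame rate, and prompt are only examples), you could extract frames with ffmpeg and pass them as base64-encoded images to the Ollama chat API:

# extract 1 frame per second; raise fps= if your task needs finer temporal detail
ffmpeg -i clip.mp4 -vf fps=1 frame_%03d.jpg

# send several frames in one request as base64-encoded images
curl http://localhost:11434/api/chat -d "{
  \"model\": \"openbmb/minicpm-v4.5\",
  \"messages\": [{
    \"role\": \"user\",
    \"content\": \"These images are consecutive frames from a short video. Describe what happens.\",
    \"images\": [\"$(base64 -w0 frame_001.jpg)\", \"$(base64 -w0 frame_002.jpg)\"]
  }]
}"

Keep in mind that every frame consumes context, so sending too many frames in one request may exceed the model's context window.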

tc-mb avatar Aug 27 '25 13:08 tc-mb

@chigkim @fatton https://github.com/tc-mb/ollama/tree/MIniCPM-V https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/ollama/minicpm-v4_5_ollama.md

We've just submitted a pull request, and it hasn't been merged into ollama yet. You can use the environment in this branch. For instructions, refer to this document.

This locally compiled build performs much worse than the official install; every model generates tokens far more slowly. Could you release an installable package?

NormanMises avatar Sep 12 '25 02:09 NormanMises

@NormanMises That slowdown really shouldn't happen. May I ask what machine you compiled on? Could the compile options differ from those of the installed release?

tc-mb avatar Sep 12 '25 04:09 tc-mb

@NormanMises That slowdown really shouldn't happen. May I ask what machine you compiled on? Could the compile options differ from those of the installed release?

Ubuntu 20.04, A800, compiled from the latest code I pulled.

Image: the official install on the left, the compiled build on the right. What a gap.

NormanMises avatar Sep 12 '25 04:09 NormanMises

@NormanMises If you're using the multimodal model, I can help you debug. I'm not very clear on efficiency issues with the text-only models; you can open an issue at https://github.com/OpenBMB/MiniCPM.

tc-mb avatar Sep 12 '25 05:09 tc-mb

@NormanMises If you're using the multimodal model, I can help you debug. I'm not very clear on efficiency issues with the text-only models; you can open an issue at https://github.com/OpenBMB/MiniCPM.

MiniCPM-V 4.5 also generates tokens very slowly.

A message with an image attached produces no output even after several minutes.

Image

NormanMises avatar Sep 12 '25 06:09 NormanMises

@NormanMises With a normal deployment, taking several minutes shouldn't happen. It still feels like a build problem somewhere; could you check whether CUDA is actually being used for inference?
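Two quick checks, besides watching the ./ollama serve output for lines mentioning a CUDA vs. CPU backend:

ollama ps      # the PROCESSOR column shows whether the loaded model sits on GPU or CPU
nvidia-smi     # GPU memory and utilization should rise while a request is being processed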

tc-mb avatar Sep 12 '25 06:09 tc-mb

@NormanMises With a normal deployment, taking several minutes shouldn't happen. It still feels like a build problem somewhere; could you check whether CUDA is actually being used for inference?

Something is definitely wrong. The request goes through, but the model doesn't load.


llama_model_loader: loaded meta data with 25 key-value pairs and 399 tensors from/Ollama/models/blobs/sha256-c1c3c33100b15b4caf7319acce4e23c0eb0ce1cbd12f70e8d24f05aa67b7512f (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Model
llama_model_loader: - kv   3:                         general.size_label str              = 8.2B
llama_model_loader: - kv   4:                          qwen3.block_count u32              = 36
llama_model_loader: - kv   5:                       qwen3.context_length u32              = 40960
llama_model_loader: - kv   6:                     qwen3.embedding_length u32              = 4096
llama_model_loader: - kv   7:                  qwen3.feed_forward_length u32              = 12288
llama_model_loader: - kv   8:                 qwen3.attention.head_count u32              = 32
llama_model_loader: - kv   9:              qwen3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:                       qwen3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  12:                 qwen3.attention.key_length u32              = 128
llama_model_loader: - kv  13:               qwen3.attention.value_length u32              = 128
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,151748]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,151748]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - kv  24:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  145 tensors
llama_model_loader: - type q4_K:  217 tensors
llama_model_loader: - type q6_K:   37 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 4.68 GiB (4.90 BPW)
load: printing all EOG tokens:
load:   - 151643 ('<|endoftext|>')
load:   - 151645 ('<|im_end|>')
load:   - 151662 ('<|fim_pad|>')
load:   - 151663 ('<|repo_name|>')
load:   - 151664 ('<|file_sep|>')
load: special tokens cache size = 106
load: token to piece cache size = 0.9319 MB
print_info: arch             = qwen3
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 8.19 B
print_info: general.name     = Model
print_info: vocab type       = BPE
print_info: n_vocab          = 151748
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-09-12T14:51:13.944+08:00 level=INFO source=server.go:216 msg="enabling flash attention"
time=2025-09-12T14:51:13.945+08:00 level=WARN source=server.go:224 msg="kv cache type not supported by model" type=""
time=2025-09-12T14:51:13.945+08:00 level=INFO source=server.go:398 msg="starting runner" cmd="/22yanhongfei/ollama/ollama runner --model/Ollama/models/blobs/sha256-c1c3c33100b15b4caf7319acce4e23c0eb0ce1cbd12f70e8d24f05aa67b7512f --port 36825"
time=2025-09-12T14:51:13.964+08:00 level=INFO source=runner.go:864 msg="starting go runner"
time=2025-09-12T14:51:13.965+08:00 level=INFO source=ggml.go:104 msg=system CPU.0.LLAMAFILE=1 compiler=cgo(gcc)
time=2025-09-12T14:51:13.992+08:00 level=INFO source=runner.go:900 msg="Server listening on 127.0.0.1:36825"
time=2025-09-12T14:51:14.251+08:00 level=INFO source=server.go:503 msg="system memory" total="1007.5 GiB" free="976.6 GiB" free_swap="0 B"
time=2025-09-12T14:51:14.528+08:00 level=INFO source=memory.go:36 msg="new model will fit in available VRAM across minimum required GPUs, loading" model=/22yanhongfei/Ollama/models/blobs/sha256-c1c3c33100b15b4caf7319acce4e23c0eb0ce1cbd12f70e8d24f05aa67b7512f library=cuda parallel=1 required="6.9 GiB" gpus=1
time=2025-09-12T14:51:14.824+08:00 level=INFO source=server.go:543 msg=offload library=cuda layers.requested=-1 layers.model=37 layers.offload=37 layers.split=[37] memory.available="[63.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.9 GiB" memory.required.partial="6.9 GiB" memory.required.kv="576.0 MiB" memory.required.allocations="[6.9 GiB]" memory.weights.total="4.4 GiB" memory.weights.repeating="3.9 GiB" memory.weights.nonrepeating="486.3 MiB" memory.graph.full="384.0 MiB" memory.graph.partial="384.0 MiB" projector.weights="1.0 GiB" projector.graph="0 B"
time=2025-09-12T14:51:14.827+08:00 level=INFO source=runner.go:799 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:4096 KvCacheType: NumThreads:64 GPULayers:37[ID:GPU-a3761774-edd3-7b4a-faa8-ea26dfcd247f Layers:37(0..36)] MultiUserCache:false ProjectorPath:/22yanhongfei/Ollama/models/blobs/sha256-7a7225a32e8d453aaa3d22d8c579b5bf833c253f784cdb05c99c9a76fd616df8 MainGPU:0 UseMmap:true}"
time=2025-09-12T14:51:14.828+08:00 level=INFO source=server.go:1250 msg="waiting for llama runner to start responding"
time=2025-09-12T14:51:14.828+08:00 level=INFO source=server.go:1284 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 25 key-value pairs and 399 tensors from/Ollama/models/blobs/sha256-c1c3c33100b15b4caf7319acce4e23c0eb0ce1cbd12f70e8d24f05aa67b7512f (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Model
llama_model_loader: - kv   3:                         general.size_label str              = 8.2B
llama_model_loader: - kv   4:                          qwen3.block_count u32              = 36
llama_model_loader: - kv   5:                       qwen3.context_length u32              = 40960
llama_model_loader: - kv   6:                     qwen3.embedding_length u32              = 4096
llama_model_loader: - kv   7:                  qwen3.feed_forward_length u32              = 12288
llama_model_loader: - kv   8:                 qwen3.attention.head_count u32              = 32
llama_model_loader: - kv   9:              qwen3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:                       qwen3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  12:                 qwen3.attention.key_length u32              = 128
llama_model_loader: - kv  13:               qwen3.attention.value_length u32              = 128
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,151748]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,151748]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - kv  24:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  145 tensors
llama_model_loader: - type q4_K:  217 tensors
llama_model_loader: - type q6_K:   37 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 4.68 GiB (4.90 BPW)
load: printing all EOG tokens:
load:   - 151643 ('<|endoftext|>')
load:   - 151645 ('<|im_end|>')
load:   - 151662 ('<|fim_pad|>')
load:   - 151663 ('<|repo_name|>')
load:   - 151664 ('<|file_sep|>')
load: special tokens cache size = 106
load: token to piece cache size = 0.9319 MB
print_info: arch             = qwen3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 40960
print_info: n_embd           = 4096
print_info: n_layer          = 36
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 12288
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = -1
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 40960
print_info: rope_finetuned   = unknown
print_info: model type       = 8B
print_info: model params     = 8.19 B
print_info: general.name     = Model
print_info: vocab type       = BPE
print_info: n_vocab          = 151748
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors:   CPU_Mapped model buffer size =  4788.17 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: kv_unified    = false
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (40960) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.59 MiB
llama_kv_cache_unified:        CPU KV buffer size =   576.00 MiB
llama_kv_cache_unified: size =  576.00 MiB (  4096 cells,  36 layers,  1/1 seqs), K (f16):  288.00 MiB, V (f16):  288.00 MiB
llama_context:        CPU compute buffer size =   304.38 MiB
llama_context: graph nodes  = 1267
llama_context: graph splits = 1
clip_model_loader: model name:
clip_model_loader: description:  image encoder for MiniCPM-V
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    455
clip_model_loader: n_kv:         20

clip_model_loader: has vision encoder
clip_ctx: CLIP using CPU backend
load_hparams: projector:          resampler
load_hparams: n_embd:             1152
load_hparams: n_head:             16
load_hparams: n_ff:               4304
load_hparams: n_layer:            27
load_hparams: ffn_op:             gelu
load_hparams: projection_dim:     0

--- vision hparams ---
load_hparams: image_size:         448
load_hparams: patch_size:         14
load_hparams: has_llava_proj:     0
load_hparams: minicpmv_version:   6
load_hparams: proj_scale_factor:  0
load_hparams: n_wa_pattern:       0

load_hparams: model size:         1044.36 MiB
load_hparams: metadata size:      0.16 MiB
alloc_compute_meta:        CPU compute buffer size =   100.30 MiB
time=2025-09-12T14:51:16.835+08:00 level=INFO source=server.go:1288 msg="llama runner started in 2.89 seconds"
time=2025-09-12T14:51:16.835+08:00 level=INFO source=sched.go:473 msg="loaded runners" count=1
time=2025-09-12T14:51:16.835+08:00 level=INFO source=server.go:1250 msg="waiting for llama runner to start responding"
time=2025-09-12T14:51:16.835+08:00 level=INFO source=server.go:1288 msg="llama runner started in 2.89 seconds"
mtmd_encode_chunk has no effect for text chunks

NormanMises avatar Sep 12 '25 06:09 NormanMises

@NormanMises With a normal deployment, taking several minutes shouldn't happen. It still feels like a build problem somewhere; could you check whether CUDA is actually being used for inference?

Strange. I recompiled, and it still doesn't use CUDA 11.8 for inference. It would be great if you could release an installer package.

Oh, no wonder it's so slow; the compiled build defaults to CPU inference... And nothing is written about how to do GPU inference.

NormanMises avatar Sep 12 '25 07:09 NormanMises

@NormanMises We can consider your suggestion, but for now we haven't gone around the official ollama project to release our own package. You can try a CUDA build with the following commands.

cmake -B build
cmake --build build --config Release

tc-mb avatar Sep 12 '25 07:09 tc-mb

@NormanMises We can consider your suggestion, but for now we haven't gone around the official ollama project to release our own package. You can try a CUDA build with the following commands.

cmake -B build
cmake --build build --config Release

So we don't use go build . anymore?

NormanMises avatar Sep 12 '25 07:09 NormanMises

@NormanMises go build . runs the default build, but on your machine the default build apparently doesn't pick up CUDA. Specify it manually with the commands above; if an error appears, it means there is a problem with the CUDA environment, and you can then fix it based on the error message.
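Before rebuilding, it may also be worth confirming the toolchain can actually see CUDA (the paths below are the usual defaults for a CUDA 11.8 install; adjust them to your machine):

nvcc --version        # confirm the CUDA toolkit the build would pick up
gcc --version         # confirm the host compiler
# if nvcc is not found, put CUDA 11.8 on the build's search paths
export PATH=/usr/local/cuda-11.8/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64:$LD_LIBRARY_PATH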

tc-mb avatar Sep 12 '25 08:09 tc-mb

@NormanMises go build . runs the default build, but on your machine the default build apparently doesn't pick up CUDA. Specify it manually with the commands above; if an error appears, it means there is a problem with the CUDA environment, and you can then fix it based on the error message.

It errored out:

[ 47%] Building CXX object ml/backend/ggml/ggml/src/CMakeFiles/ggml-cpu-icelake.dir/ggml-cpu/vec.cpp.o
[ 48%] Building CXX object ml/backend/ggml/ggml/src/CMakeFiles/ggml-cpu-icelake.dir/ggml-cpu/ops.cpp.o
[ 48%] Building CXX object ml/backend/ggml/ggml/src/CMakeFiles/ggml-cpu-icelake.dir/ggml-cpu/llamafile/sgemm.cpp.o
[ 48%] Building C object ml/backend/ggml/ggml/src/CMakeFiles/ggml-cpu-icelake.dir/ggml-cpu/arch/x86/quants.c.o
[ 49%] Building CXX object ml/backend/ggml/ggml/src/CMakeFiles/ggml-cpu-icelake.dir/ggml-cpu/arch/x86/repack.cpp.o
[ 49%] Linking CXX shared module ../../../../../lib/ollama/libggml-cpu-icelake.so
[ 49%] Built target ggml-cpu-icelake
[ 50%] Building CXX object ml/backend/ggml/ggml/src/CMakeFiles/ggml-cpu-alderlake-feats.dir/ggml-cpu/arch/x86/cpu-feats.cpp.o
[ 50%] Built target ggml-cpu-alderlake-feats
[ 51%] Building C object ml/backend/ggml/ggml/src/CMakeFiles/ggml-cpu-alderlake.dir/ggml-cpu/ggml-cpu.c.o
cc: error: unrecognized command line option '-mavxvnni'; did you mean '-mavx512vnni'?
make[2]: *** [ml/backend/ggml/ggml/src/CMakeFiles/ggml-cpu-alderlake.dir/build.make:79: ml/backend/ggml/ggml/src/CMakeFiles/ggml-cpu-alderlake.dir/ggml-cpu/ggml-cpu.c.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:709: ml/backend/ggml/ggml/src/CMakeFiles/ggml-cpu-alderlake.dir/all] Error 2
make: *** [Makefile:136: all] Error 2

NormanMises avatar Sep 12 '25 08:09 NormanMises

@NormanMises This suggests that the CUDA environment on your machine may have problems, which is why the GPU build failed. That may also explain why the default build is slow.

tc-mb avatar Sep 18 '25 05:09 tc-mb