
[BUG] llama.cpp inference crashed for minicpm-v 2.6

Open luixiao0 opened this issue 1 year ago • 6 comments

是否已有关于该错误的issue或讨论? | Is there an existing issue / discussion for this?

  • [X] 我已经搜索过已有的issues和讨论 | I have searched the existing issues / discussions

该问题是否在FAQ中有解答? | Is there an existing answer for this in FAQ?

  • [X] 我已经搜索过FAQ | I have searched FAQ

当前行为 | Current Behavior

  1. Tested on ollama with OpenBMB/llama.cpp (branch minicpmv-main [58a14c37]), compiled with go-1.22.1, gcc-11.4.0, cmake-3.24.3.
  2. Compilation succeeds, but image embedding always fails with llama_get_logits_ith: invalid logits id X, reason: no logits. The error appears to occur in llama_sampling_prepare.
  3. The llama.cpp build target llama-minicpmv-cli works as intended; image embedding is functional.
  4. When the logits check is bypassed, ollama runs, but without image context.

期望行为 | Expected Behavior

ollama should run with image context

复现方法 | Steps To Reproduce

  1. Compile ollama with OpenBMB/llama.cpp (branch minicpmv-main [[58a14c37]]).
  2. Import the ollama model with the template provided.
  3. Load the model and type any text; the llama.cpp worker may crash without any response. (A build and import sketch follows below.)
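
For reference, the build and import flow looks roughly like this. It is only a sketch, assuming the standard ollama build steps (go generate, then go build) and a Modelfile like the one quoted later in this thread:

```bash
# Build the OpenBMB ollama fork against its bundled llama.cpp
git clone -b minicpm-v2.6 https://github.com/OpenBMB/ollama.git
cd ollama
git submodule update --init --recursive   # populate the llama.cpp submodule
go generate ./...                         # compiles the embedded llama.cpp runners
go build .

# Import the GGUF pair with the provided template, then chat
./ollama serve &
./ollama create minicpm-v2.6 -f Modelfile
./ollama run minicpm-v2.6
```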

运行环境 | Environment

- OS: ubuntu-20.04
- CUDA: 12.4
- go-1.22.1, gcc-11.4.0, cmake-3.24.3

备注 | Anything else?

I tried manually running the llama.cpp worker without the --embedding flag; the context then works, but without the image. A sketch of that manual invocation and the full server log follow.
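
For illustration, the manual invocation is roughly the runner command recorded in the log below with the --embedding flag removed; the blob paths and port are specific to this machine:

```bash
# Same command ollama spawns (see the log), minus --embedding
/tmp/ollama3701365928/runners/cuda_v12/ollama_llama_server \
  --model /root/.ollama/models/blobs/sha256-3a4078d53b46f22989adbf998ce5a3fd090b6541f112d7e936eb4204a04100b1 \
  --mmproj /root/.ollama/models/blobs/sha256-f8a805e9e62085805c69c427287acefc284932eb4abfe6e1b1ce431d27e2f4e0 \
  --ctx-size 8192 --batch-size 512 --n-gpu-layers 29 \
  --parallel 1 --port 38327
```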

` 2024/08/06 21:50:04 routes.go:1028: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]" time=2024-08-06T21:50:04.127Z level=INFO source=images.go:729 msg="total blobs: 0" time=2024-08-06T21:50:04.127Z level=INFO source=images.go:736 msg="total unused blobs removed: 0" [GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.

  • using env: export GIN_MODE=release
  • using code: gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST /api/pull --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers) [GIN-debug] POST /api/generate --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers) [GIN-debug] POST /api/chat --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers) [GIN-debug] POST /api/embeddings --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers) [GIN-debug] POST /api/create --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers) [GIN-debug] POST /api/push --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers) [GIN-debug] POST /api/copy --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers) [GIN-debug] DELETE /api/delete --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers) [GIN-debug] POST /api/show --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers) [GIN-debug] POST /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers) [GIN-debug] HEAD /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers) [GIN-debug] GET /api/ps --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers) [GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers) [GIN-debug] GET / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers) [GIN-debug] GET /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers) [GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers) [GIN-debug] HEAD / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers) [GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers) [GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers) time=2024-08-06T21:50:04.127Z level=INFO source=routes.go:1074 msg="Listening on 127.0.0.1:11434 (version 0.0.0)" time=2024-08-06T21:50:04.127Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama3701365928/runners time=2024-08-06T21:50:09.571Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 cuda_v12]" time=2024-08-06T21:50:10.152Z level=INFO source=types.go:71 msg="inference compute" id=GPU-1268fbe9-3b43-1444-f590-d5b2df97ff2c library=cuda compute=8.6 driver=12.4 name="NVIDIA GeForce RTX 3090" total="23.7 GiB" available="21.6 GiB" time=2024-08-06T21:50:10.152Z level=INFO source=types.go:71 msg="inference compute" id=GPU-5f788662-7221-1a92-9545-0ec9adae0ada library=cuda compute=8.6 driver=12.4 name="NVIDIA GeForce RTX 3090" total="23.7 GiB" available="21.7 GiB" time=2024-08-06T21:50:10.152Z level=INFO source=types.go:71 msg="inference compute" id=GPU-66a2e9ca-3e0b-d51c-ad7f-a637f711c421 library=cuda compute=8.6 driver=12.4 name="NVIDIA GeForce RTX 3090" total="23.7 GiB" available="20.3 GiB" time=2024-08-06T21:50:10.152Z level=INFO source=types.go:71 msg="inference compute" id=GPU-f13f4afd-9a14-f9cd-6984-1a5463e61bc1 library=cuda compute=6.1 driver=12.4 name="Tesla P4" total="7.4 GiB" available="6.3 GiB" [GIN] 2024/08/06 - 21:52:06 | 200 | 56.302µs | 127.0.0.1 | HEAD "/" [GIN] 2024/08/06 - 21:52:18 | 201 | 8.088493131s | 127.0.0.1 | POST "/api/blobs/sha256:3a4078d53b46f22989adbf998ce5a3fd090b6541f112d7e936eb4204a04100b1" [GIN] 
2024/08/06 - 21:52:20 | 201 | 1.711583409s | 127.0.0.1 | POST "/api/blobs/sha256:f8a805e9e62085805c69c427287acefc284932eb4abfe6e1b1ce431d27e2f4e0" [GIN] 2024/08/06 - 21:52:38 | 200 | 18.163524271s | 127.0.0.1 | POST "/api/create" [GIN] 2024/08/06 - 21:52:53 | 200 | 48.731µs | 127.0.0.1 | HEAD "/" [GIN] 2024/08/06 - 21:52:53 | 200 | 1.272054ms | 127.0.0.1 | POST "/api/show" [GIN] 2024/08/06 - 21:52:53 | 200 | 462.199µs | 127.0.0.1 | POST "/api/show" time=2024-08-06T21:52:55.412Z level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=29 memory.available="23.4 GiB" memory.required.full="6.5 GiB" memory.required.partial="6.5 GiB" memory.required.kv="448.0 MiB" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="425.3 MiB" memory.graph.full="478.0 MiB" memory.graph.partial="728.5 MiB" time=2024-08-06T21:52:55.422Z level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=29 memory.available="23.4 GiB" memory.required.full="6.5 GiB" memory.required.partial="6.5 GiB" memory.required.kv="448.0 MiB" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="425.3 MiB" memory.graph.full="478.0 MiB" memory.graph.partial="728.5 MiB" time=2024-08-06T21:52:55.422Z level=WARN source=server.go:227 msg="multimodal models don't support parallel requests yet" time=2024-08-06T21:52:55.422Z level=INFO source=server.go:338 msg="starting llama server" cmd="/tmp/ollama3701365928/runners/cuda_v12/ollama_llama_server --model /root/.ollama/models/blobs/sha256-3a4078d53b46f22989adbf998ce5a3fd090b6541f112d7e936eb4204a04100b1 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 29 --mmproj /root/.ollama/models/blobs/sha256-f8a805e9e62085805c69c427287acefc284932eb4abfe6e1b1ce431d27e2f4e0 --parallel 1 --port 38327" time=2024-08-06T21:52:55.423Z level=INFO source=sched.go:338 msg="loaded runners" count=1 time=2024-08-06T21:52:55.423Z level=INFO source=server.go:525 msg="waiting for llama runner to start responding" time=2024-08-06T21:52:55.424Z level=INFO source=server.go:562 msg="waiting for server to become available" status="llm server error" INFO [main] build info | build=3271 commit="1781edb6" tid="140379126362112" timestamp=1722981175 INFO [main] system info | n_threads=32 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140379126362112" timestamp=1722981175 total_threads=64 INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="63" port="38327" tid="140379126362112" timestamp=1722981175 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes ggml_cuda_init: CUDA_USE_TENSOR_CORES: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes time=2024-08-06T21:52:55.926Z level=INFO source=server.go:562 msg="waiting for server to become available" status="llm server loading model" llama_model_loader: loaded meta data with 22 key-value pairs and 339 tensors from /root/.ollama/models/blobs/sha256-3a4078d53b46f22989adbf998ce5a3fd090b6541f112d7e936eb4204a04100b1 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
llama_model_loader: - kv 0: general.architecture str = qwen2 llama_model_loader: - kv 1: general.name str = model llama_model_loader: - kv 2: qwen2.block_count u32 = 28 llama_model_loader: - kv 3: qwen2.context_length u32 = 32768 llama_model_loader: - kv 4: qwen2.embedding_length u32 = 3584 llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 18944 llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 28 llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 4 llama_model_loader: - kv 8: qwen2.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 9: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 10: general.file_type u32 = 15 llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 12: tokenizer.ggml.pre str = qwen2 llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,151666] = ["!", """, "#", "$", "%", "&", "'", ... llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,151666] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 151644 llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 151645 llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 128244 llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0 llama_model_loader: - kv 20: tokenizer.chat_template str = {% for message in messages %}{% if lo... llama_model_loader: - kv 21: general.quantization_version u32 = 2 llama_model_loader: - type f32: 141 tensors llama_model_loader: - type q4_K: 169 tensors llama_model_loader: - type q6_K: 29 tensors llm_load_vocab: special tokens cache size = 25 llm_load_vocab: token to piece cache size = 0.9309 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = qwen2 llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 151666 llm_load_print_meta: n_merges = 151387 llm_load_print_meta: n_ctx_train = 32768 llm_load_print_meta: n_embd = 3584 llm_load_print_meta: n_head = 28 llm_load_print_meta: n_head_kv = 4 llm_load_print_meta: n_layer = 28 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 7 llm_load_print_meta: n_embd_k_gqa = 512 llm_load_print_meta: n_embd_v_gqa = 512 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 18944 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 32768 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = ?B llm_load_print_meta: model ftype = Q4_K - Medium llm_load_print_meta: model params = 7.61 B llm_load_print_meta: model size = 4.35 GiB (4.91 BPW) llm_load_print_meta: general.name = model llm_load_print_meta: BOS token = 151644 '<|im_start|>' llm_load_print_meta: EOS token = 
151645 '<|im_end|>' llm_load_print_meta: UNK token = 128244 '' llm_load_print_meta: PAD token = 0 '!' llm_load_print_meta: LF token = 148848 'ÄĬ' llm_load_print_meta: EOT token = 151645 '<|im_end|>' llm_load_print_meta: max token length = 256 llm_load_tensors: ggml ctx size = 0.30 MiB llm_load_tensors: offloading 28 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 29/29 layers to GPU llm_load_tensors: CPU buffer size = 291.59 MiB llm_load_tensors: CUDA0 buffer size = 4166.97 MiB .................................................................................... llama_new_context_with_model: n_ctx = 8192 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 1000000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA0 KV buffer size = 448.00 MiB llama_new_context_with_model: KV self size = 448.00 MiB, K (f16): 224.00 MiB, V (f16): 224.00 MiB llama_new_context_with_model: CUDA_Host output buffer size = 0.01 MiB llama_new_context_with_model: CUDA0 compute buffer size = 492.00 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 23.01 MiB llama_new_context_with_model: graph nodes = 986 llama_new_context_with_model: graph splits = 2 INFO [main] model loaded | tid="140379126362112" timestamp=1722981177 time=2024-08-06T21:52:57.983Z level=INFO source=server.go:567 msg="llama runner started in 2.56 seconds" [GIN] 2024/08/06 - 21:52:57 | 200 | 4.757751855s | 127.0.0.1 | POST "/api/chat" llama_get_logits_ith: invalid logits id 10, reason: no logits`

luixiao0 avatar Aug 06 '24 22:08 luixiao0

https://github.com/OpenBMB/llama.cpp/tree/minicpmv-main-no-video-inference

Installing ffmpeg can be a hassle, so I split out a standalone branch. It can only recognize images, the same as before (no video), but you could give this one a try.
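
Trying that branch from the command line would look roughly like the following; this is a sketch that reuses the llama-minicpmv-cli flags quoted later in this thread, and the GGUF and image paths are placeholders:

```bash
git clone -b minicpmv-main-no-video-inference https://github.com/OpenBMB/llama.cpp.git
cd llama.cpp
make llama-minicpmv-cli                   # or the equivalent cmake target
./llama-minicpmv-cli \
  -m MiniCPM-V-2_6-gguf/ggml-model-Q4_K_M.gguf \
  --mmproj MiniCPM-V-2_6-gguf/mmproj-model-f16.gguf \
  -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 \
  --image ./test.png -p "What is in the image?"
```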

tc-mb avatar Aug 08 '24 06:08 tc-mb

testing locally

luixiao0 avatar Aug 08 '24 09:08 luixiao0

I ran into the same problem, and switching from the minicpmv-main branch to the minicpmv-main-no-video-inference branch still didn't help. [screenshot]

https://github.com/OpenBMB/llama.cpp/tree/minicpmv-main-no-video-inference

Installing ffmpeg can be a hassle, so I split out a standalone branch. It can only recognize images, the same as before (no video), but you could give this one a try.

cr-zhichen avatar Aug 08 '24 14:08 cr-zhichen

Sorry, I probably didn't read this carefully at first. I'll check the ollama branch as soon as I can; I initially thought the crash was only a llama.cpp problem. We just open-sourced this in the past few days, so our team is dealing with a lot of issues at once, but we will fix this as soon as possible.

tc-mb avatar Aug 08 '24 15:08 tc-mb

I am not a professional C++ programmer, so the following findings may not be correct. I cherry-picked the MiniCPM-V 2.6 code onto the 2.5 branch (quantized GGUF from HF) and got embedding working:

  1. The llama.cpp maintainers recently changed the embedding code paths (for pooling?), which is unlucky timing for OpenBMB since they are refactoring the entire multimodal implementation. The root cause may be in how llama.cpp processes the visual embedding; I tested MiniCPM-V 2.6 without requesting the visual embedding and it works fine.
  2. I think the injection of minicpmv_version isn't working properly for MiniCPM-V 2.6 (it should be 3 according to the model on HF), and hidden_size is incorrect. I tried to patch them but had no luck forcing the version number to 3; see the logs below and the metadata-inspection sketch after this list.
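
As a way to double-check those fields, the gguf-dump tool from the Python gguf package can print GGUF metadata. Note that the exact key names (e.g. clip.minicpmv_version) are an assumption based on llama.cpp's clip loader and may differ between branches:

```bash
pip install gguf                  # provides the gguf-dump script
# Inspect the vision projector metadata (key names are assumed)
gguf-dump mmproj-model-f16.gguf | grep -i -E "minicpmv|hidden|projection"
# Dump the first metadata entries of the language model GGUF
gguf-dump ggml-model-Q4_K_M.gguf | head -n 40
```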

Inference on the same ollama binary:

ollama run aiden_lu/minicpm-v2.6:Q4_K_M

hi GPTG:GPTGGPTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

ollama run hhao/openbmb-minicpm-llama3-v-2_5:latest

hi Hello! How can I assist you today? Do you have any questions or topics you'd like to discuss? I'm a large language model trained by OpenAI and I'm here to help with whatever you need. Just let me know what's on your mind.

And I checked that MiniCPM-V 2.5 image embedding is also working.

luixiao0 avatar Aug 08 '24 23:08 luixiao0

Sorry, I probably didn't read this carefully at first. I'll check the ollama branch as soon as I can; I initially thought the crash was only a llama.cpp problem. We just open-sourced this in the past few days, so our team is dealing with a lot of issues at once, but we will fix this as soon as possible.

No worries, excellent work; Python inference works out of the box anyway 😆

luixiao0 avatar Aug 09 '24 01:08 luixiao0

Successful run on ollama with the latest commit, great work.

luixiao0 avatar Aug 13 '24 06:08 luixiao0

I am pushing a release at my fork for x86_64_cu_124.

Update: I have compiled a working version as a temporary fix and uploaded it to my fork, feel free to check it out: https://github.com/luixiao0/ollama/releases

Update 2: I have pushed a docker image for minicpm-v-2.6: docker.io/luixiao/ollama
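
Running that image would presumably follow the usual ollama container conventions; the GPU flag, volume name, and port mapping below are assumptions, not something verified against this specific image:

```bash
docker pull luixiao/ollama
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama luixiao/ollama
# then pull and run the model inside the container
docker exec -it ollama ollama run aiden_lu/minicpm-v2.6:Q4_K_M
```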

luixiao0 avatar Aug 13 '24 06:08 luixiao0

Me too, I have the same issue: llama_get_logits_ith: invalid logits id X, reason: no logits

shadowwider avatar Aug 13 '24 07:08 shadowwider

I found that if I directly use llama.cpp (git clone -b minicpmv-main https://github.com/OpenBMB/llama.cpp.git), e.g. ./llama-minicpmv-cli -m MiniCPM-V-2_6-gguf/ggml-model-f16.gguf --mmproj MiniCPM-V-2_6-gguf/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image 'image/Screenshot 2024-07-16 at 11.52.51.png' -p "What is in the image?", then everything works fine.

But when I use the ollama fork (git clone -b minicpm-v2.6 https://github.com/OpenBMB/ollama.git) with that llama.cpp, even with only an x86 CPU, I always get llama_get_logits_ith: invalid logits id X, reason: no logits. I have tried both v2.5 and v2.6 and got the same error. It seems the ollama fork has some issue.

xunuohope1107 avatar Aug 13 '24 08:08 xunuohope1107

I found that if I directly use llama.cpp (git clone -b minicpmv-main https://github.com/OpenBMB/llama.cpp.git), e.g. ./llama-minicpmv-cli -m MiniCPM-V-2_6-gguf/ggml-model-f16.gguf --mmproj MiniCPM-V-2_6-gguf/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image 'image/Screenshot 2024-07-16 at 11.52.51.png' -p "What is in the image?", then everything works fine.

But when I use the ollama fork (git clone -b minicpm-v2.6 https://github.com/OpenBMB/ollama.git) with that llama.cpp, even with only an x86 CPU, I always get llama_get_logits_ith: invalid logits id X, reason: no logits. I have tried both v2.5 and v2.6 and got the same error. It seems the ollama fork has some issue.

A patch from ollama was deleted in the OpenBMB fork. Try using the master branches from ollama and ggerganov, then merge OpenBMB's work manually and modify the build Dockerfile to pull your branch; you may end up with a working copy of a 'future' ollama. (A rough sketch of that merge workflow is below.)
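
Very roughly, and only as a sketch of what that manual merge could look like (remote and branch names are assumptions, and the merge will almost certainly need conflicts resolved by hand):

```bash
# Merge the OpenBMB MiniCPM-V work onto upstream llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
git remote add openbmb https://github.com/OpenBMB/llama.cpp.git
git fetch openbmb
git checkout -b minicpmv-merge
git merge openbmb/minicpmv-main   # resolve conflicts manually

# Then edit your ollama fork so its llama.cpp submodule / build
# Dockerfile points at this merged branch, and rebuild ollama.
```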

luixiao0 avatar Aug 13 '24 08:08 luixiao0

I am pushing a release at my fork for x86_64_cu_124. Update: I have compiled a working version as a temporary fix and uploaded it to my fork, feel free to check it out: https://github.com/luixiao0/ollama/releases

Can you run MiniCPM-V 2.6 with Ollama? I built from your git, but when creating the model I get an error like this: [screenshot]. How do I deal with this?

shadowwider avatar Aug 13 '24 09:08 shadowwider

I am pushing a release at my fork for x86_64_cu_124. Update: I have compiled a working version as a temporary fix and uploaded it to my fork, feel free to check it out: https://github.com/luixiao0/ollama/releases

Can you run MiniCPM-V 2.6 with Ollama? I built from your git, but when creating the model I get an error like this: [screenshot]. How do I deal with this?

I also get the same error. The Modelfile is copied from OpenBMB:

FROM ../../MiniCPM-V-2_6-gguf/ggml-model-Q4_K_M.gguf
FROM ../../MiniCPM-V-2_6-gguf/mmproj-model-f16.gguf

TEMPLATE """{{ if .System }}<|im_start|>system

{{ .System }}<|im_end|>{{ end }}

{{ if .Prompt }}<|im_start|>user

{{ .Prompt }}<|im_end|>{{ end }}

<|im_start|>assistant<|im_end|>

{{ .Response }}<|im_end|>"""

PARAMETER stop "<|endoftext|>"
PARAMETER stop "<|im_end|>"
PARAMETER num_ctx 8192

xunuohope1107 avatar Aug 13 '24 09:08 xunuohope1107

I am pushing a release at my fork for x86_64_cu_124. Update: I have compiled a working version as a temporary fix and uploaded it to my fork, feel free to check it out: https://github.com/luixiao0/ollama/releases

Can you run MiniCPM-V 2.6 with Ollama? I built from your git, but when creating the model I get an error like this: [screenshot]. How do I deal with this?

I also get the same error. The Modelfile is copied from OpenBMB:

FROM ../../MiniCPM-V-2_6-gguf/ggml-model-Q4_K_M.gguf
FROM ../../MiniCPM-V-2_6-gguf/mmproj-model-f16.gguf

TEMPLATE """{{ if .System }}<|im_start|>system

{{ .System }}<|im_end|>{{ end }}

{{ if .Prompt }}<|im_start|>user

{{ .Prompt }}<|im_end|>{{ end }}

<|im_start|>assistant<|im_end|>

{{ .Response }}<|im_end|>"""

PARAMETER stop "<|endoftext|>"
PARAMETER stop "<|im_end|>"
PARAMETER num_ctx 8192

I also tried with the absolute paths of the model files.

xunuohope1107 avatar Aug 13 '24 09:08 xunuohope1107

I am pushing a release at my fork for x86_64_cu_124. Update: I have compiled a working version as a temporary fix and uploaded it to my fork, feel free to check it out: https://github.com/luixiao0/ollama/releases

Can you run MiniCPM-V 2.6 with Ollama? I built from your git, but when creating the model I get an error like this: [screenshot]. How do I deal with this?

Please try aiden_lu/minicpm-v2.6. OpenBMB updated their GGUF recently; the file in aiden_lu/minicpm-v2.6 was quantized from their earlier version of the code, and in my case it worked. You can get the same file from their HF history.

In fact, I am getting the same error with the latest GGUF published by OpenBMB, but the output from the older version should be the same.

luixiao0 avatar Aug 13 '24 09:08 luixiao0

[screenshot] Check this out 😊, it runs on a single RTX 3090; multiple GPUs should work, but that is not tested on my machine. Also check the docker version: https://hub.docker.com/r/luixiao/ollama

luixiao0 avatar Aug 13 '24 09:08 luixiao0

[screenshot] Check this out 😊, it runs on a single RTX 3090; multiple GPUs should work, but that is not tested on my machine.

Looks great! Are you using the master or the minicpmv-main branch of https://github.com/OpenBMB/llama.cpp.git? Or the official llama.cpp repo? I get Error: llama runner process has terminated: GGML_ASSERT(new_clip->has_llava_projector) failed when executing ollama run aiden_lu/minicpm-v2.6:Q4_K_M

xunuohope1107 avatar Aug 13 '24 09:08 xunuohope1107

[screenshot] Check this out 😊, it runs on a single RTX 3090; multiple GPUs should work, but that is not tested on my machine.

Looks great! Are you using the master or the minicpmv-main branch of https://github.com/OpenBMB/llama.cpp.git? Or the official llama.cpp repo? I get Error: llama runner process has terminated: GGML_ASSERT(new_clip->has_llava_projector) failed when executing ollama run aiden_lu/minicpm-v2.6:Q4_K_M

As commented above, sadly I am using the master branches from ollama and ggerganov, not OpenBMB, as my starting point, then merging OpenBMB's work manually, and finally modifying the build Dockerfile to pull your branch; that way you may get a working copy of a 'future' ollama.

luixiao0 avatar Aug 13 '24 10:08 luixiao0

[screenshot] Check this out 😊, it runs on a single RTX 3090; multiple GPUs should work, but that is not tested on my machine.

Looks great! Are you using the master or the minicpmv-main branch of https://github.com/OpenBMB/llama.cpp.git? Or the official llama.cpp repo? I get Error: llama runner process has terminated: GGML_ASSERT(new_clip->has_llava_projector) failed when executing ollama run aiden_lu/minicpm-v2.6:Q4_K_M

As commented above, sadly I am using the master branches from ollama and ggerganov, not OpenBMB, as my starting point, then merging OpenBMB's work manually, and finally modifying the build Dockerfile to pull your branch; that way you may get a working copy of a 'future' ollama.

Thanks for your comment! So the merged llama.cpp work is on repo https://github.com/luixiao0/llama.cpp, right?

xunuohope1107 avatar Aug 13 '24 10:08 xunuohope1107

[screenshot] Check this out 😊, it runs on a single RTX 3090; multiple GPUs should work, but that is not tested on my machine.

Looks great! Are you using the master or the minicpmv-main branch of https://github.com/OpenBMB/llama.cpp.git? Or the official llama.cpp repo? I get Error: llama runner process has terminated: GGML_ASSERT(new_clip->has_llava_projector) failed when executing ollama run aiden_lu/minicpm-v2.6:Q4_K_M

As commented above, sadly I am using the master branches from ollama and ggerganov, not OpenBMB, as my starting point, then merging OpenBMB's work manually, and finally modifying the build Dockerfile to pull your branch; that way you may get a working copy of a 'future' ollama.

Thanks for your comment! So the merged llama.cpp work is on repo https://github.com/luixiao0/llama.cpp, right?

Yeah, feel free to check my fork https://github.com/luixiao0/ollama. You can see that the llama.cpp submodule is redirected to my fork of llama.cpp. I updated the cmake/go/CUDA versions and changed the build docker from CentOS to Ubuntu (I found the CentOS 7 build won't pass because its packaging tools are EOL). The Dockerfile is a little bit messed up since I don't need to build ROCm or ARM; you can uncomment those parts.

luixiao0 avatar Aug 13 '24 10:08 luixiao0

[screenshot] Check this out 😊, it runs on a single RTX 3090; multiple GPUs should work, but that is not tested on my machine.

Looks great! Are you using the master or the minicpmv-main branch of https://github.com/OpenBMB/llama.cpp.git? Or the official llama.cpp repo? I get Error: llama runner process has terminated: GGML_ASSERT(new_clip->has_llava_projector) failed when executing ollama run aiden_lu/minicpm-v2.6:Q4_K_M

I've tried your solution, but I'm still encountering the same error. Whether I build the container using a Dockerfile or directly build and run the model, the issue remains. It's quite strange, as the code should be identical in both scenarios. I'm using the code from the repositories https://github.com/luixiao0/ollama and https://github.com/luixiao0/llama.cpp, and the build process completes without any errors.

shadowwider avatar Aug 14 '24 06:08 shadowwider

[screenshot] Check this out 😊, it runs on a single RTX 3090; multiple GPUs should work, but that is not tested on my machine.

Looks great! Are you using the master or the minicpmv-main branch of https://github.com/OpenBMB/llama.cpp.git? Or the official llama.cpp repo? I get Error: llama runner process has terminated: GGML_ASSERT(new_clip->has_llava_projector) failed when executing ollama run aiden_lu/minicpm-v2.6:Q4_K_M

I've tried your solution, but I'm still encountering the same error. Whether I build the container using a Dockerfile or directly build and run the model, the issue remains. It's quite strange, as the code should be identical in both scenarios. I'm using the code from the repositories https://github.com/luixiao0/ollama and https://github.com/luixiao0/llama.cpp, and the build process completes without any errors.

Try running git submodule update --init --recursive before compiling.
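
In other words, make sure the submodule is populated before the build, roughly:

```bash
cd ollama                                  # your checkout of the fork
git submodule update --init --recursive    # populate the llama.cpp submodule (path assumed to be llm/llama.cpp)
go generate ./... && go build .            # then rebuild
```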

luixiao0 avatar Aug 14 '24 06:08 luixiao0