[BUG] llama.cpp inference crashes for MiniCPM-V 2.6
Is there an existing issue / discussion for this?
- [X] I have searched the existing issues / discussions
Is there an existing answer for this in FAQ?
- [X] I have searched FAQ
Current Behavior
- Tested on ollama built with OpenBMB/llama.cpp (branch minicpmv-main, commit 58a14c37), compiled with go-1.22.1, gcc-11.4.0, cmake-3.24.3.
- The build succeeds, but image embedding always fails with `llama_get_logits_ith: invalid logits id X, reason: no logits`; the error appears to come from `llama_sampling_prepare`.
- The llama.cpp build target `llama-minicpmv-cli` works as intended; image embedding is functional there.
- When the logits check is bypassed, ollama runs, but without image context.
Expected Behavior
ollama should run with image context
Steps To Reproduce
- Compile ollama with OpenBMB/llama.cpp (branch minicpmv-main, commit 58a14c37).
- Import the model into ollama using the provided template.
- Load the model and type any text; the llama.cpp worker may crash without any response (a build sketch follows this list).
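For reference, a rough sketch of how I build the fork; the branch name matches the one referenced in this thread, but the go generate / go build steps are assumptions based on ollama's usual source-build flow:

```sh
# Sketch only: branch name is from this thread; build steps are assumed from
# ollama's normal source build (go generate regenerates the bundled llama.cpp runners).
git clone -b minicpm-v2.6 https://github.com/OpenBMB/ollama.git
cd ollama
git submodule update --init --recursive   # pulls the llama.cpp submodule (minicpmv-main)
go generate ./...
go build .
./ollama serve
```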
Environment
- OS: Ubuntu 20.04
- CUDA 12.4
- go-1.22.1, gcc-11.4.0, cmake-3.24.3
Anything else?
I tried manually running the llama.cpp worker without the `--embedding` flag; the context then works, but without images. A sketch of the manual invocation is below, followed by the full server log.
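For reference, the manual invocation is just the cmd line from the "starting llama server" entry in the log below, with `--embedding` (and `--log-disable`) dropped, so all paths and the port are specific to this run:

```sh
# Same invocation ollama uses (see the "starting llama server" log line below),
# minus --embedding / --log-disable; text generation then works, image context does not.
/tmp/ollama3701365928/runners/cuda_v12/ollama_llama_server \
  --model /root/.ollama/models/blobs/sha256-3a4078d53b46f22989adbf998ce5a3fd090b6541f112d7e936eb4204a04100b1 \
  --mmproj /root/.ollama/models/blobs/sha256-f8a805e9e62085805c69c427287acefc284932eb4abfe6e1b1ce431d27e2f4e0 \
  --ctx-size 8192 --batch-size 512 --n-gpu-layers 29 \
  --parallel 1 --port 38327
```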
```
2024/08/06 21:50:04 routes.go:1028: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]"
time=2024-08-06T21:50:04.127Z level=INFO source=images.go:729 msg="total blobs: 0"
time=2024-08-06T21:50:04.127Z level=INFO source=images.go:736 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.
[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
- using env: export GIN_MODE=release
- using code: gin.SetMode(gin.ReleaseMode)
[GIN-debug] POST /api/pull --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers)
[GIN-debug] POST /api/generate --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST /api/chat --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST /api/embeddings --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST /api/create --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers)
[GIN-debug] POST /api/push --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers)
[GIN-debug] POST /api/copy --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers)
[GIN-debug] POST /api/show --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers)
[GIN-debug] POST /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET /api/ps --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers)
[GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] GET / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-08-06T21:50:04.127Z level=INFO source=routes.go:1074 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-08-06T21:50:04.127Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama3701365928/runners
time=2024-08-06T21:50:09.571Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 cuda_v12]"
time=2024-08-06T21:50:10.152Z level=INFO source=types.go:71 msg="inference compute" id=GPU-1268fbe9-3b43-1444-f590-d5b2df97ff2c library=cuda compute=8.6 driver=12.4 name="NVIDIA GeForce RTX 3090" total="23.7 GiB" available="21.6 GiB"
time=2024-08-06T21:50:10.152Z level=INFO source=types.go:71 msg="inference compute" id=GPU-5f788662-7221-1a92-9545-0ec9adae0ada library=cuda compute=8.6 driver=12.4 name="NVIDIA GeForce RTX 3090" total="23.7 GiB" available="21.7 GiB"
time=2024-08-06T21:50:10.152Z level=INFO source=types.go:71 msg="inference compute" id=GPU-66a2e9ca-3e0b-d51c-ad7f-a637f711c421 library=cuda compute=8.6 driver=12.4 name="NVIDIA GeForce RTX 3090" total="23.7 GiB" available="20.3 GiB"
time=2024-08-06T21:50:10.152Z level=INFO source=types.go:71 msg="inference compute" id=GPU-f13f4afd-9a14-f9cd-6984-1a5463e61bc1 library=cuda compute=6.1 driver=12.4 name="Tesla P4" total="7.4 GiB" available="6.3 GiB"
[GIN] 2024/08/06 - 21:52:06 | 200 | 56.302µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/08/06 - 21:52:18 | 201 | 8.088493131s | 127.0.0.1 | POST "/api/blobs/sha256:3a4078d53b46f22989adbf998ce5a3fd090b6541f112d7e936eb4204a04100b1"
[GIN] 2024/08/06 - 21:52:20 | 201 | 1.711583409s | 127.0.0.1 | POST "/api/blobs/sha256:f8a805e9e62085805c69c427287acefc284932eb4abfe6e1b1ce431d27e2f4e0"
[GIN] 2024/08/06 - 21:52:38 | 200 | 18.163524271s | 127.0.0.1 | POST "/api/create"
[GIN] 2024/08/06 - 21:52:53 | 200 | 48.731µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/08/06 - 21:52:53 | 200 | 1.272054ms | 127.0.0.1 | POST "/api/show"
[GIN] 2024/08/06 - 21:52:53 | 200 | 462.199µs | 127.0.0.1 | POST "/api/show"
time=2024-08-06T21:52:55.412Z level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=29 memory.available="23.4 GiB" memory.required.full="6.5 GiB" memory.required.partial="6.5 GiB" memory.required.kv="448.0 MiB" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="425.3 MiB" memory.graph.full="478.0 MiB" memory.graph.partial="728.5 MiB"
time=2024-08-06T21:52:55.422Z level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=29 memory.available="23.4 GiB" memory.required.full="6.5 GiB" memory.required.partial="6.5 GiB" memory.required.kv="448.0 MiB" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="425.3 MiB" memory.graph.full="478.0 MiB" memory.graph.partial="728.5 MiB"
time=2024-08-06T21:52:55.422Z level=WARN source=server.go:227 msg="multimodal models don't support parallel requests yet"
time=2024-08-06T21:52:55.422Z level=INFO source=server.go:338 msg="starting llama server" cmd="/tmp/ollama3701365928/runners/cuda_v12/ollama_llama_server --model /root/.ollama/models/blobs/sha256-3a4078d53b46f22989adbf998ce5a3fd090b6541f112d7e936eb4204a04100b1 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 29 --mmproj /root/.ollama/models/blobs/sha256-f8a805e9e62085805c69c427287acefc284932eb4abfe6e1b1ce431d27e2f4e0 --parallel 1 --port 38327"
time=2024-08-06T21:52:55.423Z level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-08-06T21:52:55.423Z level=INFO source=server.go:525 msg="waiting for llama runner to start responding"
time=2024-08-06T21:52:55.424Z level=INFO source=server.go:562 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=3271 commit="1781edb6" tid="140379126362112" timestamp=1722981175
INFO [main] system info | n_threads=32 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140379126362112" timestamp=1722981175 total_threads=64
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="63" port="38327" tid="140379126362112" timestamp=1722981175
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
time=2024-08-06T21:52:55.926Z level=INFO source=server.go:562 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 22 key-value pairs and 339 tensors from /root/.ollama/models/blobs/sha256-3a4078d53b46f22989adbf998ce5a3fd090b6541f112d7e936eb4204a04100b1 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.name str = model
llama_model_loader: - kv 2: qwen2.block_count u32 = 28
llama_model_loader: - kv 3: qwen2.context_length u32 = 32768
llama_model_loader: - kv 4: qwen2.embedding_length u32 = 3584
llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 18944
llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 28
llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 4
llama_model_loader: - kv 8: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 9: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 15
llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 12: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,151666] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,151666] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 151644
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 128244
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.chat_template str = {% for message in messages %}{% if lo...
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type q4_K: 169 tensors
llama_model_loader: - type q6_K: 29 tensors
llm_load_vocab: special tokens cache size = 25
llm_load_vocab: token to piece cache size = 0.9309 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151666
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 3584
llm_load_print_meta: n_head = 28
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 7
llm_load_print_meta: n_embd_k_gqa = 512
llm_load_print_meta: n_embd_v_gqa = 512
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 18944
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 7.61 B
llm_load_print_meta: model size = 4.35 GiB (4.91 BPW)
llm_load_print_meta: general.name = model
llm_load_print_meta: BOS token = 151644 '<|im_start|>'
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: UNK token = 128244 '
```
https://github.com/OpenBMB/llama.cpp/tree/minicpmv-main-no-video-inference
Installing ffmpeg can be a hassle, so I split out a standalone branch. It can only recognize images, like before, but you could give this one a try.
testing locally
I ran into the same problem, and switching from the minicpmv-main branch to the minicpmv-main-no-video-inference branch still did not help.
> https://github.com/OpenBMB/llama.cpp/tree/minicpmv-main-no-video-inference
> Installing ffmpeg can be a hassle, so I split out a standalone branch …
Sorry, I probably didn't read it carefully at first. I will check the ollama branch as soon as I can; I initially assumed the crash only happened in llama.cpp itself. We only open-sourced this a few days ago, so our team is facing a lot of issues at once, but we will fix them as soon as possible.
- I am not a professional C++ programmer, so the following observations may not be correct. I cherry-picked the CPM-V 2.6 code onto the 2.5 branch (quantized GGUF from HF) and that makes embedding work.
- The llama.cpp maintainers recently changed the embedding methods (for pooling?), which is unlucky for OpenBMB since they are refactoring the entire multimodal implementation. The root cause may be in how llama.cpp processes the visual embedding; I tested CPM 2.6 without calling the visual embedding and it works fine.
- I think the injection of minicpmv_version isn't working properly for MiniCPM-V 2.6 (it should be 3 according to the model on HF) and hidden_size is incorrect. I am trying to patch them, with no luck so far at forcing the version number to 3; see the logs below.
Inference on the same ollama binary:

```
ollama run aiden_lu/minicpm-v2.6:Q4_K_M
hi
GPTG:GPTGGPTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

ollama run hhao/openbmb-minicpm-llama3-v-2_5:latest
hi
Hello! How can I assist you today? Do you have any questions or topics you'd like to discuss? I'm a large language model trained by OpenAI and I'm here to help with whatever you need. Just let me know what's on your mind.
```

And I checked that CPM-V 2.5 image embedding is also working.
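For what it's worth, a crude way to check which minicpmv-related metadata keys the converter actually wrote into the mmproj GGUF; the exact key names are an assumption on my part, but GGUF stores key names as plain strings, so grep can surface them:

```sh
# Crude check: GGUF metadata key names are plain UTF-8 strings inside the file,
# so this lists anything mentioning "minicpmv" (e.g. a version key, if present).
strings /root/.ollama/models/blobs/sha256-f8a805e9e62085805c69c427287acefc284932eb4abfe6e1b1ce431d27e2f4e0 | grep -i minicpmv
```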
> Sorry, I probably didn't read it carefully at first. I will check the ollama branch as soon as I can …
No worries, excellent work; Python inference works out of the box anyway 😆
successful run on ollama with latest commit, great work
I am pushing a release at my fork for x86_64 + CUDA 12.4. Update: I have compiled a working version as a temporary fix and uploaded it to my fork, feel free to check it out: https://github.com/luixiao0/ollama/releases. Update 2: I have pushed a Docker image for MiniCPM-V 2.6: docker.io/luixiao/ollama
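If you want to try the Docker image, a minimal run command; the flags just mirror the upstream ollama Docker instructions, and I am assuming the image does not need anything extra:

```sh
# Assumes the NVIDIA container toolkit is set up; maps the default ollama port and model dir.
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama docker.io/luixiao/ollama
```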
Me too, I have the same issue: `llama_get_logits_ith: invalid logits id X, reason: no logits`
I found that if I use llama.cpp directly (`git clone -b minicpmv-main https://github.com/OpenBMB/llama.cpp.git`), for example `./llama-minicpmv-cli -m MiniCPM-V-2_6-gguf/ggml-model-f16.gguf --mmproj MiniCPM-V-2_6-gguf/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image 'image/Screenshot 2024-07-16 at 11.52.51.png' -p "What is in the image?"`, then everything works fine.
But when I use the ollama fork (`git clone -b minicpm-v2.6 https://github.com/OpenBMB/ollama.git`) with that llama.cpp, even with only an x86 CPU, I always get `llama_get_logits_ith: invalid logits id X, reason: no logits`. I have tried both v2.5 and v2.6 and got the same error. It seems the ollama fork has some issue.
> I found that if I use llama.cpp directly (minicpmv-main), everything works fine, but when I use the OpenBMB ollama fork with it I always get `llama_get_logits_ith: invalid logits id X, reason: no logits` …
The patch from ollama was deleted in the OpenBMB fork. Try using the master branches from ollama and ggerganov, then merge OpenBMB's work manually and modify the build Dockerfile to pull your branch; you may get a working copy of a 'future' ollama. A rough sketch of what I mean is below.
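```sh
# Rough sketch only; the submodule path (llm/llama.cpp) and branch names are my
# assumptions, and the merges will almost certainly need manual conflict resolution.
git clone https://github.com/ollama/ollama.git && cd ollama
git submodule update --init --recursive
git remote add openbmb https://github.com/OpenBMB/ollama.git
git fetch openbmb
git merge openbmb/minicpm-v2.6            # bring in the MiniCPM-V patches

cd llm/llama.cpp
git remote add openbmb-llamacpp https://github.com/OpenBMB/llama.cpp.git
git fetch openbmb-llamacpp
git merge openbmb-llamacpp/minicpmv-main  # same for the llama.cpp submodule
cd ../..
# finally, point the Dockerfile / .gitmodules at your own branches before building
```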
> I am pushing a release at my fork for x86_64 + CUDA 12.4 … feel free to check it out: https://github.com/luixiao0/ollama/releases

Can you run MiniCPM-V 2.6 with Ollama? I built from your git, but when creating the model I get an error like this. How do I deal with this?
Also getting the same error. The Modelfile is copied from OpenBMB:

```
FROM ../../MiniCPM-V-2_6-gguf/ggml-model-Q4_K_M.gguf
FROM ../../MiniCPM-V-2_6-gguf/mmproj-model-f16.gguf

TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>{{ end }}
{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>{{ end }}
<|im_start|>assistant<|im_end|>
{{ .Response }}<|im_end|>"""

PARAMETER stop "<|endoftext|>"
PARAMETER stop "<|im_end|>"
PARAMETER num_ctx 8192
```
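For reference, this is how I load that Modelfile; the model name is just a local tag I picked:

```sh
# Create a local model from the Modelfile above, then start an interactive session.
ollama create minicpm-v2.6 -f Modelfile
ollama run minicpm-v2.6
```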
> Also getting the same error. The Modelfile is copied from OpenBMB …
I also tried with the absolute paths of the model files.
> Can you run MiniCPM-V 2.6 with Ollama? I built from your git, but when creating the model I get an error like this …
Please try aiden_lu/minicpm-v2.6. OpenBMB updated their GGUF recently; the file in aiden_lu/minicpm-v2.6 is quantized from an earlier version of their code. In my case it worked, and you can get the same file from their HF history.
In fact, I am getting the same error with the latest GGUF published by OpenBMB, but the output from the older version should be the same.
Check this out 😊, it runs on a single RTX 3090; multiple GPUs should work, but I have not tested that on my machine.
Looks great! Are you using the master or minicpmv-main branch of https://github.com/OpenBMB/llama.cpp.git, or the official llama.cpp repo? I got `Error: llama runner process has terminated: GGML_ASSERT(new_clip->has_llava_projector) failed` when executing `ollama run aiden_lu/minicpm-v2.6:Q4_K_M`.
> Are you using the master or minicpmv-main branch of https://github.com/OpenBMB/llama.cpp.git, or the official llama.cpp repo? …
As commented above, sadly I am using the master branches from ollama and ggerganov (not OpenBMB) as my starting point, then merging OpenBMB's work manually; finally, modify the build Dockerfile to pull your branch and you may get a working copy of a 'future' ollama.
> … I am using the master branches from ollama and ggerganov as my starting point, then merging OpenBMB's work manually; finally, modify the build Dockerfile to pull your branch …
Thanks for your comment! So the merged llama.cpp work is on repo https://github.com/luixiao0/llama.cpp, right?
> Thanks for your comment! So the merged llama.cpp work is on the repo https://github.com/luixiao0/llama.cpp, right?
Yeah, feel free to check my fork https://github.com/luixiao0/ollama. You will find that the llama.cpp submodule is redirected to my fork of llama.cpp (roughly as in the sketch below). I updated the cmake/go/CUDA versions and changed the build Docker image from CentOS to Ubuntu (I found the CentOS 7 build won't pass because its packaging tools are EOL). The Dockerfile is a little bit messy since I don't need to build ROCm or ARM; you can uncomment those parts.
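For reference, a minimal sketch of what that submodule redirection amounts to; the submodule path llm/llama.cpp is an assumption about where ollama keeps it:

```sh
# Repoint the llama.cpp submodule at the fork, then re-sync and update it.
git config -f .gitmodules submodule.llm/llama.cpp.url https://github.com/luixiao0/llama.cpp
git submodule sync llm/llama.cpp
git submodule update --init --recursive
```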
> I got `Error: llama runner process has terminated: GGML_ASSERT(new_clip->has_llava_projector) failed` when executing `ollama run aiden_lu/minicpm-v2.6:Q4_K_M` …
I've tried your solution, but I'm still encountering the same error. Whether I build the container using a Dockerfile or directly build and run the model, the issue remains. It's quite strange, as the code should be identical in both scenarios. I'm using the code from the repositories https://github.com/luixiao0/ollama and https://github.com/luixiao0/llama.cpp, and the build process completes without any errors.
> I've tried your solution, but I'm still encountering the same error, whether I build the container using the Dockerfile or build and run directly …
Try running `git submodule update --init --recursive` before compiling, roughly as in the sketch below.
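Something like this, assuming a source build with go generate / go build (my assumption about the fork's build flow):

```sh
# Make sure the llama.cpp submodule is populated before regenerating the runners.
git clone https://github.com/luixiao0/ollama.git && cd ollama
git submodule update --init --recursive
go generate ./...
go build .
```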