
ollama-ipex-llm cannot support phi4-mini

Open · thinkdoggie opened this issue 8 months ago • 7 comments

The phi4-mini model fails to run on ollama-ipex-llm 2.2.0. A missing-tensor error occurs when loading the model. Here are the error messages from both the ollama CLI and the service log.

C:\Tools\ollama-ipex-llm>ollama run phi4-mini --verbose
ggml_sycl_init: found 1 SYCL devices:
Error: llama runner process has terminated: error loading model: missing tensor 'output.weight'
llama_model_load: error loading model: missing tensor 'output.weight'
llama_load_model_from_file: failed to load model
panic: unable to load model: C:\Users\zhangz36\.ollama\models\blobs\sha256-3c168af1dea0a414299c7d9077e100ac763370e5a98b3c53801a958a47f0a5db

goroutine 19 [running]:
ollama/llama/runner.(*Server).loadModel(0xc0000f6090, {0x3e7, 0x0, 0x0, 0x0, {0x0, 0x0, 0x0}, 0xc00045c1b0, 0x0}, ...)
        ollama/llama/runner/runner.go:861 +0x4ee
created by ollama/llama/runner.Execute in goroutine 1
        ollama/llama/runner/runner.go:1001 +0xd0d
time=2025-04-14T10:04:09.288+08:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: error loading model: missing tensor 'output.weight'"
[GIN] 2025/04/14 - 10:04:09 | 500 |    1.4318524s |       127.0.0.1 | POST     "/api/generate"

The most likely cause is the upstream Ollama version that ollama-ipex-llm 2.2.0 integrates. The phi4-mini model page notes: "this model requires Ollama 0.5.13 or later."
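As a quick sanity check (just a sketch, run from the same portable folder as in the log above), the bundled Ollama version can be printed and compared against that requirement:

C:\Tools\ollama-ipex-llm>ollama --version

Anything below 0.5.13 there would be consistent with the missing 'output.weight' error, since phi4-mini's GGUF appears to rely on loader support added in newer upstream releases.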

Does ollama-ipex-llm have plans to integrate the latest Ollama? Thanks!

thinkdoggie · Apr 14 '25 02:04

https://github.com/chnxq/ollama/tree/chnxq/add-oneapi

You can try the above one. It uses the latest ollama version, but it has only been tested on Windows. ref: /llama/README-Intel-OneApi.md

chnxq · Apr 14 '25 17:04

https://github.com/chnxq/ollama/tree/chnxq/add-oneapi
https://github.com/chnxq/ollama/releases/tag/chnxq%2Fv0.0.1a
https://github.com/chnxq/ollama/releases/download/chnxq%2Fv0.0.1a/ollama_oneapi_win_x64_20250415.zip

It requires runtime environment variables. It then, somewhat strangely, reports "looking for compatible GPUs" but never reports actually having found the Intel GPU, and it only runs (but it does run) if you use ollama-intel-gpu.bat.

There is a difference, though, probably only if you have internal UHD graphics paired with a couple of Intel Arc A770s: ipex-ollama with "set ONEAPI_DEVICE_SELECTOR=level_zero:0,1" will detect both GPUs, while chnxq's build will go with only the CPU and one A770. If you disable the on-CPU UHD graphics, they behave the same.
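For reference, a minimal sketch of the device-selection step (Windows cmd; the level_zero indices are the ones from my setup and may differ on other machines):

REM expose both the UHD iGPU and the Arc dGPU to the SYCL runtime
set ONEAPI_DEVICE_SELECTOR=level_zero:0,1
REM or pin to a single device, e.g. one Arc card only:
REM set ONEAPI_DEVICE_SELECTOR=level_zero:1
ollama serve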

stigva · Apr 15 '25 17:04

Thank you guys! I will try these Ollama variants with OneAPI.

thinkdoggie · Apr 16 '25 02:04

Yes, there may be problems with multiple graphics cards. I have only run it on my laptop and have no other hardware to test with.

The ollama-intel-gpu.bat in the zip file is there to show which environment variables need to be configured.
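In general the pattern looks like this (a simplified sketch; check the actual .bat in the zip for the exact variables and values):

REM bring the oneAPI runtime into the environment (default install path shown)
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
REM choose which level-zero devices the SYCL backend may use
set ONEAPI_DEVICE_SELECTOR=level_zero:0
REM start the server from the unpacked folder
ollama serve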

chnxq · Apr 16 '25 09:04

Thanks chnxq, great job regardless; ollama does actually run, though in general I find very little information about this setup. With the internal UHD graphics enabled and the Intel Arc A770 GPUs used only for the LLM model, I see a big performance boost when serving multiple simultaneous clients. That's why the internal GPU is important for me.

stigva · Apr 16 '25 17:04

https://github.com/chnxq/ollama/tree/chnxq/add-oneapi

You can try the above one. It uses the latest ollama version, but it has only been tested on Windows. ref: /llama/README-Intel-OneApi.md

This is great! thank you for doing this!!!!


Many thanks for such a nice job. I have tested it on Debian 13 Linux.

  1. ./ollama serve

time=2025-04-18T07:58:19.367+08:00 level=INFO source=routes.go:1299 msg="Listening on [::]:11434 (version 0.0.0)"
time=2025-04-18T07:58:19.367+08:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-04-18T07:58:19.392+08:00 level=INFO source=types.go:130 msg="inference compute" id=0 library=oneapi variant="" compute="" driver=0.0 name="Intel(R) Arc(TM) Graphics" total="0 B" available="0 B"

  2. Using an application (like Cherry Studio) or running ./ollama run gemma3:12b activates the SYCL device:

Running with Environment Variables:
  GGML_SYCL_DEBUG: 0
  GGML_SYCL_DISABLE_OPT: 1
  GGML_SYCL_DISABLE_GRAPH: 1
Build with Macros:
  GGML_SYCL_FORCE_MMQ: no
  GGML_SYCL_F16: no
Found 1 SYCL devices:

ID  Device Type         Name                Version  Max compute units  Max work group  Max sub group  Global mem size  Driver version
0   [level_zero:gpu:0]  Intel Arc Graphics  12.71    128                1024            32             46899M           1.6.32567+18

SYCL Optimization Feature:
ID  Device Type         Reorder
--  ------------------  -------
0   [level_zero:gpu:0]  Y

gemma3 (vision) models such as the 12B can be downloaded, and the vision function works as well.

  3. phi4-mini can be pulled and works well on the SYCL device:

llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0: general.architecture           str = phi3
llama_model_loader: - kv   1: phi3.rope.scaling.attn_factor  f32 = 1.190238
llama_model_loader: - kv   2: general.type                   str = model
llama_model_loader: - kv   3: general.name                   str = Phi 4 Mini Instruct
llama_model_loader: - kv   4: general.finetune               str = instruct
llama_model_loader: - kv   5: general.basename               str = Phi-4
llama_model_loader: - kv   6: general.size_label             str = mini

  4. In my case, I have to source /opt/intel/oneapi/setvars.sh first, otherwise ollama runs only on the CPU device.
  5. Before running go build -o ollama, don't forget to adjust the following case in llm/server.go (delete the runtime.GOOS == "windows" && part when testing on Linux); a combined sketch of these Linux steps follows below.
     case runtime.GOOS == "windows" && gpus[0].Library == "oneapi" && estimate.Layers == 0:
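Putting points 1-5 together, the whole Linux flow looks roughly like this (a rough outline using only the commands above, not official build instructions for the fork):

# activate the oneAPI/SYCL runtime first, otherwise only the CPU device is found
source /opt/intel/oneapi/setvars.sh
# after editing llm/server.go as in point 5, rebuild and start the server
go build -o ollama
./ollama serve
# in a second shell, pull and run the model on the SYCL device
./ollama run phi4-mini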

By the way, ./ollama --version reports 0.0.0 :)

WayneJin88888 · Apr 18 '25 00:04