ollama-ipex-llm cannot run phi4-mini
The phi4-mini model fails to run on ollama-ipex-llm 2.2.0: a "missing tensor" error occurs while loading the model. Here are the error messages from both the Ollama CLI and the service log.
C:\Tools\ollama-ipex-llm>ollama run phi4-mini --verbose
ggml_sycl_init: found 1 SYCL devices:
Error: llama runner process has terminated: error loading model: missing tensor 'output.weight'
llama_model_load: error loading model: missing tensor 'output.weight'
llama_load_model_from_file: failed to load model
panic: unable to load model: C:\Users\zhangz36\.ollama\models\blobs\sha256-3c168af1dea0a414299c7d9077e100ac763370e5a98b3c53801a958a47f0a5db
goroutine 19 [running]:
ollama/llama/runner.(*Server).loadModel(0xc0000f6090, {0x3e7, 0x0, 0x0, 0x0, {0x0, 0x0, 0x0}, 0xc00045c1b0, 0x0}, ...)
ollama/llama/runner/runner.go:861 +0x4ee
created by ollama/llama/runner.Execute in goroutine 1
ollama/llama/runner/runner.go:1001 +0xd0d
time=2025-04-14T10:04:09.288+08:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: error loading model: missing tensor 'output.weight'"
[GIN] 2025/04/14 - 10:04:09 | 500 | 1.4318524s | 127.0.0.1 | POST "/api/generate"
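For reference, the failing request in the GIN log can be reproduced directly against the API, which makes it easy to check whether a newer build fixes the load error (the prompt below is just an arbitrary example):

```sh
# Reproduce the failing POST /api/generate call against the local service.
# On ollama-ipex-llm 2.2.0 this returns HTTP 500 with the
# "missing tensor 'output.weight'" error shown in the log above.
curl http://127.0.0.1:11434/api/generate -d '{
  "model": "phi4-mini",
  "prompt": "Hello",
  "stream": false
}'
```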
The most likely cause is the upstream Ollama version that ollama-ipex-llm 2.2.0 integrates. The phi4-mini model page notes: "this model requires Ollama 0.5.13 or later."
Does ollama-ipex-llm plan to integrate the latest Ollama? Thanks!
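A quick way to confirm the mismatch is to compare the version reported by the bundled binary against that requirement (a minimal check; the exact string printed by the ipex-llm build may differ):

```sh
# Print the upstream Ollama version bundled with ollama-ipex-llm;
# phi4-mini needs 0.5.13 or later according to its model page.
ollama --version
```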
https://github.com/chnxq/ollama/tree/chnxq/add-oneapi
You can try the branch above. It uses the latest Ollama version, but it has only been tested on Windows. ref: /llama/README-Intel-OneApi.md
- https://github.com/chnxq/ollama/tree/chnxq/add-oneapi
- https://github.com/chnxq/ollama/releases/tag/chnxq%2Fv0.0.1a
- https://github.com/chnxq/ollama/releases/download/chnxq%2Fv0.0.1a/ollama_oneapi_win_x64_20250415.zip
It requires runtime environment variables, then strangely reports "looking for compatible GPUs" without ever reporting that it actually found the Intel GPU, and it only runs (but does run) if you use ollama-intel-gpu.bat.
There is a difference, though, probably only if you have internal UHD graphics paired with a couple of Intel Arc A770s: ipex-ollama with `set ONEAPI_DEVICE_SELECTOR=level_zero:0,1` will detect the full 2x GPUs, while the chnxq build goes with only the CPU and 1x A770. But if you disable the on-CPU UHD graphics, they work the same.
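For anyone else wanting to try it, a rough sequence on Windows might look like the following. The zip layout and exactly what the batch file sets are assumptions on my side, so check README-Intel-OneApi.md and the batch file contents first:

```bat
:: Download and extract the prebuilt Windows package (curl and tar ship with recent Windows).
curl -LO https://github.com/chnxq/ollama/releases/download/chnxq%2Fv0.0.1a/ollama_oneapi_win_x64_20250415.zip
tar -xf ollama_oneapi_win_x64_20250415.zip

:: Optional: expose both the iGPU and the dGPU(s) to the oneAPI runtime, as discussed above.
set ONEAPI_DEVICE_SELECTOR=level_zero:0,1

:: ollama-intel-gpu.bat documents/sets the runtime variables the build needs;
:: check its contents, as it may also start the server for you.
call ollama-intel-gpu.bat

:: If the server is not already running, start it, then run the model from a second terminal.
ollama serve
ollama run phi4-mini --verbose
```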
Thank you guys! I will try these Ollama variants with OneAPI.
Yes, there may be problems with multiple graphics cards. I have only managed to run it on my laptop and have no other hardware to test with.
ollama-intel-gpu.bat in the zip file is there to show which environment variables need to be configured.
Thanks chnxq, great job regardless; Ollama actually runs. In general I find very little information about this, but with internal UHD graphics enabled and the Intel Arc A770 GPUs dedicated to handling the LLM model, I see a big performance boost when serving multiple simultaneous clients. That's why the internal GPU is important to me.
> https://github.com/chnxq/ollama/tree/chnxq/add-oneapi
>
> You can try the branch above. It uses the latest Ollama version, but it has only been tested on Windows. ref: /llama/README-Intel-OneApi.md
This is great! thank you for doing this!!!!
> https://github.com/chnxq/ollama/tree/chnxq/add-oneapi
>
> You can try the branch above. It uses the latest Ollama version, but it has only been tested on Windows. ref: /llama/README-Intel-OneApi.md
Many thanks for such a nice job; I have tested it on Debian 13 Linux.
- ./ollama serve
time=2025-04-18T07:58:19.367+08:00 level=INFO source=routes.go:1299 msg="Listening on [::]:11434 (version 0.0.0)"
time=2025-04-18T07:58:19.367+08:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-04-18T07:58:19.392+08:00 level=INFO source=types.go:130 msg="inference compute" id=0 library=oneapi variant="" compute="" driver=0.0 name="Intel(R) Arc(TM) Graphics" total="0 B" available="0 B"
- Using an application (like Cherry Studio) or running `./ollama run gemma3:12b` activates the SYCL device:
Running with Environment Variables: GGML_SYCL_DEBUG: 0, GGML_SYCL_DISABLE_OPT: 1, GGML_SYCL_DISABLE_GRAPH: 1
Build with Macros: GGML_SYCL_FORCE_MMQ: no, GGML_SYCL_F16: no
Found 1 SYCL devices:

| ID | Device Type | Name | Version | Max compute units | Max work group | Max sub group | Global mem size | Driver version |
|---|---|---|---|---|---|---|---|---|
| 0 | [level_zero:gpu:0] | Intel Arc Graphics | 12.71 | 128 | 1024 | 32 | 46899M | 1.6.32567+18 |

SYCL Optimization Feature:

| ID | Device Type | Reorder |
|---|---|---|
| 0 | [level_zero:gpu:0] | Y |
gemma3 (vision) models such as the 12B can be downloaded, and the vision function works as well.
- phi4-mini can be pulled and works well on the SYCL device:
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = phi3
llama_model_loader: - kv 1: phi3.rope.scaling.attn_factor f32 = 1.190238
llama_model_loader: - kv 2: general.type str = model
llama_model_loader: - kv 3: general.name str = Phi 4 Mini Instruct
llama_model_loader: - kv 4: general.finetune str = instruct
llama_model_loader: - kv 5: general.basename str = Phi-4
llama_model_loader: - kv 6: general.size_label str = mini
- In my case, I have to `source /opt/intel/oneapi/setvars.sh` first; otherwise Ollama runs only on the CPU device.
- Before `go build -o ollama`, don't forget to adjust the following case in llm/server.go: `case runtime.GOOS == "windows" && gpus[0].Library == "oneapi" && estimate.Layers == 0:`; delete the `runtime.GOOS == "windows" && ` part when testing on Linux (see the sketch right after this list).
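Putting the Linux steps above together, the end-to-end sequence is roughly the following (the server.go edit is the one quoted above; the rest is the flow already described in this thread):

```sh
# Load the Intel oneAPI environment first, otherwise only the CPU device is used.
source /opt/intel/oneapi/setvars.sh

# Edit llm/server.go as described above (drop the `runtime.GOOS == "windows" &&`
# part of the oneapi case when testing on Linux), then rebuild.
go build -o ollama

# Start the server, then pull and run the model from another terminal.
./ollama serve
./ollama pull phi4-mini
./ollama run phi4-mini
```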
By the way, `./ollama --version` reports 0.0.0 :)
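The 0.0.0 is just the default version string you get when building from source without stamping a version; upstream Ollama injects the release number at build time via -ldflags. A hypothetical example of stamping your own string (the exact package path of the Version variable is an assumption, so verify it in the source tree):

```sh
# Stamp a custom version string into the binary at build time.
# The package path below is assumed; check where the Version variable
# actually lives in the source tree before relying on it.
go build -ldflags "-X github.com/ollama/ollama/version.Version=0.0.1-oneapi" -o ollama
./ollama --version
```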