
ollama on windows nightly build/portable zip

publicarray opened this issue 7 months ago · 3 comments

Describe the bug

On Windows with a B580, the Ollama build errors with "flag provided but not defined: -ngl".

How to reproduce

Steps to reproduce the error (also consolidated as commands below):

  1. Extract ollama-ipex-llm-2.3.0b20250429-win.zip
  2. Run the start-ollama.bat file
  3. Attempt to run a model: E:\ollama-ipex-llm-win\ollama.exe run gemma3:4b-it-fp16
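
For convenience, the same steps as console commands (a sketch reusing the paths from this report; adjust them to wherever the zip was extracted):

REM run from the folder where ollama-ipex-llm-2.3.0b20250429-win.zip was extracted
cd /d E:\ollama-ipex-llm-win
start-ollama.bat
REM in a second terminal, once the server is listening
E:\ollama-ipex-llm-win\ollama.exe run gemma3:4b-it-fp16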

Screenshots

[screenshot of the error omitted]

Environment information

2025/05/18 14:37:09 routes.go:1230: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY:localhost,127.0.0.1 OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:true OLLAMA_KEEP_ALIVE:10m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:5 OLLAMA_MODELS:E:\\ollama\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[* http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-05-18T14:37:09.077+10:00 level=INFO source=images.go:432 msg="total blobs: 106"
time=2025-05-18T14:37:09.079+10:00 level=INFO source=images.go:439 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:   export GIN_MODE=release
 - using code:  gin.SetMode(gin.ReleaseMode)

[GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func3 (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func4 (5 handlers)
[GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.(*Server).PullHandler-fm (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.(*Server).PushHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.(*Server).DeleteHandler-fm (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.(*Server).CreateHandler-fm (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.(*Server).CopyHandler-fm (5 handlers)
[GIN-debug] GET    /api/ps                   --> github.com/ollama/ollama/server.(*Server).PsHandler-fm (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST   /api/embed                --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST   /v1/completions           --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] POST   /v1/embeddings            --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models                --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models/:model         --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (6 handlers)
time=2025-05-18T14:37:09.082+10:00 level=INFO source=routes.go:1297 msg="Listening on [::]:11434 (version 0.0.0)"
time=2025-05-18T14:37:09.082+10:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-05-18T14:37:09.082+10:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-05-18T14:37:09.082+10:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=6 efficiency=0 threads=12
time=2025-05-18T14:37:09.267+10:00 level=INFO source=amd_hip_windows.go:103 msg="AMD ROCm reports no devices found"
time=2025-05-18T14:37:09.267+10:00 level=INFO source=amd_windows.go:49 msg="no compatible amdgpu devices detected"
time=2025-05-18T14:37:09.271+10:00 level=INFO source=types.go:130 msg="inference compute" id=0 library=oneapi variant="" compute="" driver=0.0 name="\xc0" total="11.8 GiB" available="9.5 GiB"
[GIN] 2025/05/18 - 14:37:19 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/05/18 - 14:37:19 | 200 |     30.8947ms |       127.0.0.1 | POST     "/api/show"
time=2025-05-18T14:37:19.863+10:00 level=INFO source=server.go:107 msg="system memory" total="31.1 GiB" free="23.2 GiB" free_swap="18.4 GiB"
time=2025-05-18T14:37:19.866+10:00 level=INFO source=server.go:154 msg=offload library=oneapi layers.requested=-1 layers.model=35 layers.offload=31 layers.split="" memory.available="[9.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.3 GiB" memory.required.partial="9.4 GiB" memory.required.kv="1.1 GiB" memory.required.allocations="[9.4 GiB]" memory.weights.total="6.0 GiB" memory.weights.repeating="6.0 GiB" memory.weights.nonrepeating="1.3 GiB" memory.graph.full="517.0 MiB" memory.graph.partial="1.0 GiB" projector.weights="795.9 MiB" projector.graph="1.0 GiB"
time=2025-05-18T14:37:19.925+10:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-05-18T14:37:19.926+10:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-05-18T14:37:19.928+10:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-05-18T14:37:19.931+10:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
time=2025-05-18T14:37:19.931+10:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-05-18T14:37:19.931+10:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-05-18T14:37:19.931+10:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-05-18T14:37:19.931+10:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-05-18T14:37:19.937+10:00 level=INFO source=server.go:430 msg="starting llama server" cmd="E:\\ollama-ipex-llm-win\\ollama-lib.exe runner --ollama-engine --model E:\\ollama\\models\\blobs\\sha256-2e1715faf889527461e76d271e827bbe03f3d22b4b86acf6146671d72eb6d11d --ctx-size 8192 --batch-size 512 -ngl 999 --threads 6 --parallel 1 --port 60759"
time=2025-05-18T14:37:19.939+10:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2025-05-18T14:37:19.939+10:00 level=INFO source=server.go:605 msg="waiting for llama runner to start responding"
time=2025-05-18T14:37:19.940+10:00 level=INFO source=server.go:639 msg="waiting for server to become available" status="llm server error"
flag provided but not defined: -ngl
Runner usage
  -batch-size int
        Batch size (default 512)
  -ctx-size int
        Context (or KV cache) size (default 2048)
  -flash-attn
        Enable flash attention
  -kv-cache-type string
        quantization type for KV cache (default: f16)
  -lora value
        Path to lora layer file (can be specified multiple times)
  -main-gpu int
        Main GPU
  -mlock
        force system to keep model in RAM rather than swapping or compressing
  -model string
        Path to model binary file
  -multiuser-cache
        optimize input cache algorithm for multiple users
  -n-gpu-layers int
        Number of layers to offload to GPU
  -no-mmap
        do not memory-map model (slower load but may reduce pageouts if not using mlock)
  -parallel int
        Number of sequences to handle simultaneously (default 1)
  -port int
        Port to expose the server on (default 8080)
  -tensor-split string
        fraction of the model to offload to each GPU, comma-separated list of proportions
  -threads int
        Number of threads to use during generation (default 12)
  -verbose
        verbose output (default: disabled)
time=2025-05-18T14:37:20.190+10:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: exit status 2"
[GIN] 2025/05/18 - 14:37:20 | 500 |    551.1771ms |       127.0.0.1 | POST     "/api/generate"
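
For context on the failure itself: the log above shows ollama.exe spawning ollama-lib.exe with -ngl 999, while the runner's usage text only registers -n-gpu-layers, so the runner rejects -ngl during flag parsing and exits with status 2 before any model loading happens. A rough parser-level check (reusing the exact command line from the log; swapping the flag name here only illustrates the mismatch and is not a confirmed fix):

REM the command logged above, which fails at flag parsing
E:\ollama-ipex-llm-win\ollama-lib.exe runner --ollama-engine --model E:\ollama\models\blobs\sha256-2e1715faf889527461e76d271e827bbe03f3d22b4b86acf6146671d72eb6d11d --ctx-size 8192 --batch-size 512 -ngl 999 --threads 6 --parallel 1 --port 60759
REM prints: flag provided but not defined: -ngl (and exits with status 2)

REM the same invocation with the flag name the bundled runner actually defines
E:\ollama-ipex-llm-win\ollama-lib.exe runner --ollama-engine --model E:\ollama\models\blobs\sha256-2e1715faf889527461e76d271e827bbe03f3d22b4b86acf6146671d72eb6d11d --ctx-size 8192 --batch-size 512 -n-gpu-layers 999 --threads 6 --parallel 1 --port 60759
REM should at least get past flag parsing, per the Runner usage listing above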

Additional Information

[screenshots omitted]

publicarray · May 18 '25

Hi @publicarray, we currently do not support Gemma3; our team is still working on it. We recommend switching to other models such as Qwen3 or DeepSeek-R1 in the meantime.

sgwhat · May 19 '25

@sgwhat I was under the impression that the fp16 version was working: https://github.com/intel/ipex-llm/issues/13129#issuecomment-2848582750

publicarray · May 20 '25

The docs have also been updated to reflect this: https://github.com/intel/ipex-llm/commit/6f634b41da5cd8bf6c17231c788248d3fa6345d6

publicarray · May 20 '25