ollama + deepseek v2: The number of work-items in each dimension of a work-group cannot exceed {512, 512, 512} for this device
The number of work-items in each dimension of a work-group cannot exceed {512, 512, 512} for this device. Exception caught at file:/home/runner/_work/llm.cpp/llm.cpp/ollama-llama-cpp/ggml/src/ggml-sycl/ggml-sycl.cpp, line:4463
Using this container (https://github.com/mattcurf/ollama-intel-gpu), running on NixOS:
podman build -t "ollama-intel-gpu" .
podman run --rm -p 127.0.0.1:11434:11434 -v /home/stereomato/models:/mnt -v ollama-volume:/root/.ollama -e OLLAMA_NUM_PARALLEL=1 -e OLLAMA_MAX_LOADED_MODELS=1 -e OLLAMA_FLASH_ATTENTION=1 -e OLLAMA_NUM_GPU=999 -e DEVICE=iGPU --device /dev/dri --name=ollama-intel-gpu ollama-intel-gpu
podman exec -it ollama-intel-gpu bash
./ollama pull deepseek-v2:16b (the q4_k_m 16b quantization also exhibits the same issue)
./ollama run deepseek-v2 "hello deepseek"
Then, I get the error in the title/first two lines of this bug report.
HW: Intel i5-12500H, Intel Xe Graphics (Alder Lake), 24 GB of RAM, up-to-date NixOS
Never mind, this seems to be a memory limitation. Is there a way to work around it?
You can try tuning OLLAMA_NUM_GPU down from 999, e.g. OLLAMA_NUM_GPU=18. That puts 18 layers on the GPU and the remaining layers on the CPU.
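As a sketch, using the container and model names from the commands above (18 is just an example layer count; tune it for your VRAM):

```shell
# Offload only 18 transformer layers to the iGPU instead of all of them (999);
# the remaining layers run on the CPU. Lower the number if the error persists.
podman exec -it ollama-intel-gpu bash -c \
  'OLLAMA_NUM_GPU=18 ./ollama run deepseek-v2 "hello deepseek"'
```

Fewer GPU layers means lower VRAM pressure at the cost of slower token generation, since more of the model runs on the CPU.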
I am facing a similar issue while running ollama with deepseek-coder-v2 16b and olmoe 7b; both are mixture-of-experts (MoE) code language models:
The number of work-items in each dimension of a work-group cannot exceed {512, 512, 512} for this device. Exception caught at file:/home/runner/_work/llm.cpp/llm.cpp/ollama-llama-cpp/ggml/src/ggml-sycl/ggml-sycl.cpp, line:4467
However, I am able to run deepseek-coder 33b fine on the same machine (Xe Graphics, 64 GB RAM). It seems like MoE models are yet to be verified; is there any plan to confirm this?
UPDATE: I tried ./flash-moe instead of ./llama-cli on these MoE models and no longer get the error message:
- olmoe works flawlessly on the iGPU
- deepseek-coder-v2 returns corrupted output on the iGPU, but works fine with -ngl 0 (CPU only)
Will llama.cpp and ollama support this in the future?
This is weird: I had another model that could use a lot of memory without problems, but with a MoE model (10b) I get this error.
Is this error still present in the latest version?
qwen3:30b-a3b-instruct-2507-q4_K_M has the same issue (it is MoE) and doesn't work. I really like this one; hopefully it gets fixed too. Thanks.
edit: setting OLLAMA_SET_OT="exps=CPU" before running ollama serve does fix the issue, but I'm not sure if it runs at full speed. Thanks.
Thanks, same issue on UHD Graphics 770, fixed with your solution. I have noticed that after setting this env var, most weights are loaded on the CPU, so it may not be running at full speed.
Setting OLLAMA_SET_OT="exps=CPU" offloads the MoE expert weights to the CPU. It reduces GPU VRAM requirements, but can hurt performance in some cases.
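For reference, a minimal sketch of the workaround as reported in this thread (the OLLAMA_SET_OT variable name and "exps=CPU" value are taken from the comments above; verify them against your ollama build's documentation):

```shell
# Keep MoE expert weight tensors on the CPU; only the shared/attention
# weights go to the iGPU, which avoids the {512, 512, 512} work-group error.
export OLLAMA_SET_OT="exps=CPU"
./ollama serve
```

The trade-off: expert layers dominate a MoE model's size, so most weights end up in system RAM and generation runs slower than a full GPU offload would.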