ollama + deepseek v2: The number of work-items in each dimension of a work-group cannot exceed {512, 512, 512} for this device
The number of work-items in each dimension of a work-group cannot exceed {512, 512, 512} for this device. Exception caught at file:/home/runner/_work/llm.cpp/llm.cpp/ollama-llama-cpp/ggml/src/ggml-sycl/ggml-sycl.cpp, line:4463
Using this container (https://github.com/mattcurf/ollama-intel-gpu), running on NixOS:
podman build -t "ollama-intel-gpu" .
podman run --rm -p 127.0.0.1:11434:11434 -v /home/stereomato/models:/mnt -v ollama-volume:/root/.ollama -e OLLAMA_NUM_PARALLEL=1 -e OLLAMA_MAX_LOADED_MODELS=1 -e OLLAMA_FLASH_ATTENTION=1 -e OLLAMA_NUM_GPU=999 -e DEVICE=iGPU --device /dev/dri --name=ollama-intel-gpu ollama-intel-gpu
podman exec -it ollama-intel-gpu bash
./ollama pull deepseek-v2:16b (the q4_k_m 16b quantization also exhibits the same issue)
./ollama run deepseek-v2 "hello deepseek"
Then, I get the error in the title/first two lines of this bug report.
HW: Intel i5-12500H, Intel Xe Graphics (Alder Lake), 24 GB of RAM, up-to-date NixOS
Never mind, this seems to be a memory limitation. Is there a way to work around it?
You can try tuning OLLAMA_NUM_GPU down from 999, e.g. OLLAMA_NUM_GPU=18. That puts 18 layers on the GPU and the remaining layers on the CPU.
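As a sketch, using the container and model names from the commands above (18 is just an example layer count; tune it for your VRAM):

```shell
# Offload only 18 transformer layers to the iGPU instead of all of them (999);
# the remaining layers run on the CPU. Lower the number if the error persists.
podman exec -it ollama-intel-gpu bash -c \
  'OLLAMA_NUM_GPU=18 ./ollama run deepseek-v2 "hello deepseek"'
```

Fewer GPU layers means lower VRAM pressure at the cost of slower token generation, since more of the model runs on the CPU.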
I am facing a similar issue while running ollama with deepseek-coder-v2 16b and olmoe 7b; both are mixture-of-experts (MoE) code language models:
The number of work-items in each dimension of a work-group cannot exceed {512, 512, 512} for this device. Exception caught at file:/home/runner/_work/llm.cpp/llm.cpp/ollama-llama-cpp/ggml/src/ggml-sycl/ggml-sycl.cpp, line:4467
However, I am able to run deepseek-coder 33b fine on the same machine (Xe Graphics, 64 GB RAM). It seems like MoE models are yet to be verified; is there any plan to confirm this?
UPDATE: I tried ./flash-moe instead of ./llama-cli on these MoE models and no longer get the error message:
- olmoe works flawlessly on the iGPU
- deepseek-coder-v2 returns corrupted output on the iGPU, but works fine with -ngl 0 (CPU only)
Will llama.cpp and ollama support this in the future?
This is weird: I had another model that could use a lot of memory without problems, but with a MoE model (10b) I get this error.
Is this error still present in the latest version?
qwen3:30b-a3b-instruct-2507-q4_K_M has the same issue (it is MoE) and doesn't work. I really like this one; hopefully it gets fixed too. Thanks.
edit: setting OLLAMA_SET_OT="exps=CPU" before running ollama serve does fix the issue, but I'm not sure if it runs at full speed. Thanks.
Thanks, same issue on UHD Graphics 770, fixed with your solution. I have noticed that after setting this env var, most weights are loaded on the CPU, so it may not be running at full speed.
Setting OLLAMA_SET_OT="exps=CPU" offloads the MoE expert weights to the CPU. It reduces GPU VRAM requirements, but can hurt performance in some cases.
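For reference, a minimal sketch of the workaround as reported in this thread (the OLLAMA_SET_OT variable name and "exps=CPU" value are taken from the comments above; verify them against your ollama build's documentation):

```shell
# Keep MoE expert weight tensors on the CPU; only the shared/attention
# weights go to the iGPU, which avoids the {512, 512, 512} work-group error.
export OLLAMA_SET_OT="exps=CPU"
./ollama serve
```

The trade-off: expert layers dominate a MoE model's size, so most weights end up in system RAM and generation runs slower than a full GPU offload would.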