ollama + deepseek v2: The number of work-items in each dimension of a work-group cannot exceed {512, 512, 512} for this device

ghost opened this issue 10 months ago · 8 comments

The number of work-items in each dimension of a work-group cannot exceed {512, 512, 512} for this device Exception caught at file:/home/runner/_work/llm.cpp/llm.cpp/ollama-llama-cpp/ggml/src/ggml-sycl/ggml-sycl.cpp, line:4463

Using this container, running on NixOS: https://github.com/mattcurf/ollama-intel-gpu

podman build -t "ollama-intel-gpu" .

podman run --rm -p 127.0.0.1:11434:11434 -v /home/stereomato/models:/mnt -v ollama-volume:/root/.ollama -e OLLAMA_NUM_PARALLEL=1 -e OLLAMA_MAX_LOADED_MODELS=1 -e OLLAMA_FLASH_ATTENTION=1 -e OLLAMA_NUM_GPU=999 -e DEVICE=iGPU --device /dev/dri --name=ollama-intel-gpu ollama-intel-gpu

podman exec -it ollama-intel-gpu bash

./ollama pull deepseek-v2:16b (the q4_k_m 16b variant exhibits the same issue)

./ollama run deepseek-v2 "hello deepseek"

Then, I get the error in the title/first two lines of this bug report.

HW: Intel i5-12500H, Intel Xe Graphics (Alder Lake), 24 GB of RAM, up-to-date NixOS

ghost avatar Feb 17 '25 18:02 ghost

nvm, this seems to be a memory limitation, derp. Is there a way to work around this?

ghost avatar Feb 17 '25 18:02 ghost

You can try tuning OLLAMA_NUM_GPU down from 999, e.g. OLLAMA_NUM_GPU=18. That puts 18 layers on the GPU and the remaining layers on the CPU.
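
For the container setup above, that would mean re-running with a lower value (same flags, volumes, and image as the original command; 18 is only a starting point, tune it for your VRAM):

podman run --rm -p 127.0.0.1:11434:11434 -v /home/stereomato/models:/mnt -v ollama-volume:/root/.ollama -e OLLAMA_NUM_PARALLEL=1 -e OLLAMA_MAX_LOADED_MODELS=1 -e OLLAMA_FLASH_ATTENTION=1 -e OLLAMA_NUM_GPU=18 -e DEVICE=iGPU --device /dev/dri --name=ollama-intel-gpu ollama-intel-gpu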

qiuxin2012 avatar Feb 18 '25 00:02 qiuxin2012

I am facing a similar issue while running ollama with deepseek-coder-v2 16b and olmoe 7b; both are mixture-of-experts (MoE) code language models.

The number of work-items in each dimension of a work-group cannot exceed {512, 512, 512} for this device Exception caught at file:/home/runner/_work/llm.cpp/llm.cpp/ollama-llama-cpp/ggml/src/ggml-sycl/ggml-sycl.cpp, line:4467

However, I am able to run deepseek-coder 33b fine on the same machine with Xe Graphics and 64GB RAM. It seems MoE support is yet to be verified; any plan to confirm it?

UPDATED: I tried ./flash-moe instead of ./llama-cli on these MoE models and am no longer getting the error message (a sketch invocation follows the list below):

  • olmoe works flawlessly with the iGPU
  • deepseek-coder-v2 returns corrupted output with the iGPU; fine if set to -ngl 0 (CPU only)
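
For reference, a sketch of the flash-moe invocation (the model path and prompt are hypothetical; -ngl 0 is the CPU-only setting confirmed above, and the other flags assume flash-moe mirrors llama-cli's interface):

./flash-moe -m /mnt/deepseek-coder-v2-16b-q4_k_m.gguf -p "hello" -ngl 0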

Will llama.cpp and ollama support this in the future?

ytliew82 avatar Mar 15 '25 15:03 ytliew82

This is weird: I had a model that was able to use a lot of memory, but with a MoE (10b) I get this error. Weird.

ghost avatar Jun 14 '25 16:06 ghost

Is this error still present in the latest version?

Ellie-Williams-007 avatar Jun 16 '25 01:06 Ellie-Williams-007

qwen3:30b-a3b-instruct-2507-q4_K_M has the same issue (it is MoE) and doesn't work. I really like this one; hopefully it gets fixed too. Thanks.

edit: using set OLLAMA_SET_OT="exps=CPU" before running ollama serve does fix the issue, but I'm not sure if it runs at full speed. Thanks.
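
Note: set VAR=value is Windows cmd syntax; in a Linux shell (e.g. inside the container above) the equivalent would be:

export OLLAMA_SET_OT="exps=CPU"
./ollama serve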

pepepaco avatar Aug 01 '25 04:08 pepepaco

Thanks, same issue on UHD Graphics 770, and fixed with your solution. I have noticed that after setting this env var, most weights are loaded on the CPU, so it is probably not running at full speed.

fumiama avatar Sep 04 '25 06:09 fumiama

Setting OLLAMA_SET_OT="exps=CPU" offloads the MoE expert weights to the CPU. It reduces GPU VRAM requirements but can hurt performance in some cases.
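
If a full expert offload is too slow, and assuming OLLAMA_SET_OT accepts llama.cpp-style tensor-override patterns (regex=backend; an assumption, not confirmed in this thread), a partial offload could keep some expert layers on the GPU:

export OLLAMA_SET_OT="(2[0-9]|[3-9][0-9])\.ffn_.*_exps\.=CPU"  # hypothetical pattern: offload only the expert tensors of layers 20-99 to CPU
./ollama serve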

qiyuangong avatar Sep 04 '25 07:09 qiyuangong