flash attention enabled but not supported by gpu
Describe the bug: See the title.
How to reproduce: Steps to reproduce the error:
- launch ipex-llm's ollama
- run a model (in this case, an unsloth DeepSeek build); see the sketch after these steps
- the message from the title shows up in the logs
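A minimal repro sketch, assuming OLLAMA_FLASH_ATTENTION is the variable being set and using a purely illustrative model tag (any GGUF model should do):
# enable flash attention before starting the ipex-llm ollama server
export OLLAMA_FLASH_ATTENTION=1
./ollama serve
# in a second shell, load a model (tag is hypothetical)
./ollama run deepseek-r1:8b
# the "flash attention enabled but not supported by gpu" line from the title then appears in the serve logs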
Environment information
-----------------------------------------------------------------
PYTHON_VERSION=3.11.13
-----------------------------------------------------------------
/usr/local/lib/python3.11/dist-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
transformers=4.36.2
-----------------------------------------------------------------
torch=2.2.0+cu121
-----------------------------------------------------------------
ipex-llm Version: 2.3.0b20250629
-----------------------------------------------------------------
IPEX is not installed.
-----------------------------------------------------------------
CPU Information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: GenuineIntel
Model name: 12th Gen Intel(R) Core(TM) i5-12500H
CPU family: 6
Model: 154
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 1
Stepping: 3
CPU max MHz: 4500.0000
CPU min MHz: 400.0000
BogoMIPS: 6220.80
-----------------------------------------------------------------
Total CPU Memory: 23.1682 GB
Memory Type: sudo: dmidecode: command not found
-----------------------------------------------------------------
Operating System:
Ubuntu 22.04.5 LTS \n \l
-----------------------------------------------------------------
Linux 05a8037651d0 6.15.3 #1-NixOS SMP PREEMPT_DYNAMIC Thu Jun 19 13:41:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------------
/mnt/env-check.sh: line 148: xpu-smi: command not found
-----------------------------------------------------------------
/mnt/env-check.sh: line 154: clinfo: command not found
-----------------------------------------------------------------
Driver related package version:
ii intel-level-zero-gpu 1.6.32224.5 amd64 Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii intel-level-zero-gpu-legacy1 1.3.30872.22 amd64 Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii level-zero-devel 1.20.2 amd64 oneAPI Level Zero
-----------------------------------------------------------------
igpu not detected
-----------------------------------------------------------------
xpu-smi is not installed. Please install xpu-smi according to README.md
Hmmm, weird. The iGPU does get used, though.
Hi @stereomato, I don't quite understand your problem. Could you please provide us with a detailed running log and the messages you mentioned?
I enabled OLLAMA_FLASH_ATTENTION because I want to use a quantized (q8) KV cache to speed things up, but I got that warning. The laptop has Iris Xe Graphics (Alder Lake).
Hi @stereomato, we do not support OLLAMA_FLASH_ATTENTION yet.
Hi @stereomato, if you want to use an fp8-quantized KV cache, you could try export IPEX_LLM_QUANTIZE_KV_CACHE=1 before ./ollama serve. It might work for models running with llamarunner.
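For example, a minimal sketch (the qwen3 tag just mirrors the example below; it assumes both commands run from the ipex-llm ollama directory):
# enable the fp8-quantized KV cache before starting the server
export IPEX_LLM_QUANTIZE_KV_CACHE=1
./ollama serve
# in a second shell, run a model handled by llamarunner, e.g.:
./ollama run qwen3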
How do I know if it works?
Take ./ollama run qwen3 for example; the original KV cache output looks like:
llama_kv_cache_unified: kv_size = 4096, type_k = 'f16', type_v = 'f16', n_layer = 36, can_shift = 1, padding = 32
llama_kv_cache_unified: SYCL0 KV buffer size = 576.00 MiB
llama_kv_cache_unified: KV self size = 576.00 MiB, K (f16): 288.00 MiB, V (f16): 288.00 MiB
If export IPEX_LLM_QUANTIZE_KV_CACHE=1 works, the output looks like:
llama_kv_cache_unified: kv_size = 4096, type_k = 'i8', type_v = 'i8', n_layer = 36, can_shift = 1, padding = 32
llama_kv_cache_unified: SYCL0 KV buffer size = 288.00 MiB
llama_kv_cache_unified: KV self size = 288.00 MiB, K (i8): 144.00 MiB, V (i8): 144.00 MiB
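If you don't want to scan the whole log, one quick way to check (assuming the runner's KV cache lines go to the terminal running ./ollama serve) is to filter for them:
# print only the KV cache lines from the serve logs
./ollama serve 2>&1 | grep llama_kv_cache_unified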
Alright, I do get that output, but I also get a warning saying "This model is not recommended to use quantize kv cache!" with unsloth Qwen3 (4B, q4_k_m) and unsloth DeepSeek-R1-0528 (8B, q4_k_m).
Yeah, I just took qwen3 as an example. Actually, for models like these with grouped-query attention, a quantized KV cache does not bring an obvious benefit.